CesnetDatabase

cesnet_tszoo.datasets.cesnet_database.CesnetDatabase

Bases: ABC

Base class for CESNET databases. This class should not be used directly. Use it as a base class when adding new databases.

Derived databases are used by calling the class method get_dataset, which creates a new dataset instance of SeriesBasedCesnetDataset or TimeBasedCesnetDataset. See those classes for details on how to use them.

Intended usage:

When using TimeBasedCesnetDataset (is_series_based = False):

  1. Create a dataset instance with the desired data root by calling get_dataset. This downloads the dataset if it has not been downloaded before and returns the dataset instance.
  2. Create an instance of TimeBasedConfig and apply it with set_dataset_config_and_initialize. This initializes the dataset, including data splitting (train/validation/test/test_other), fitting scalers (if needed), selecting features, and more. The result is cached for later use.
  3. Use get_train_dataloader/get_train_df/get_train_numpy to get training data for the chosen model.
  4. Validate the model and perform hyperparameter optimization on get_val_dataloader/get_val_df/get_val_numpy.
  5. Evaluate the model on get_test_dataloader/get_test_df/get_test_numpy.
  6. (Optional) Evaluate the model on get_test_other_dataloader/get_test_other_df/get_test_other_numpy.

When using SeriesBasedCesnetDataset (is_series_based = True):

  1. Create a dataset instance with the desired data root by calling get_dataset. This downloads the dataset if it has not been downloaded before and returns the dataset instance.
  2. Create an instance of SeriesBasedConfig and apply it with set_dataset_config_and_initialize. This initializes the dataset, including data splitting (train/validation/test), fitting scalers (if needed), selecting features, and more. The result is cached for later use.
  3. Use get_train_dataloader/get_train_df/get_train_numpy to get training data for the chosen model.
  4. Validate the model and perform hyperparameter optimization on get_val_dataloader/get_val_df/get_val_numpy.
  5. Evaluate the model on get_test_dataloader/get_test_df/get_test_numpy.
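
Both workflows follow the same call order, which can be sketched as a small helper. This is a minimal sketch rather than library code: `prepare_splits` is a hypothetical name, the `_Stub*` classes are stand-ins so the block runs without downloading anything, and calling the dataloader getters without arguments is an assumption. Only the method names (`get_dataset`, `set_dataset_config_and_initialize`, `get_*_dataloader`) come from the steps above.

```python
from typing import Any, Dict


def prepare_splits(database_cls: Any, config: Any, data_root: str,
                   source_type: str, aggregation: str,
                   is_series_based: bool) -> Dict[str, Any]:
    """Run steps 1-5 and return one dataloader per split (hypothetical helper)."""
    # Step 1: open the dataset (the real library downloads it on first use).
    dataset = database_cls.get_dataset(
        data_root=data_root,
        source_type=source_type,
        aggregation=aggregation,
        is_series_based=is_series_based,
    )
    # Step 2: apply the config; splitting, scaling and feature selection happen here.
    dataset.set_dataset_config_and_initialize(config)
    # Steps 3-5: fetch one loader per split (no-argument calls are an assumption).
    return {
        "train": dataset.get_train_dataloader(),
        "val": dataset.get_val_dataloader(),
        "test": dataset.get_test_dataloader(),
    }


class _StubDataset:
    """Stand-in for a dataset instance; NOT part of cesnet_tszoo."""

    def __init__(self) -> None:
        self.config = None

    def set_dataset_config_and_initialize(self, config: Any) -> None:
        self.config = config

    def get_train_dataloader(self) -> list:
        return ["train-batch"]

    def get_val_dataloader(self) -> list:
        return ["val-batch"]

    def get_test_dataloader(self) -> list:
        return ["test-batch"]


class _StubDatabase:
    """Stand-in for a class derived from CesnetDatabase; NOT part of cesnet_tszoo."""

    @classmethod
    def get_dataset(cls, data_root: str, source_type: str, aggregation: str,
                    is_series_based: bool) -> _StubDataset:
        return _StubDataset()


splits = prepare_splits(_StubDatabase, config={"placeholder": True},
                        data_root="data", source_type="example_source",
                        aggregation="example_aggregation", is_series_based=False)
```

With the real library, `database_cls` would be a class derived from CesnetDatabase and `config` a TimeBasedConfig or SeriesBasedConfig instance matching `is_series_based`.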

Used class attributes:

Attributes:

  name (str): Name of the database.
  bucket_url (str): URL of the bucket where the dataset is stored.
  tszoo_root (str): Path to folder where all databases are saved. Set after get_dataset was called at least once.
  database_root (str): Path to the folder where datasets belonging to the database are saved. Set after get_dataset was called at least once.
  configs_root (str): Path to the folder where configurations are saved. Set after get_dataset was called at least once.
  benchmarks_root (str): Path to the folder where benchmarks are saved. Set after get_dataset was called at least once.
  annotations_root (str): Path to the folder where annotations are saved. Set after get_dataset was called at least once.
  id_names (dict): Names for time series IDs for each source_type.
  default_values (dict): Default values for each available feature.
  source_types (list[SourceType]): Available source types for the database.
  aggregations (list[AgreggationType]): Available aggregations for the database.
  additional_data (dict[str, tuple]): Available small datasets for each dataset.
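
The five `*_root` attributes are all derived from the `data_root` passed to get_dataset. The following is a minimal sketch of that folder layout, mirroring the path setup in the source listing of get_dataset; `layout_paths` and the example database and dataset names are hypothetical.

```python
import os
from typing import Dict


def layout_paths(data_root: str, database_name: str, dataset_name: str) -> Dict[str, str]:
    """Compute the folder layout under data_root (hypothetical helper)."""
    # Everything lives under <data_root>/tszoo, with "~" expanded and the path normalized.
    tszoo_root = os.path.normpath(os.path.expanduser(os.path.join(data_root, "tszoo")))
    # Each database gets its own subfolder under "databases".
    database_root = os.path.join(tszoo_root, "databases", database_name)
    return {
        "tszoo_root": tszoo_root,
        "database_root": database_root,
        # Configs, benchmarks and annotations sit directly under tszoo_root.
        "configs_root": os.path.join(tszoo_root, "configs"),
        "benchmarks_root": os.path.join(tszoo_root, "benchmarks"),
        "annotations_root": os.path.join(tszoo_root, "annotations"),
        # The dataset itself is a single .h5 file inside the database folder.
        "dataset_path": os.path.join(database_root, f"{dataset_name}.h5"),
    }


# Example names below are placeholders, not real database or dataset names.
paths = layout_paths("data", "ExampleDB", "ExampleDB-source-aggregation")
```

Because the roots are stored as class attributes, they reflect the `data_root` of the most recent get_dataset call on that class.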

Source code in cesnet_tszoo\datasets\cesnet_database.py
class CesnetDatabase(ABC):
    """
    Base class for cesnet databases. This class should **not** be used directly. Use it as base for adding new databases.

    Derived databases are used by calling class method [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset] which will create a new dataset instance of [`SeriesBasedCesnetDataset`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset] 
    or [`TimeBasedCesnetDataset`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset]. Check them for more info about how to use them.

    **Intended usage:**

    When using [`TimeBasedCesnetDataset`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset] (`is_series_based` = `False`):

    1. Create an instance of the dataset with the desired data root by calling [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset]. This will download the dataset if it has not been previously downloaded and return instance of dataset.
    2. Create an instance of [`TimeBasedConfig`][cesnet_tszoo.configs.time_based_config.TimeBasedConfig] and set it using [`set_dataset_config_and_initialize`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.set_dataset_config_and_initialize]. 
       This initializes the dataset, including data splitting (train/validation/test/test_other), fitting scalers (if needed), selecting features, and more. This is cached for later use.
    3. Use [`get_train_dataloader`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_train_dataloader]/[`get_train_df`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_train_df]/[`get_train_numpy`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_train_numpy] to get training data for chosen model.
    4. Validate the model and perform the hyperparameter optimization on [`get_val_dataloader`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_val_dataloader]/[`get_val_df`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_val_df]/[`get_val_numpy`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_val_numpy].
    5. Evaluate the model on [`get_test_dataloader`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_test_dataloader]/[`get_test_df`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_test_df]/[`get_test_numpy`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_test_numpy]. 
    6. (Optional) Evaluate the model on [`get_test_other_dataloader`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_test_other_dataloader]/[`get_test_other_df`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_test_other_df]/[`get_test_other_numpy`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_test_other_numpy].    

    When using [`SeriesBasedCesnetDataset`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset] (`is_series_based` = `True`):

    1. Create an instance of the dataset with the desired data root by calling [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset]. This will download the dataset if it has not been previously downloaded and return instance of dataset.
    2. Create an instance of [`SeriesBasedConfig`][cesnet_tszoo.configs.series_based_config.SeriesBasedConfig] and set it using [`set_dataset_config_and_initialize`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.set_dataset_config_and_initialize]. 
       This initializes the dataset, including data splitting (train/validation/test), fitting scalers (if needed), selecting features, and more. This is cached for later use.
    3. Use [`get_train_dataloader`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_train_dataloader]/[`get_train_df`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_train_df]/[`get_train_numpy`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_train_numpy] to get training data for chosen model.
    4. Validate the model and perform the hyperparameter optimization on [`get_val_dataloader`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_val_dataloader]/[`get_val_df`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_val_df]/[`get_val_numpy`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_val_numpy].
    5. Evaluate the model on [`get_test_dataloader`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_test_dataloader]/[`get_test_df`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_test_df]/[`get_test_numpy`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_test_numpy].   

    Used class attributes:

    Attributes:
        name: Name of the database.
        bucket_url: URL of the bucket where the dataset is stored.
        tszoo_root: Path to folder where all databases are saved. Set after [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset] was called at least once.
        database_root: Path to the folder where datasets belonging to the database are saved. Set after [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset] was called at least once.
        configs_root: Path to the folder where configurations are saved. Set after [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset] was called at least once.
        benchmarks_root: Path to the folder where benchmarks are saved. Set after [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset] was called at least once.
        annotations_root: Path to the folder where annotations are saved. Set after [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset] was called at least once.
        id_names: Names for time series IDs for each `source_type`.
        default_values: Default values for each available feature.
        source_types: Available source types for the database.
        aggregations: Available aggregations for the database.   
        additional_data: Available small datasets for each dataset. 
    """

    name: str
    bucket_url: str

    tszoo_root: str
    database_root: str
    configs_root: str
    benchmarks_root: str
    annotations_root: str

    id_names: dict = None
    default_values: dict = None
    source_types: list[SourceType] = []
    aggregations: list[AgreggationType] = []
    additional_data: dict[str, tuple] = {}

    def __init__(self):
        raise ValueError("To create dataset instance use class method 'get_dataset' instead.")

    @classmethod
    def get_dataset(cls, data_root: str, source_type: SourceType | str, aggregation: AgreggationType | str, is_series_based: bool, check_errors: bool = False, display_details: bool = False) -> TimeBasedCesnetDataset | SeriesBasedCesnetDataset:
        """
        Create new dataset instance.

        Parameters:
            data_root: Path to the folder where the dataset will be stored. Each database has its own subfolder `data_root/tszoo/databases/database_name/`.
            source_type: The source type of the desired dataset.
            aggregation: The aggregation type for the selected source type.
            is_series_based: Whether you want to create series-based dataset or time-based dataset.
            check_errors: Whether to validate if the dataset is corrupted. `Default: False`
            display_details: Whether to display details about the available data in chosen dataset. `Default: False`

        Returns:
            TimeBasedCesnetDataset or SeriesBasedCesnetDataset.
        """

        logger = logging.getLogger("wrapper_dataset")

        source_type = SourceType(source_type)
        aggregation = AgreggationType(aggregation)

        if source_type not in cls.source_types:
            raise ValueError(f"Unsupported source type: {source_type}")

        if aggregation not in cls.aggregations:
            raise ValueError(f"Unsupported aggregation type: {aggregation}")

        # Dataset paths setup
        cls.tszoo_root = os.path.normpath(os.path.expanduser(os.path.join(data_root, "tszoo")))
        cls.database_root = os.path.join(cls.tszoo_root, "databases", cls.name)
        cls.configs_root = os.path.join(cls.tszoo_root, "configs")
        cls.benchmarks_root = os.path.join(cls.tszoo_root, "benchmarks")
        cls.annotations_root = os.path.join(cls.tszoo_root, "annotations")
        dataset_name = f"{cls.name}-{source_type.value}-{AgreggationType._to_str_without_number(aggregation)}"
        dataset_path = os.path.join(cls.database_root, f"{dataset_name}.h5")

        # Ensure necessary directories exist
        for directory in [cls.database_root, cls.configs_root, cls.annotations_root, cls.benchmarks_root]:
            if not os.path.exists(directory):
                logger.info("Creating directory: %s", directory)
                os.makedirs(directory)

        if not cls._is_downloaded(dataset_path):
            cls._download(dataset_name, dataset_path)

        if is_series_based:
            dataset = SeriesBasedCesnetDataset(cls.name, dataset_path, cls.configs_root, cls.benchmarks_root, cls.annotations_root, source_type, aggregation, cls.id_names[source_type], cls.default_values, cls.additional_data)
        else:
            dataset = TimeBasedCesnetDataset(cls.name, dataset_path, cls.configs_root, cls.benchmarks_root, cls.annotations_root, source_type, aggregation, cls.id_names[source_type], cls.default_values, cls.additional_data)

        if check_errors:
            dataset.check_errors()

        if display_details:
            dataset.display_dataset_details()

        if is_series_based:
            logger.info("Dataset is series-based. Use cesnet_tszoo.configs.SeriesBasedConfig")
        else:
            logger.info("Dataset is time-based. Use cesnet_tszoo.configs.TimeBasedConfig")

        return dataset

    @classmethod
    def _is_downloaded(cls, dataset_path: str) -> bool:
        """Check whether the dataset at path has already been downloaded. """

        return os.path.exists(dataset_path)

    @classmethod
    def _download(cls, dataset_name: str, dataset_path: str) -> None:
        """Download the dataset file. """

        logger = logging.getLogger("wrapper_dataset")

        logger.info("Downloading %s dataset.", dataset_name)
        database_url = f"{cls.bucket_url}&file={dataset_name}.h5"
        resumable_download(url=database_url, file_path=dataset_path, silent=False)
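
The download step in the listing above runs only when the `.h5` file is missing, and it builds the URL by appending a `file` query parameter to `bucket_url`. The following is a minimal sketch of that check; `ensure_downloaded`, the injected `download` callable, and the example URL are hypothetical, and the callable stands in for the library's `resumable_download`.

```python
import os
import tempfile
from typing import Callable, List


def ensure_downloaded(bucket_url: str, dataset_name: str, dataset_path: str,
                      download: Callable[[str, str], None]) -> bool:
    """Download the dataset file if it is missing; return True if a download ran."""
    # Same check as _is_downloaded: only the existence of the file is tested.
    if os.path.exists(dataset_path):
        return False
    # Same URL shape as _download: the file name is appended as a query parameter.
    url = f"{bucket_url}&file={dataset_name}.h5"
    download(url, dataset_path)
    return True


calls: List[str] = []


def fake_download(url: str, file_path: str) -> None:
    """Stand-in downloader: records the URL and creates an empty file."""
    calls.append(url)
    with open(file_path, "wb"):
        pass


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.h5")
    # Placeholder bucket URL; the real value comes from the derived database class.
    first = ensure_downloaded("https://example.invalid/bucket?id=1", "example", target, fake_download)
    second = ensure_downloaded("https://example.invalid/bucket?id=1", "example", target, fake_download)
```

In this sketch the second call is skipped because the file already exists. An existence check alone cannot tell a complete file from a damaged one, which is what the `check_errors` parameter of get_dataset is documented to validate.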

get_dataset classmethod

get_dataset(data_root: str, source_type: SourceType | str, aggregation: AgreggationType | str, is_series_based: bool, check_errors: bool = False, display_details: bool = False) -> TimeBasedCesnetDataset | SeriesBasedCesnetDataset

Create new dataset instance.

Parameters:

  data_root (str, required): Path to the folder where the dataset will be stored. Each database has its own subfolder data_root/tszoo/databases/database_name/.
  source_type (SourceType | str, required): The source type of the desired dataset.
  aggregation (AgreggationType | str, required): The aggregation type for the selected source type.
  is_series_based (bool, required): Whether to create a series-based dataset (True) or a time-based dataset (False).
  check_errors (bool, default: False): Whether to validate if the dataset is corrupted.
  display_details (bool, default: False): Whether to display details about the available data in the chosen dataset.

Returns:

  TimeBasedCesnetDataset | SeriesBasedCesnetDataset: The created dataset. A SeriesBasedCesnetDataset when is_series_based is True, otherwise a TimeBasedCesnetDataset.
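
Because `source_type` and `aggregation` accept either an enum member or its string value, get_dataset first coerces them to the enum and then checks them against what the database supports. The following is a minimal sketch of that pattern, including the base class refusing direct instantiation. `Source`, `ToyDatabase`, and `AlphaOnly` are hypothetical stand-ins, not library classes, and the toy method returns a string instead of a dataset object.

```python
from __future__ import annotations

from abc import ABC
from enum import Enum


class Source(Enum):
    """Hypothetical stand-in for SourceType."""
    ALPHA = "alpha"
    BETA = "beta"


class ToyDatabase(ABC):
    """Hypothetical stand-in for CesnetDatabase."""
    source_types: list[Source] = []

    def __init__(self):
        # Direct instantiation is blocked; callers must go through get_dataset.
        raise ValueError("To create dataset instance use class method 'get_dataset' instead.")

    @classmethod
    def get_dataset(cls, source_type: Source | str) -> str:
        # Enum lookup accepts a member or its string value; unknown values raise ValueError.
        source_type = Source(source_type)
        # A valid enum value can still be unsupported by this particular database.
        if source_type not in cls.source_types:
            raise ValueError(f"Unsupported source type: {source_type}")
        return f"dataset for {source_type.value}"


class AlphaOnly(ToyDatabase):
    """Derived database that supports a single source type."""
    source_types = [Source.ALPHA]


from_string = AlphaOnly.get_dataset("alpha")
from_member = AlphaOnly.get_dataset(Source.ALPHA)
```

Both calls resolve to the same enum member, so passing `"alpha"` or `Source.ALPHA` is equivalent here, while `AlphaOnly.get_dataset("beta")` and `AlphaOnly()` each raise `ValueError`.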

Source code in cesnet_tszoo\datasets\cesnet_database.py
@classmethod
def get_dataset(cls, data_root: str, source_type: SourceType | str, aggregation: AgreggationType | str, is_series_based: bool, check_errors: bool = False, display_details: bool = False) -> TimeBasedCesnetDataset | SeriesBasedCesnetDataset:
    """
    Create new dataset instance.

    Parameters:
        data_root: Path to the folder where the dataset will be stored. Each database has its own subfolder `data_root/tszoo/databases/database_name/`.
        source_type: The source type of the desired dataset.
        aggregation: The aggregation type for the selected source type.
        is_series_based: Whether you want to create series-based dataset or time-based dataset.
        check_errors: Whether to validate if the dataset is corrupted. `Default: False`
        display_details: Whether to display details about the available data in chosen dataset. `Default: False`

    Returns:
        TimeBasedCesnetDataset or SeriesBasedCesnetDataset.
    """

    logger = logging.getLogger("wrapper_dataset")

    source_type = SourceType(source_type)
    aggregation = AgreggationType(aggregation)

    if source_type not in cls.source_types:
        raise ValueError(f"Unsupported source type: {source_type}")

    if aggregation not in cls.aggregations:
        raise ValueError(f"Unsupported aggregation type: {aggregation}")

    # Dataset paths setup
    cls.tszoo_root = os.path.normpath(os.path.expanduser(os.path.join(data_root, "tszoo")))
    cls.database_root = os.path.join(cls.tszoo_root, "databases", cls.name)
    cls.configs_root = os.path.join(cls.tszoo_root, "configs")
    cls.benchmarks_root = os.path.join(cls.tszoo_root, "benchmarks")
    cls.annotations_root = os.path.join(cls.tszoo_root, "annotations")
    dataset_name = f"{cls.name}-{source_type.value}-{AgreggationType._to_str_without_number(aggregation)}"
    dataset_path = os.path.join(cls.database_root, f"{dataset_name}.h5")

    # Ensure necessary directories exist
    for directory in [cls.database_root, cls.configs_root, cls.annotations_root, cls.benchmarks_root]:
        if not os.path.exists(directory):
            logger.info("Creating directory: %s", directory)
            os.makedirs(directory)

    if not cls._is_downloaded(dataset_path):
        cls._download(dataset_name, dataset_path)

    if is_series_based:
        dataset = SeriesBasedCesnetDataset(cls.name, dataset_path, cls.configs_root, cls.benchmarks_root, cls.annotations_root, source_type, aggregation, cls.id_names[source_type], cls.default_values, cls.additional_data)
    else:
        dataset = TimeBasedCesnetDataset(cls.name, dataset_path, cls.configs_root, cls.benchmarks_root, cls.annotations_root, source_type, aggregation, cls.id_names[source_type], cls.default_values, cls.additional_data)

    if check_errors:
        dataset.check_errors()

    if display_details:
        dataset.display_dataset_details()

    if is_series_based:
        logger.info("Dataset is series-based. Use cesnet_tszoo.configs.SeriesBasedConfig")
    else:
        logger.info("Dataset is time-based. Use cesnet_tszoo.configs.TimeBasedConfig")

    return dataset