Benchmark

cesnet_tszoo.benchmarks

Benchmark

Wraps an imported dataset together with its config, annotations and related_results.

Intended usage:

For time-based:

When using TimeBasedCesnetDataset (dataset_type = DatasetType.TIME_BASED):

1. Create an instance of the dataset with the desired data root by calling get_dataset. This downloads the dataset if it has not been downloaded before and returns an instance of the dataset.
2. Create an instance of TimeBasedConfig and set it using set_dataset_config_and_initialize. This initializes the dataset, including data splitting (train/validation/test), fitting transformers (if needed), selecting features, and more. This is cached for later use.
3. Use get_train_dataloader/get_train_df/get_train_numpy to get training data for the chosen model.
4. Validate the model and perform hyperparameter optimization on get_val_dataloader/get_val_df/get_val_numpy.
5. Evaluate the model on get_test_dataloader/get_test_df/get_test_numpy.

When using SeriesBasedCesnetDataset (dataset_type = DatasetType.SERIES_BASED):

1. Create an instance of the dataset with the desired data root by calling get_dataset. This downloads the dataset if it has not been downloaded before and returns an instance of the dataset.
2. Create an instance of SeriesBasedConfig and set it using set_dataset_config_and_initialize. This initializes the dataset, including data splitting (train/validation/test), fitting transformers (if needed), selecting features, and more. This is cached for later use.
3. Use get_train_dataloader/get_train_df/get_train_numpy to get training data for the chosen model.
4. Validate the model and perform hyperparameter optimization on get_val_dataloader/get_val_df/get_val_numpy.
5. Evaluate the model on get_test_dataloader/get_test_df/get_test_numpy.

When using DisjointTimeBasedCesnetDataset (dataset_type = DatasetType.DISJOINT_TIME_BASED):

1. Create an instance of the dataset with the desired data root by calling get_dataset. This downloads the dataset if it has not been downloaded before and returns an instance of the dataset.
2. Create an instance of DisjointTimeBasedConfig and set it using set_dataset_config_and_initialize. This initializes the dataset, including data splitting (train/validation/test), fitting transformers (if needed), selecting features, and more. This is cached for later use.
3. Use get_train_dataloader/get_train_df/get_train_numpy to get training data for the chosen model.
4. Validate the model and perform hyperparameter optimization on get_val_dataloader/get_val_df/get_val_numpy.
5. Evaluate the model on get_test_dataloader/get_test_df/get_test_numpy.

You can create custom benchmarks by calling save_benchmark on a time-based, series-based or disjoint-time-based dataset. They are saved to the "data_root"/tszoo/benchmarks/ directory, where data_root is the value set when you created the dataset instance.
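The train/validation/test workflow described above can be sketched as follows. This is a minimal illustration, not the library's own code: `StubDataset`, `StubBenchmark` and `collect_splits` are hypothetical names introduced here so the snippet runs without downloading any data, and the stub getters return plain lists instead of DataFrames.

```python
from dataclasses import dataclass, field


@dataclass
class StubDataset:
    """Hypothetical stand-in for an initialized dataset; records getter calls."""
    calls: list = field(default_factory=list)

    def get_train_df(self):
        self.calls.append("train")
        return [1.0, 2.0, 3.0]

    def get_val_df(self):
        self.calls.append("val")
        return [4.0]

    def get_test_df(self):
        self.calls.append("test")
        return [5.0]


@dataclass
class StubBenchmark:
    """Hypothetical stand-in exposing get_initialized_dataset like Benchmark."""
    dataset: StubDataset = field(default_factory=StubDataset)

    def get_initialized_dataset(self, display_config_details=True,
                                check_errors=False, workers="config"):
        return self.dataset


def collect_splits(benchmark):
    """Initialize the benchmark's dataset and fetch the three splits in order."""
    dataset = benchmark.get_initialized_dataset(display_config_details=False)
    return {
        "train": dataset.get_train_df(),  # fit the model on this split
        "val": dataset.get_val_df(),      # tune hyperparameters on this split
        "test": dataset.get_test_df(),    # report final metrics on this split
    }


stub = StubBenchmark()
splits = collect_splits(stub)
```

With the real library, the object passed to `collect_splits` would be the `Benchmark` returned by `load_benchmark` rather than a stub.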

Source code in cesnet_tszoo\benchmarks.py

class Benchmark:
    """
    Wraps an imported `dataset` together with its `config`, `annotations` and `related_results`.

    **Intended usage:**

    For time-based:

    When using [`TimeBasedCesnetDataset`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset] (`dataset_type` = `DatasetType.TIME_BASED`):

    1. Create an instance of the dataset with the desired data root by calling [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset]. This downloads the dataset if it has not been downloaded before and returns an instance of the dataset.
    2. Create an instance of [`TimeBasedConfig`][cesnet_tszoo.configs.time_based_config.TimeBasedConfig] and set it using [`set_dataset_config_and_initialize`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.set_dataset_config_and_initialize]. 
       This initializes the dataset, including data splitting (train/validation/test), fitting transformers (if needed), selecting features, and more. This is cached for later use.
    3. Use [`get_train_dataloader`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_train_dataloader]/[`get_train_df`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_train_df]/[`get_train_numpy`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_train_numpy] to get training data for the chosen model.
    4. Validate the model and perform hyperparameter optimization on [`get_val_dataloader`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_val_dataloader]/[`get_val_df`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_val_df]/[`get_val_numpy`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_val_numpy].
    5. Evaluate the model on [`get_test_dataloader`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_test_dataloader]/[`get_test_df`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_test_df]/[`get_test_numpy`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.get_test_numpy].     

    When using [`SeriesBasedCesnetDataset`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset] (`dataset_type` = `DatasetType.SERIES_BASED`):

    1. Create an instance of the dataset with the desired data root by calling [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset]. This downloads the dataset if it has not been downloaded before and returns an instance of the dataset.
    2. Create an instance of [`SeriesBasedConfig`][cesnet_tszoo.configs.series_based_config.SeriesBasedConfig] and set it using [`set_dataset_config_and_initialize`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.set_dataset_config_and_initialize]. 
       This initializes the dataset, including data splitting (train/validation/test), fitting transformers (if needed), selecting features, and more. This is cached for later use.
    3. Use [`get_train_dataloader`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_train_dataloader]/[`get_train_df`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_train_df]/[`get_train_numpy`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_train_numpy] to get training data for the chosen model.
    4. Validate the model and perform hyperparameter optimization on [`get_val_dataloader`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_val_dataloader]/[`get_val_df`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_val_df]/[`get_val_numpy`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_val_numpy].
    5. Evaluate the model on [`get_test_dataloader`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_test_dataloader]/[`get_test_df`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_test_df]/[`get_test_numpy`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.get_test_numpy].   

    When using [`DisjointTimeBasedCesnetDataset`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset] (`dataset_type` = `DatasetType.DISJOINT_TIME_BASED`):

    1. Create an instance of the dataset with the desired data root by calling [`get_dataset`][cesnet_tszoo.datasets.cesnet_database.CesnetDatabase.get_dataset]. This downloads the dataset if it has not been downloaded before and returns an instance of the dataset.
    2. Create an instance of [`DisjointTimeBasedConfig`][cesnet_tszoo.configs.disjoint_time_based_config.DisjointTimeBasedConfig] and set it using [`set_dataset_config_and_initialize`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.set_dataset_config_and_initialize]. 
       This initializes the dataset, including data splitting (train/validation/test), fitting transformers (if needed), selecting features, and more. This is cached for later use.
    3. Use [`get_train_dataloader`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.get_train_dataloader]/[`get_train_df`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.get_train_df]/[`get_train_numpy`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.get_train_numpy] to get training data for the chosen model.
    4. Validate the model and perform hyperparameter optimization on [`get_val_dataloader`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.get_val_dataloader]/[`get_val_df`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.get_val_df]/[`get_val_numpy`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.get_val_numpy].
    5. Evaluate the model on [`get_test_dataloader`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.get_test_dataloader]/[`get_test_df`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.get_test_df]/[`get_test_numpy`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.get_test_numpy].      

    You can create custom time-based benchmarks with [`save_benchmark`][cesnet_tszoo.datasets.time_based_cesnet_dataset.TimeBasedCesnetDataset.save_benchmark], series-based benchmarks with [`save_benchmark`][cesnet_tszoo.datasets.series_based_cesnet_dataset.SeriesBasedCesnetDataset.save_benchmark] or disjoint-time-based with [`save_benchmark`][cesnet_tszoo.datasets.disjoint_time_based_cesnet_dataset.DisjointTimeBasedCesnetDataset.save_benchmark].
    They are saved to the `"data_root"/tszoo/benchmarks/` directory, where `data_root` is the value set when you created the dataset instance.
    """

    def __init__(self, config: DatasetConfig, dataset: CesnetDataset, description: str = None):
        self.config = config
        self.dataset = dataset
        self.description = description
        self.related_results = None
        self.logger = logging.getLogger("benchmark")

    def get_config(self) -> SeriesBasedConfig | TimeBasedConfig | DisjointTimeBasedConfig:
        """Returns the config created for this benchmark."""

        return self.config

    def get_initialized_dataset(self, display_config_details: bool = True, check_errors: bool = False, workers: Literal["config"] | int = "config") -> TimeBasedCesnetDataset | SeriesBasedCesnetDataset | DisjointTimeBasedCesnetDataset:
        """
        Returns the dataset with initialized sets, transformers, fillers, etc.

        This method uses the following config attributes:

        | Dataset config                         | Description                                                                                   |
        | -------------------------------------- | --------------------------------------------------------------------------------------------- |
        | `init_workers`                         | Specifies the number of workers to use for initialization. Applied when `workers` = "config". |
        | `partial_fit_initialized_transformers` | Determines whether initialized transformers should be partially fitted on the training data.  |
        | `nan_threshold`                        | Filters out time series with missing values exceeding the specified threshold.                |

        Parameters:
            display_config_details: Whether to display the configuration values after initialization. `Default: True`
            check_errors: Whether to check the dataset for corruption. `Default: False`
            workers: The number of workers to use during initialization. `Default: "config"`

        Returns:
            The initialized dataset.
        """

        if check_errors:
            self.dataset.check_errors()

        self.dataset.set_dataset_config_and_initialize(self.config, display_config_details, workers)

        return self.dataset

    def get_dataset(self, check_errors: bool = False) -> TimeBasedCesnetDataset | SeriesBasedCesnetDataset | DisjointTimeBasedCesnetDataset:
        """Returns the dataset without initializing it.

        Parameters:
            check_errors: Whether to check the dataset for corruption. `Default: False`

        Returns:
            The dataset used for this benchmark.
        """

        if check_errors:
            self.dataset.check_errors()

        return self.dataset

    def get_annotations(self, on: AnnotationType | Literal["id_time", "ts_id", "both"]) -> pd.DataFrame:
        """ 
        Returns the annotations as a Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

        Parameters:
            on: Specifies which annotations to return. If set to `"both"`, annotations will be applied as if `id_time` and `ts_id` were both set.         

        Returns:
            A Pandas DataFrame containing the selected annotations.      
        """

        return self.dataset.get_annotations(on)

    def get_related_results(self) -> pd.DataFrame | None:
        """
        Returns the related results as a Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), if they exist. 

        Returns:
            A Pandas DataFrame containing related results, or `None` if no related results exist.
        """

        return self.related_results

get_annotations

get_annotations(on: AnnotationType | Literal['id_time', 'ts_id', 'both']) -> pd.DataFrame

Returns the annotations as a Pandas DataFrame.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `on` | `AnnotationType \| Literal['id_time', 'ts_id', 'both']` | Specifies which annotations to return. If set to `"both"`, annotations will be applied as if `id_time` and `ts_id` were both set. | required |

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | A Pandas DataFrame containing the selected annotations. |

Source code in cesnet_tszoo\benchmarks.py

def get_annotations(self, on: AnnotationType | Literal["id_time", "ts_id", "both"]) -> pd.DataFrame:
    """ 
    Returns the annotations as a Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

    Parameters:
        on: Specifies which annotations to return. If set to `"both"`, annotations will be applied as if `id_time` and `ts_id` were both set.         

    Returns:
        A Pandas DataFrame containing the selected annotations.      
    """

    return self.dataset.get_annotations(on)

get_config

get_config() -> SeriesBasedConfig | TimeBasedConfig | DisjointTimeBasedConfig

Returns the config created for this benchmark.

Source code in cesnet_tszoo\benchmarks.py

def get_config(self) -> SeriesBasedConfig | TimeBasedConfig | DisjointTimeBasedConfig:
    """Returns the config created for this benchmark."""

    return self.config

get_dataset

get_dataset(check_errors: bool = False) -> TimeBasedCesnetDataset | SeriesBasedCesnetDataset | DisjointTimeBasedCesnetDataset

Returns the dataset without initializing it.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `check_errors` | `bool` | Whether to check the dataset for corruption. `Default: False` | `False` |

Returns:

| Type | Description |
| --- | --- |
| `TimeBasedCesnetDataset \| SeriesBasedCesnetDataset \| DisjointTimeBasedCesnetDataset` | The dataset used for this benchmark. |

Source code in cesnet_tszoo\benchmarks.py

def get_dataset(self, check_errors: bool = False) -> TimeBasedCesnetDataset | SeriesBasedCesnetDataset | DisjointTimeBasedCesnetDataset:
    """Returns the dataset without initializing it.

    Parameters:
        check_errors: Whether to check the dataset for corruption. `Default: False`

    Returns:
        The dataset used for this benchmark.
    """

    if check_errors:
        self.dataset.check_errors()

    return self.dataset

get_initialized_dataset

get_initialized_dataset(display_config_details: bool = True, check_errors: bool = False, workers: Literal['config'] | int = 'config') -> TimeBasedCesnetDataset | SeriesBasedCesnetDataset | DisjointTimeBasedCesnetDataset

Returns the dataset with initialized sets, transformers, fillers, etc.

This method uses the following config attributes:

| Dataset config | Description |
| --- | --- |
| `init_workers` | Specifies the number of workers to use for initialization. Applied when `workers` = "config". |
| `partial_fit_initialized_transformers` | Determines whether initialized transformers should be partially fitted on the training data. |
| `nan_threshold` | Filters out time series with missing values exceeding the specified threshold. |

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `display_config_details` | `bool` | Whether to display the configuration values after initialization. `Default: True` | `True` |
| `check_errors` | `bool` | Whether to check the dataset for corruption. `Default: False` | `False` |
| `workers` | `Literal['config'] \| int` | The number of workers to use during initialization. `Default: "config"` | `'config'` |

Returns:

| Type | Description |
| --- | --- |
| `TimeBasedCesnetDataset \| SeriesBasedCesnetDataset \| DisjointTimeBasedCesnetDataset` | The initialized dataset. |
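The relationship between the `workers` argument and the `init_workers` config attribute can be pictured with a small helper. This is a sketch only: `resolve_workers` is a hypothetical function written for illustration, and the type check it performs is an assumption rather than documented library behaviour.

```python
from typing import Literal, Union


def resolve_workers(workers: Union[Literal["config"], int], init_workers: int) -> int:
    """Return the worker count to use for initialization.

    Hypothetical helper: "config" defers to the config's init_workers value,
    while an explicit integer overrides it.
    """
    if isinstance(workers, str):
        if workers == "config":
            return init_workers
        raise ValueError(f"workers must be 'config' or an int, got {workers!r}")
    # bool is a subclass of int, so it is rejected explicitly (an assumption).
    if isinstance(workers, bool) or not isinstance(workers, int):
        raise ValueError(f"workers must be 'config' or an int, got {workers!r}")
    return workers


from_config = resolve_workers("config", init_workers=4)  # uses init_workers
explicit = resolve_workers(2, init_workers=4)            # explicit value wins
```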

Source code in cesnet_tszoo\benchmarks.py

def get_initialized_dataset(self, display_config_details: bool = True, check_errors: bool = False, workers: Literal["config"] | int = "config") -> TimeBasedCesnetDataset | SeriesBasedCesnetDataset | DisjointTimeBasedCesnetDataset:
    """
    Returns the dataset with initialized sets, transformers, fillers, etc.

    This method uses the following config attributes:

    | Dataset config                         | Description                                                                                   |
    | -------------------------------------- | --------------------------------------------------------------------------------------------- |
    | `init_workers`                         | Specifies the number of workers to use for initialization. Applied when `workers` = "config". |
    | `partial_fit_initialized_transformers` | Determines whether initialized transformers should be partially fitted on the training data.  |
    | `nan_threshold`                        | Filters out time series with missing values exceeding the specified threshold.                |

    Parameters:
        display_config_details: Whether to display the configuration values after initialization. `Default: True`
        check_errors: Whether to check the dataset for corruption. `Default: False`
        workers: The number of workers to use during initialization. `Default: "config"`

    Returns:
        The initialized dataset.
    """

    if check_errors:
        self.dataset.check_errors()

    self.dataset.set_dataset_config_and_initialize(self.config, display_config_details, workers)

    return self.dataset

get_related_results

get_related_results() -> pd.DataFrame | None

Returns the related results as a Pandas DataFrame, if they exist.

Returns:

| Type | Description |
| --- | --- |
| `DataFrame \| None` | A Pandas DataFrame containing related results, or `None` if no related results exist. |

Source code in cesnet_tszoo\benchmarks.py

def get_related_results(self) -> pd.DataFrame | None:
    """
    Returns the related results as a Pandas [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html), if they exist. 

    Returns:
        A Pandas DataFrame containing related results, or `None` if no related results exist.
    """

    return self.related_results

load_benchmark

load_benchmark(identifier: str, data_root: str) -> Benchmark

Load a benchmark using the identifier.

It first attempts to load a built-in benchmark. If no built-in benchmark with that identifier exists, it attempts to load a custom benchmark from the "data_root"/tszoo/benchmarks/ directory.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `identifier` | `str` | The name of the benchmark YAML file. | required |
| `data_root` | `str` | Path to the folder where the dataset will be stored. Each database has its own subfolder `"data_root"/tszoo/databases/database_name/`. | required |

Returns:

| Type | Description |
| --- | --- |
| `Benchmark` | Returns benchmark with `config`, `annotations`, `dataset` and `related_results`. |
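The lookup order (built-in first, then the custom directory under `data_root`) might look roughly like the sketch below. It is not the library's implementation: `find_benchmark_file` and its `built_in_dir` parameter are hypothetical, and the `.yaml` suffix and the `FileNotFoundError` on a miss are assumptions made for the example.

```python
import os
import tempfile


def find_benchmark_file(identifier: str, data_root: str, built_in_dir: str):
    """Return (path, is_built_in) for the first matching benchmark file.

    Hypothetical sketch: `built_in_dir` and the `.yaml` suffix are assumptions.
    """
    if not isinstance(identifier, str):
        raise ValueError("Invalid identifier.")

    # Same path normalization as in load_benchmark.
    data_root = os.path.normpath(os.path.expanduser(data_root))
    file_name = f"{identifier}.yaml"

    # 1. Built-in benchmarks take precedence.
    built_in_path = os.path.join(built_in_dir, file_name)
    if os.path.isfile(built_in_path):
        return built_in_path, True

    # 2. Fall back to custom benchmarks under "data_root"/tszoo/benchmarks/.
    custom_path = os.path.join(data_root, "tszoo", "benchmarks", file_name)
    if os.path.isfile(custom_path):
        return custom_path, False

    raise FileNotFoundError(f"No benchmark named {identifier!r} was found.")


# Demo in a temporary directory: one custom benchmark, no built-in ones.
with tempfile.TemporaryDirectory() as tmp:
    built_in_dir = os.path.join(tmp, "built_in")
    custom_dir = os.path.join(tmp, "data", "tszoo", "benchmarks")
    os.makedirs(built_in_dir)
    os.makedirs(custom_dir)
    open(os.path.join(custom_dir, "my_benchmark.yaml"), "w").close()

    path, is_built_in = find_benchmark_file(
        "my_benchmark", os.path.join(tmp, "data"), built_in_dir
    )
    demo_result = (os.path.basename(path), is_built_in)
```

Checking the built-in location first means a custom benchmark cannot shadow a built-in one that shares its identifier.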

Source code in cesnet_tszoo\benchmarks.py

def load_benchmark(identifier: str, data_root: str) -> Benchmark:
    """
    Load a benchmark using the identifier.

    It first attempts to load a built-in benchmark. If no built-in benchmark with that identifier exists, it attempts to load a custom benchmark from the `"data_root"/tszoo/benchmarks/` directory.

    Parameters:
        identifier: The name of the benchmark YAML file.
        data_root: Path to the folder where the dataset will be stored. Each database has its own subfolder `"data_root"/tszoo/databases/database_name/`.

    Returns:
        Returns benchmark with `config`, `annotations`, `dataset` and `related_results`.
    """

    logger = logging.getLogger("benchmark")

    data_root = os.path.normpath(os.path.expanduser(data_root))

    # Resolve a string identifier to a built-in or custom benchmark; reject any other type
    if isinstance(identifier, str):
        _, is_built_in = get_benchmark_path_and_whether_it_is_built_in(identifier, data_root, logger)

        if is_built_in:
            logger.info("Built-in benchmark found: %s. Loading it.", identifier)
            return _get_built_in_benchmark(identifier, data_root)
        else:
            logger.info("Custom benchmark found: %s. Loading it.", identifier)
            return _get_custom_benchmark(identifier, data_root)

    else:
        logger.error("Invalid identifier.")
        raise ValueError("Invalid identifier.")