
Choosing data

This tutorial covers the configuration options for choosing which data to load.

Each dataset type has its own section because the available configuration values differ.

TimeBasedCesnetDataset dataset

Note

For every configuration option and more detailed examples, refer to the Jupyter notebook time_based_choosing_data.

Relevant configuration values:

  • ts_ids - Defines which time series IDs are used for train/val/test/all.
  • test_ts_ids - Defines which time series IDs are used in the test_other set.
  • train_time_period/val_time_period/test_time_period - Defines time periods for train/val/test sets.
  • features_to_take - Defines which features are used.
  • include_time - If True, time data is included in the returned values.
  • include_ts_id - If True, time series IDs are included in the returned values.
  • time_format - Format for the returned time data.
  • random_state - Fixes randomness for reproducibility when setting ts_ids or test_ts_ids.

Selecting which time series to load

  • ts_ids sets the time series that will be used for the train/val/test/all sets.
from cesnet_tszoo.configs import TimeBasedConfig

# Select time series by count. They are chosen randomly from the available time series.
# Affected by random_state.
config = TimeBasedConfig(ts_ids=54, random_state=111)

# Select time series as a fraction of the time series in the dataset. They are chosen randomly from the available time series.
# Affected by random_state.
config = TimeBasedConfig(ts_ids=0.1, random_state=111)

# Select specific time series by ID.
config = TimeBasedConfig(ts_ids=[0, 1, 2, 3, 4, 5])

# Call on time-based dataset to use created config
time_based_dataset.set_dataset_config_and_initialize(config)
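
The three accepted forms of ts_ids (count, fraction, explicit list) can be pictured with a small standalone sketch. The helper pick_ts_ids below is hypothetical and is not part of cesnet_tszoo. It only illustrates how a seed makes a random selection reproducible, and the IDs it returns will not match what the library selects.

```python
import random


def pick_ts_ids(available, ts_ids, random_state=None):
    """Hypothetical helper: resolve a count, a fraction, or an explicit list to IDs."""
    rng = random.Random(random_state)  # a fixed seed gives a repeatable choice
    if isinstance(ts_ids, float):
        # A float is read as a fraction of all available time series.
        return sorted(rng.sample(available, round(len(available) * ts_ids)))
    if isinstance(ts_ids, int):
        # An int is read as the number of time series to draw.
        return sorted(rng.sample(available, ts_ids))
    # Anything else is treated as an explicit list of IDs.
    return list(ts_ids)


available = list(range(500))  # assumed pool of time series IDs
by_count = pick_ts_ids(available, 54, random_state=111)
by_fraction = pick_ts_ids(available, 0.1, random_state=111)
explicit = pick_ts_ids(available, [0, 1, 2, 3, 4, 5])
```

Calling the helper twice with the same random_state returns the same IDs, which is the property random_state provides in the configs above.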

You can also specify time series for test_ts_ids, but they are only used when test_time_period is set. They are set the same way as ts_ids.

from cesnet_tszoo.configs import TimeBasedConfig

# Both ts_ids and test_ts_ids will contain unique time series.
config = TimeBasedConfig(ts_ids=54, test_ts_ids=20, test_time_period=range(0, 1000), random_state=111)

time_based_dataset.set_dataset_config_and_initialize(config)
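
One way to guarantee that two randomly chosen groups never share a time series is to shuffle the pool once and cut it into consecutive slices. The sketch below assumes this approach purely for illustration. pick_disjoint is a hypothetical helper, and how cesnet_tszoo enforces uniqueness internally is not stated here.

```python
import random


def pick_disjoint(available, n_ts_ids, n_test_ts_ids, random_state=None):
    """Hypothetical helper: draw two non-overlapping groups of time series IDs."""
    if n_ts_ids + n_test_ts_ids > len(available):
        raise ValueError("not enough time series for both groups")
    rng = random.Random(random_state)
    shuffled = rng.sample(available, len(available))  # shuffled copy of the pool
    ts_ids = shuffled[:n_ts_ids]
    test_ts_ids = shuffled[n_ts_ids:n_ts_ids + n_test_ts_ids]
    return ts_ids, test_ts_ids


ts_ids, test_ts_ids = pick_disjoint(list(range(500)), 54, 20, random_state=111)
```

Because both groups are slices of the same shuffled list, no ID can appear in both.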

Creating train/val/test sets

  • Sets the time period of each set for every time series in ts_ids.
  • You can leave any set as None.
  • Use nan_threshold to set the fraction of NaN values that will be tolerated.
    • nan_threshold = 1.0 means that a time series can be completely empty.
    • It is applied after the sets are created.
    • It is checked separately for every set.
  • Sets must follow these rules:
    • The time periods used must be connected.
    • Sets can share a subset of times.
    • Start of train_time_period < start of val_time_period < start of test_time_period.
from datetime import datetime

from cesnet_tszoo.configs import TimeBasedConfig

# Define the sets as ranges of time indices.
config = TimeBasedConfig(ts_ids=54, train_time_period=range(0, 2000), val_time_period=range(2000, 4000), test_time_period=range(4000, 5000))

# Define the sets with tuples of datetime objects.
# Datetime objects are expected to be in UTC.
config = TimeBasedConfig(
    ts_ids=54,
    train_time_period=(datetime(2023, 10, 9, 0), datetime(2023, 11, 9, 23)),
    val_time_period=(datetime(2023, 11, 9, 23), datetime(2023, 12, 9, 23)),
    test_time_period=(datetime(2023, 12, 9, 23), datetime(2023, 12, 25, 23)),
)

# Define the sets as fractions of the whole time period of the dataset.
# Always starts from the first time.
config = TimeBasedConfig(ts_ids=54, train_time_period=0.5, val_time_period=0.3, test_time_period=0.2)

time_based_dataset.set_dataset_config_and_initialize(config)
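
The ordering and connectivity rules for the sets can be checked before building a config. The following is a minimal sketch for range-based periods only. check_time_periods is a hypothetical helper, and it assumes that "connected" means there is no gap between one set and the next. The library's own validation may be stricter or report errors differently.

```python
def check_time_periods(train, val, test):
    """Hypothetical check of the rules for range-based time periods (step 1 assumed)."""
    # Sets left as None are skipped.
    periods = [p for p in (train, val, test) if p is not None]
    for earlier, later in zip(periods, periods[1:]):
        # Rule: each set must start strictly before the next one.
        if not earlier.start < later.start:
            raise ValueError("each set must start before the next one")
        # Rule (assumed reading of "connected"): no gap between sets.
        # Overlap is allowed, because sets can share a subset of times.
        if later.start > earlier.stop:
            raise ValueError("time periods must be connected")
    return True


check_time_periods(range(0, 2000), range(2000, 4000), range(4000, 5000))  # adjacent sets
check_time_periods(range(0, 2000), range(1500, 4000), None)  # overlapping sets, no test set
```

A call such as check_time_periods(range(0, 2000), range(2500, 4000), None) raises ValueError in this sketch, because indices 2000 to 2499 belong to no set.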

Selecting features

  • features_to_take affects which features are returned when loading data.
  • Setting include_time to True adds time to the returned features.
  • Setting include_ts_id to True adds the time series ID to the returned features.
from cesnet_tszoo.utils.enums import TimeFormat
from cesnet_tszoo.configs import TimeBasedConfig

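# Use all available features.
config = TimeBasedConfig(ts_ids=54, features_to_take="all")

# Use only the listed features.
config = TimeBasedConfig(ts_ids=54, features_to_take=["n_flows", "n_packets"])

# Also return time and time series IDs. time_format sets the format of the returned time.
config = TimeBasedConfig(ts_ids=54, features_to_take=["n_flows", "n_packets"], include_time=True, include_ts_id=True, time_format=TimeFormat.ID_TIME)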

time_based_dataset.set_dataset_config_and_initialize(config)

Selecting all set

  • Contains time series from ts_ids.
from cesnet_tszoo.configs import TimeBasedConfig

# The all set will contain the whole time period of the dataset.
config = TimeBasedConfig(ts_ids=54, train_time_period=None, val_time_period=None, test_time_period=None)

# The all set will contain the total time period of train + val + test.
config = TimeBasedConfig(ts_ids=54, train_time_period=0.5, val_time_period=0.3, test_time_period=0.2)

time_based_dataset.set_dataset_config_and_initialize(config)
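
To see how fractional periods relate to the all set, the fractions can be resolved to index ranges by hand. This sketch assumes that the sets follow one another starting from the first time index and that boundaries are rounded to the nearest index. split_by_fraction is a hypothetical helper, and the library's exact boundary handling may differ.

```python
def split_by_fraction(n_times, train, val, test):
    """Hypothetical helper: turn fractions of the time axis into consecutive ranges."""
    train_end = round(n_times * train)
    val_end = train_end + round(n_times * val)
    test_end = val_end + round(n_times * test)
    train_range = range(0, train_end)  # always starts from the first time
    val_range = range(train_end, val_end)
    test_range = range(val_end, test_end)
    # The all set spans everything covered by train + val + test.
    all_range = range(train_range.start, test_range.stop)
    return train_range, val_range, test_range, all_range


train_range, val_range, test_range, all_range = split_by_fraction(10_000, 0.5, 0.3, 0.2)
```

With 10,000 time indices and the fractions used above, the all set in this sketch covers the full range from 0 to 10,000.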

SeriesBasedCesnetDataset dataset

Note

For every configuration option and more detailed examples, refer to the Jupyter notebook series_based_choosing_data.

Relevant configuration values:

  • time_period - Defines the time period for train/val/test/all sets.
  • train_ts/val_ts/test_ts - Defines time series for train/val/test sets.
  • features_to_take - Defines which features are used.
  • include_time - If True, time data is included in the returned values.
  • include_ts_id - If True, time series IDs are included in the returned values.
  • time_format - Format for the returned time data.
  • random_state - Fixes randomness for reproducibility when setting train_ts, val_ts, test_ts.

Selecting time period

  • time_period sets the time period for all sets (applied to every time series used).
from datetime import datetime

from cesnet_tszoo.configs import SeriesBasedConfig

# Use the whole time period of the dataset.
config = SeriesBasedConfig(time_period="all")

# Set the time period as a range of time indices.
config = SeriesBasedConfig(time_period=range(0, 2000))

# Set the time period with a tuple of datetime objects.
# Datetime objects are expected to be in UTC.
config = SeriesBasedConfig(time_period=(datetime(2023, 10, 9, 0), datetime(2023, 11, 9, 23)))

# Set the time period as a fraction of the whole time period of the dataset.
# Always starts from the first time.
config = SeriesBasedConfig(time_period=0.5)

# Call on series-based dataset to use created config
series_based_dataset.set_dataset_config_and_initialize(config)

Creating train/val/test sets

  • Sets which time series, or how many, will be in each set.
  • You can leave any set as None.
  • Each set must have unique time series.
  • Use nan_threshold to set the fraction of NaN values that will be tolerated.
    • nan_threshold = 1.0 means that a time series can be completely empty.
    • It is applied after the sets are created.
from cesnet_tszoo.configs import SeriesBasedConfig

# Select time series for each set by count. They are chosen randomly from the available time series.
# Each set will contain unique time series.
# Affected by random_state.
config = SeriesBasedConfig(time_period=0.5, train_ts=54, val_ts=25, test_ts=10, random_state=None, nan_threshold=1.0)

# Select time series for each set as a fraction of the time series in the dataset. They are chosen randomly from the available time series.
# Each set will contain unique time series.
# Affected by random_state.
config = SeriesBasedConfig(time_period=0.5, train_ts=0.5, val_ts=0.2, test_ts=0.1, random_state=None, nan_threshold=1.0)

# Define the sets with specific time series.
config = SeriesBasedConfig(time_period=0.5, train_ts=[0, 1, 2, 3, 4], val_ts=[5, 6, 7, 8, 9], test_ts=[10, 11, 12, 13, 14], nan_threshold=1.0)

series_based_dataset.set_dataset_config_and_initialize(config)
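
The effect of nan_threshold can be illustrated with plain Python lists. The sketch below assumes that the threshold is the largest tolerated fraction of NaN values in a time series and that series above it are dropped. Both helpers are hypothetical, and whether the library drops such series or handles them another way is not specified in this tutorial.

```python
import math


def nan_fraction(values):
    """Fraction of NaN entries in a sequence of floats."""
    return sum(1 for v in values if math.isnan(v)) / len(values)


def filter_by_nan_threshold(series_by_id, nan_threshold):
    """Hypothetical helper: keep series whose NaN fraction is within the threshold."""
    return {
        ts_id: values
        for ts_id, values in series_by_id.items()
        if nan_fraction(values) <= nan_threshold
    }


nan = float("nan")
series_by_id = {
    0: [1.0, nan, 3.0, 4.0],  # 25 % NaN
    1: [nan, nan, nan, nan],  # completely empty
    2: [1.0, 2.0, 3.0, 4.0],  # no NaN
}
kept_half = filter_by_nan_threshold(series_by_id, 0.5)
kept_all = filter_by_nan_threshold(series_by_id, 1.0)
```

With nan_threshold=1.0 the completely empty series 1 is kept, which matches the note that a time series can then be completely empty. With 0.5 only series 0 and 2 remain.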

Selecting features

  • features_to_take affects which features are returned when loading data.
  • Setting include_time to True adds time to the returned features.
  • Setting include_ts_id to True adds the time series ID to the returned features.
from cesnet_tszoo.utils.enums import TimeFormat
from cesnet_tszoo.configs import SeriesBasedConfig

# Use all available features.
config = SeriesBasedConfig(time_period=0.5, features_to_take="all")

# Use only the listed features.
config = SeriesBasedConfig(time_period=0.5, features_to_take=["n_flows", "n_packets"])

# Also return time and time series IDs. time_format sets the format of the returned time.
config = SeriesBasedConfig(time_period=0.5, features_to_take=["n_flows", "n_packets"], include_time=True, include_ts_id=True, time_format=TimeFormat.ID_TIME)

# Call on series-based dataset to use created config
series_based_dataset.set_dataset_config_and_initialize(config)

Selecting all set

from cesnet_tszoo.configs import SeriesBasedConfig

# The all set will contain all time series from the dataset.
config = SeriesBasedConfig(time_period=0.5, train_ts=None, val_ts=None, test_ts=None)

# The all set will contain all time series selected for the other sets.
config = SeriesBasedConfig(time_period=0.5, train_ts=54, val_ts=25, test_ts=10)

series_based_dataset.set_dataset_config_and_initialize(config)