Choosing data
This tutorial covers the configuration options for choosing which data to load.
Each dataset type has its own section because the available configuration values differ.
TimeBasedCesnetDataset dataset
Note
For every configuration option and more detailed examples, refer to the Jupyter notebook time_based_choosing_data.
Relevant configuration values:
- ts_ids - Defines which time series IDs are used for train/val/test/all.
- test_ts_ids - Defines which time series IDs are used in the test_other set.
- train_time_period/val_time_period/test_time_period - Defines time periods for train/val/test sets.
- features_to_take - Defines which features are used.
- include_time - If True, time data is included in the returned values.
- include_ts_id - If True, time series IDs are included in the returned values.
- time_format - Format for the returned time data.
- random_state - Fixes randomness for reproducibility when setting ts_ids or test_ts_ids.
Selecting which time series to load
- ts_ids sets the time series that will be used for the train/val/test/all sets.
from cesnet_tszoo.configs import TimeBasedConfig
# Select time series by count; they are chosen randomly from the available time series.
# Affected by random_state.
config = TimeBasedConfig(ts_ids=54, random_state=111)
# Select time series as a fraction of all time series in the dataset; they are chosen randomly.
# Affected by random_state.
config = TimeBasedConfig(ts_ids=0.1, random_state=111)
# Select specific time series by ID.
config = TimeBasedConfig(ts_ids=[0, 1, 2, 3, 4, 5])
# Call on time-based dataset to use created config
time_based_dataset.set_dataset_config_and_initialize(config)
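The three forms accepted by ts_ids (count, fraction, explicit list) can be pictured with a small standalone sketch. It does not call cesnet_tszoo: resolve_ts_ids, available_ids, and the truncating conversion of a fraction into a count are assumptions made for illustration, and the library may select or round differently.

```python
import random
from typing import Optional, Sequence, Union


def resolve_ts_ids(
    selection: Union[int, float, Sequence[int]],
    available_ids: Sequence[int],
    random_state: Optional[int] = None,
) -> list:
    """Hypothetical helper: turn a count, fraction, or explicit list into IDs."""
    available = list(available_ids)
    if isinstance(selection, float):
        # Assumption: a float is a fraction of all available series, truncated.
        count = int(len(available) * selection)
    elif isinstance(selection, int):
        count = selection
    else:
        # Explicit IDs are used as given, after checking that they exist.
        missing = set(selection) - set(available)
        if missing:
            raise ValueError(f"unknown time series IDs: {sorted(missing)}")
        return list(selection)
    if not 0 <= count <= len(available):
        raise ValueError("requested more time series than are available")
    # A seeded generator makes the random choice reproducible.
    rng = random.Random(random_state)
    return sorted(rng.sample(available, count))


available = range(500)  # illustrative pool of time series IDs
first = resolve_ts_ids(54, available, random_state=111)
second = resolve_ts_ids(54, available, random_state=111)
by_fraction = resolve_ts_ids(0.1, available, random_state=111)
explicit = resolve_ts_ids([0, 1, 2, 3, 4, 5], available)
```

With the same random_state, first and second hold the same IDs, which is the reproducibility the random_state option is meant to provide.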
You can also specify time series with test_ts_ids, but they are only used when test_time_period is set. test_ts_ids can be set the same way as ts_ids.
from cesnet_tszoo.configs import TimeBasedConfig
# Both ts_ids and test_ts_ids will contain unique time series.
config = TimeBasedConfig(ts_ids=54, test_ts_ids=20, test_time_period=range(0, 1000), random_state=111)
time_based_dataset.set_dataset_config_and_initialize(config)
Creating train/val/test sets
- Sets the time period of each set for every time series in ts_ids.
- You can leave any set as None.
- You can use nan_threshold to set how many NaN values will be tolerated.
  - nan_threshold = 1.0 means that a time series can be completely empty.
  - It is applied after the sets are created.
  - It is checked separately for every set.
- Sets must follow these rules:
  - The used time periods must be connected.
  - Sets can share a subset of times.
  - Start of train_time_period < start of val_time_period < start of test_time_period.
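The rules for the sets can be written down as a small checker over index ranges. This is only a sketch: find_period_violations is a hypothetical helper, not part of cesnet_tszoo, and it reads "connected" as "no gap between consecutive used periods", which is an interpretation of the rule rather than the library's documented check.

```python
from typing import List, Optional, Tuple


def find_period_violations(
    train: Optional[range],
    val: Optional[range],
    test: Optional[range],
) -> List[str]:
    """Hypothetical checker for the ordering and connectivity rules."""
    # Sets left as None are simply skipped.
    used: List[Tuple[str, range]] = [
        (name, period)
        for name, period in (("train", train), ("val", val), ("test", test))
        if period is not None
    ]
    problems = []
    for (prev_name, prev), (next_name, nxt) in zip(used, used[1:]):
        if not prev.start < nxt.start:
            problems.append(f"{next_name} must start after {prev_name} starts")
        elif nxt.start > prev.stop:
            # range.stop is exclusive, so start == stop means directly adjacent.
            problems.append(f"gap between {prev_name} and {next_name}")
    return problems


adjacent = find_period_violations(range(0, 2000), range(2000, 4000), range(4000, 5000))
overlapping = find_period_violations(range(0, 2000), range(1500, 4000), None)
with_gap = find_period_violations(range(0, 1000), range(1500, 2000), None)
wrong_order = find_period_violations(range(100, 200), range(50, 150), None)
```

Adjacent and overlapping periods produce no violations, while a gap or a later set that starts too early is reported.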
from datetime import datetime
from cesnet_tszoo.configs import TimeBasedConfig
# Define the sets as ranges of time indices.
config = TimeBasedConfig(ts_ids=54, train_time_period=range(0, 2000), val_time_period=range(2000, 4000), test_time_period=range(4000, 5000))
# Define the sets as tuples of datetime objects.
# Datetime objects are expected to be in UTC.
config = TimeBasedConfig(ts_ids=54, train_time_period=(datetime(2023, 10, 9, 0), datetime(2023, 11, 9, 23)), val_time_period=(datetime(2023, 11, 9, 23), datetime(2023, 12, 9, 23)), test_time_period=(datetime(2023, 12, 9, 23), datetime(2023, 12, 25, 23)))
# Define the sets as fractions of the dataset's whole time period.
# Always starts from the first time.
config = TimeBasedConfig(ts_ids=54, train_time_period=0.5, val_time_period=0.3, test_time_period=0.2)
time_based_dataset.set_dataset_config_and_initialize(config)
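To make the fractional form concrete, the sketch below converts fractions into index ranges. It assumes the fractions are consumed back to back starting at the first time index and that lengths are rounded to the nearest integer; both are assumptions for illustration, as are the name fractions_to_ranges and the total of 10000 time indices.

```python
from typing import Dict


def fractions_to_ranges(
    total_times: int, train: float, val: float, test: float
) -> Dict[str, range]:
    """Hypothetical helper: split [0, total_times) into consecutive ranges."""
    if train + val + test > 1.0 + 1e-9:
        raise ValueError("fractions must not sum to more than 1.0")
    ranges = {}
    start = 0  # the first set always begins at the first time index
    for name, fraction in (("train", train), ("val", val), ("test", test)):
        length = round(total_times * fraction)
        # Clip so that rounding can never run past the end of the dataset.
        stop = min(start + length, total_times)
        ranges[name] = range(start, stop)
        start = stop
    return ranges


split = fractions_to_ranges(10000, train=0.5, val=0.3, test=0.2)
```

Under these assumptions, split maps train to indices 0..4999, val to 5000..7999, and test to 8000..9999.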
Selecting features
- Affects which features will be returned when loading data.
- Setting include_time to True adds time to the features returned when loading data.
- Setting include_ts_id to True adds the time series ID to the features returned when loading data.
from cesnet_tszoo.utils.enums import TimeFormat
from cesnet_tszoo.configs import TimeBasedConfig
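# Use all available features.
config = TimeBasedConfig(ts_ids=54, features_to_take="all")
# Use only the listed features.
config = TimeBasedConfig(ts_ids=54, features_to_take=["n_flows", "n_packets"])
# Also return the time (formatted according to time_format) and the time series ID.
config = TimeBasedConfig(ts_ids=54, features_to_take=["n_flows", "n_packets"], include_time=True, include_ts_id=True, time_format=TimeFormat.ID_TIME)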
time_based_dataset.set_dataset_config_and_initialize(config)
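The effect of these options on the returned values can be pictured as building a list of column names. The sketch is purely illustrative: returned_columns, the column names "ts_id" and "time", the column order, and the feature "n_bytes" are all hypothetical and may not match what cesnet_tszoo actually returns.

```python
from typing import List, Sequence, Union


def returned_columns(
    available: Sequence[str],
    features_to_take: Union[str, Sequence[str]] = "all",
    include_time: bool = False,
    include_ts_id: bool = False,
) -> List[str]:
    """Hypothetical helper: list the columns a loaded record would contain."""
    if features_to_take == "all":
        features = list(available)
    else:
        unknown = [name for name in features_to_take if name not in available]
        if unknown:
            raise ValueError(f"unknown features: {unknown}")
        features = list(features_to_take)
    extra = []
    if include_ts_id:
        extra.append("ts_id")  # hypothetical column name
    if include_time:
        extra.append("time")  # hypothetical column name
    # Assumption: identifier columns come before the feature columns.
    return extra + features


available_features = ["n_flows", "n_packets", "n_bytes"]  # illustrative
only_features = returned_columns(available_features, ["n_flows", "n_packets"])
with_ids = returned_columns(
    available_features, ["n_flows", "n_packets"], include_time=True, include_ts_id=True
)
```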
Selecting all set
- The all set contains the time series from ts_ids.
from cesnet_tszoo.configs import TimeBasedConfig
# The all set will contain the whole time period of the dataset.
config = TimeBasedConfig(ts_ids=54, train_time_period=None, val_time_period=None, test_time_period=None)
# The all set will contain the total time period of train + val + test.
config = TimeBasedConfig(ts_ids=54, train_time_period=0.5, val_time_period=0.3, test_time_period=0.2)
time_based_dataset.set_dataset_config_and_initialize(config)
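The two cases in the comments above can be summarized as one rule over index ranges. The sketch below is a standalone illustration with a hypothetical helper, all_set_period; it assumes the used periods are already connected, so the combined period is a single range.

```python
from typing import Optional


def all_set_period(
    total_times: int,
    train: Optional[range],
    val: Optional[range],
    test: Optional[range],
) -> range:
    """Hypothetical helper: time period covered by the all set."""
    used = [period for period in (train, val, test) if period is not None]
    if not used:
        # No set is defined, so the all set spans the whole dataset.
        return range(0, total_times)
    # Otherwise it spans from the earliest start to the latest (exclusive) stop.
    return range(min(p.start for p in used), max(p.stop for p in used))


whole_dataset = all_set_period(5000, None, None, None)
combined = all_set_period(5000, range(0, 2000), range(2000, 3000), range(3000, 4000))
```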
SeriesBasedCesnetDataset dataset
Note
For every configuration option and more detailed examples, refer to the Jupyter notebook series_based_choosing_data.
Relevant configuration values:
- time_period - Defines the time period for train/val/test/all sets.
- train_ts/val_ts/test_ts - Defines time series for train/val/test.
- features_to_take - Defines which features are used.
- include_time - If True, time data is included in the returned values.
- include_ts_id - If True, time series IDs are included in the returned values.
- time_format - Format for the returned time data.
- random_state - Fixes randomness for reproducibility when setting train_ts, val_ts, test_ts.
Selecting time period
- time_period sets the time period for all sets (used time series).
from datetime import datetime
from cesnet_tszoo.configs import SeriesBasedConfig
# Use the whole time period of the dataset.
config = SeriesBasedConfig(time_period="all")
# Set the time period as a range of time indices.
config = SeriesBasedConfig(time_period=range(0, 2000))
# Set the time period with a tuple of datetime objects.
# Datetime objects are expected to be in UTC.
config = SeriesBasedConfig(time_period=(datetime(2023, 10, 9, 0), datetime(2023, 11, 9, 23)))
# Set the time period as a fraction of the dataset's whole time period.
# Always starts from the first time.
config = SeriesBasedConfig(time_period=0.5)
# Call on series-based dataset to use created config
series_based_dataset.set_dataset_config_and_initialize(config)
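Because datetime bounds are expected to be in UTC, timestamps recorded in local time need converting first. The sketch below uses only the standard library; the fixed UTC+2 offset is an illustrative choice, and whether the config prefers naive or timezone-aware datetimes is not stated here, so both forms are produced.

```python
from datetime import datetime, timedelta, timezone

# Illustrative local zone: a fixed UTC+2 offset (an assumption for this sketch).
LOCAL = timezone(timedelta(hours=2))


def to_utc(local_dt: datetime) -> datetime:
    """Convert a timezone-aware datetime to a timezone-aware UTC datetime."""
    if local_dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return local_dt.astimezone(timezone.utc)


start_local = datetime(2023, 10, 9, 2, 0, tzinfo=LOCAL)
start_utc = to_utc(start_local)
# Naive variant with the same UTC wall-clock time, as used in the examples above.
start_utc_naive = start_utc.replace(tzinfo=None)
```

Here 02:00 at UTC+2 becomes 00:00 UTC on the same day, which matches the datetime(2023, 10, 9, 0) start used in the tuple example.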
Creating train/val/test sets
- Sets how many time series will be in each set.
- You can leave any set as None.
- Each set must have unique time series.
- You can use nan_threshold to set how many NaN values will be tolerated.
  - nan_threshold = 1.0 means that a time series can be completely empty.
  - It is applied after the sets are created.
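One way to read nan_threshold is as the largest tolerated fraction of NaN values in a time series. The sketch below follows that reading; nan_fraction and passes_nan_threshold are hypothetical helpers, and the inclusive comparison is an assumption chosen so that a threshold of 1.0 accepts a completely empty series.

```python
import math
from typing import Sequence


def nan_fraction(values: Sequence[float]) -> float:
    """Fraction of NaN entries; a series with no values counts as fully missing."""
    if len(values) == 0:
        return 1.0
    return sum(1 for value in values if math.isnan(value)) / len(values)


def passes_nan_threshold(values: Sequence[float], nan_threshold: float) -> bool:
    """Hypothetical check: keep the series if its NaN fraction is tolerated."""
    return nan_fraction(values) <= nan_threshold


nan = float("nan")
empty_series = [nan, nan, nan]
half_missing = [1.0, nan, 3.0, nan]

keeps_empty = passes_nan_threshold(empty_series, 1.0)
keeps_half = passes_nan_threshold(half_missing, 0.5)
drops_half = passes_nan_threshold(half_missing, 0.25)
```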
from cesnet_tszoo.configs import SeriesBasedConfig
# Set the number of time series in each set by count; they are chosen randomly from the available time series.
# Each set will contain unique time series.
# Affected by random_state.
config = SeriesBasedConfig(time_period=0.5, train_ts=54, val_ts=25, test_ts=10, random_state=None, nan_threshold=1.0)
# Set the number of time series in each set as a fraction of all time series in the dataset; they are chosen randomly.
# Each set will contain unique time series.
# Affected by random_state.
config = SeriesBasedConfig(time_period=0.5, train_ts=0.5, val_ts=0.2, test_ts=0.1, random_state=None, nan_threshold=1.0)
# Define the sets with specific time series IDs.
config = SeriesBasedConfig(time_period=0.5, train_ts=[0,1,2,3,4], val_ts=[5,6,7,8,9], test_ts=[10,11,12,13,14], nan_threshold=1.0)
series_based_dataset.set_dataset_config_and_initialize(config)
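The requirement that each set has unique time series can be met by shuffling the available IDs once and cutting consecutive slices. This is a sketch of that idea, not the library's implementation; draw_disjoint_sets is a hypothetical helper that handles counts only.

```python
import random
from typing import Dict, List, Optional, Sequence


def draw_disjoint_sets(
    available_ids: Sequence[int],
    train_ts: Optional[int],
    val_ts: Optional[int],
    test_ts: Optional[int],
    random_state: Optional[int] = None,
) -> Dict[str, List[int]]:
    """Hypothetical helper: draw non-overlapping random sets of time series."""
    counts = {"train": train_ts, "val": val_ts, "test": test_ts}
    # Sets left as None are not created.
    requested = {name: n for name, n in counts.items() if n is not None}
    if sum(requested.values()) > len(available_ids):
        raise ValueError("not enough time series for disjoint sets")
    shuffled = list(available_ids)
    random.Random(random_state).shuffle(shuffled)
    sets = {}
    start = 0
    for name, n in requested.items():
        # Consecutive slices of one shuffled list can never overlap.
        sets[name] = sorted(shuffled[start:start + n])
        start += n
    return sets


sets = draw_disjoint_sets(range(200), train_ts=54, val_ts=25, test_ts=10, random_state=0)
repeat = draw_disjoint_sets(range(200), train_ts=54, val_ts=25, test_ts=10, random_state=0)
```

Passing random_state=None, as in the configs above, would make each call draw a different split.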
Selecting features
- Affects which features will be returned when loading data.
- Setting include_time to True adds time to the features returned when loading data.
- Setting include_ts_id to True adds the time series ID to the features returned when loading data.
from cesnet_tszoo.utils.enums import TimeFormat
from cesnet_tszoo.configs import SeriesBasedConfig
# Use all available features.
config = SeriesBasedConfig(time_period=0.5, features_to_take="all")
# Use only the listed features.
config = SeriesBasedConfig(time_period=0.5, features_to_take=["n_flows", "n_packets"])
# Also return the time (formatted according to time_format) and the time series ID.
config = SeriesBasedConfig(time_period=0.5, features_to_take=["n_flows", "n_packets"], include_time=True, include_ts_id=True, time_format=TimeFormat.ID_TIME)
# Call on series-based dataset to use created config
series_based_dataset.set_dataset_config_and_initialize(config)
Selecting all set
from cesnet_tszoo.configs import SeriesBasedConfig
# The all set will contain all time series from the dataset.
config = SeriesBasedConfig(time_period=0.5, train_ts=None, val_ts=None, test_ts=None)
# The all set will contain all time series used by the other sets.
config = SeriesBasedConfig(time_period=0.5, train_ts=54, val_ts=25, test_ts=10)
series_based_dataset.set_dataset_config_and_initialize(config)