Using scalers
This tutorial looks at the configuration options for using scalers.
Each dataset type has its own section because the available configuration values differ between them.
TimeBasedCesnetDataset dataset
Note
For all configuration options and more detailed examples, refer to the Jupyter notebook time_based_using_scalers.
Relevant configuration values:
scale_with
- Defines the scaler used to transform the dataset.
create_scaler_per_time_series
- If True, a separate scaler is created for each time series, and scalers won't be used for the time series in test_ts_ids.
partial_fit_initialized_scalers
- If True, partial fitting on the train set is performed when using initialized scalers.
Scalers
- Scalers are implemented as classes.
- You can create your own or use a built-in one.
- A scaler must implement transform.
- Scalers are applied after default_values and fillers have taken care of missing values.
- To use scalers, a train set must be defined (unless the scalers are already fitted and partial_fit_initialized_scalers is False).
- fit method on scaler:
  - must be implemented when create_scaler_per_time_series is True and the scalers are not already fitted.
- partial_fit method on scaler:
  - must be implemented when create_scaler_per_time_series is False, or when using already fitted scalers with partial_fit_initialized_scalers set to True.
- You can change the scaler later with update_dataset_config_and_initialize or apply_scaler.
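The fit and partial_fit requirements above follow from how a scaler is fed data in each mode. The sketch below is not the cesnet-tszoo implementation; it uses a hypothetical `MinMaxSketch` class and plain NumPy arrays to illustrate the difference between one scaler fitted per time series and one shared scaler updated incrementally.

```python
import numpy as np


class MinMaxSketch:
    """Hypothetical min-max scaler used only for illustration."""

    def __init__(self):
        self.min = None
        self.max = None

    def fit(self, data):
        # Sees all data of one time series at once.
        self.min = data.min(axis=0)
        self.max = data.max(axis=0)

    def partial_fit(self, data):
        # Sees data piece by piece and widens the stored range.
        batch_min, batch_max = data.min(axis=0), data.max(axis=0)
        if self.min is None:
            self.min, self.max = batch_min, batch_max
        else:
            self.min = np.minimum(self.min, batch_min)
            self.max = np.maximum(self.max, batch_max)

    def transform(self, data):
        return (data - self.min) / (self.max - self.min)


# Two toy time series with one feature each (rows are time steps).
series_a = np.array([[0.0], [5.0], [10.0]])
series_b = np.array([[100.0], [150.0], [200.0]])

# Mode 1: a separate scaler per time series, each fitted with `fit`.
per_series = []
for series in (series_a, series_b):
    scaler = MinMaxSketch()
    scaler.fit(series)
    per_series.append(scaler.transform(series))

# Mode 2: one shared scaler, updated with `partial_fit` for every time series.
shared = MinMaxSketch()
for series in (series_a, series_b):
    shared.partial_fit(series)
shared_a = shared.transform(series_a)
shared_b = shared.transform(series_b)
```

With per-series scalers, each series is mapped onto the full [0, 1] range on its own; with the shared scaler, both series are placed on one common scale.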
Built-in
To see all built-in scalers, refer to Scalers.
from cesnet_tszoo.utils.enums import ScalerType
from cesnet_tszoo.configs import TimeBasedConfig
config = TimeBasedConfig(ts_ids=[1367, 1368], train_time_period=0.5, val_time_period=0.2, test_time_period=0.1,
                         test_ts_ids=[1370], features_to_take=['n_flows', 'n_packets'],
                         scale_with=ScalerType.MIN_MAX_SCALER, create_scaler_per_time_series=True)
# Call on time-based dataset to use created config
time_based_dataset.set_dataset_config_and_initialize(config)
Or later with:
time_based_dataset.update_dataset_config_and_initialize(scale_with=ScalerType.MIN_MAX_SCALER, create_scaler_per_time_series=True, partial_fit_initialized_scalers="config", workers=0)
# Or
time_based_dataset.apply_scaler(scale_with=ScalerType.MIN_MAX_SCALER, create_scaler_per_time_series=True, partial_fit_initialized_scalers="config", workers=0)
Custom
You can create your own custom scaler. It is recommended to derive it from the Scaler base class.
For details on the Scaler base class, refer to Scaler.
import numpy as np

from cesnet_tszoo.utils.scaler import Scaler
from cesnet_tszoo.configs import TimeBasedConfig


class CustomScaler(Scaler):
    def __init__(self):
        super().__init__()
        self.max = None
        self.min = None

    def transform(self, data):
        return (data - self.min) / (self.max - self.min)

    def fit(self, data):
        self.partial_fit(data)

    def partial_fit(self, data):
        # First call: initialize the per-feature minimum and maximum.
        if self.max is None and self.min is None:
            self.max = np.max(data, axis=0)
            self.min = np.min(data, axis=0)
            return

        # Later calls: widen the stored range with the new data.
        temp_max = np.max(data, axis=0)
        temp = np.vstack((self.max, temp_max))
        self.max = np.max(temp, axis=0)

        temp_min = np.min(data, axis=0)
        temp = np.vstack((self.min, temp_min))
        self.min = np.min(temp, axis=0)


config = TimeBasedConfig(ts_ids=[1367, 1368], train_time_period=0.5, val_time_period=0.2, test_time_period=0.1,
                         test_ts_ids=[1370], features_to_take=['n_flows', 'n_packets'],
                         scale_with=CustomScaler, create_scaler_per_time_series=True)

time_based_dataset.set_dataset_config_and_initialize(config)
Or later with:
time_based_dataset.update_dataset_config_and_initialize(scale_with=CustomScaler, create_scaler_per_time_series=True, partial_fit_initialized_scalers="config", workers=0)
# Or
time_based_dataset.apply_scaler(scale_with=CustomScaler, create_scaler_per_time_series=True, partial_fit_initialized_scalers="config", workers=0)
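One caveat of the min-max formula used in transform above: if a feature is constant in the fitted data, `max - min` is zero and the division produces `nan` or `inf`. Below is a minimal sketch of one possible guard, written against plain NumPy rather than the Scaler base class. The helper name `safe_min_max_transform` is hypothetical, and mapping constant features to 0 is only one of several reasonable choices.

```python
import numpy as np


def safe_min_max_transform(data, data_min, data_max):
    """Min-max scale `data`, mapping constant features to 0 instead of dividing by zero."""
    value_range = data_max - data_min
    # Replace zero ranges with 1 so the division is well defined.
    safe_range = np.where(value_range == 0, 1.0, value_range)
    return (data - data_min) / safe_range


# The second feature is constant, so its range is zero.
data = np.array([[1.0, 7.0], [3.0, 7.0], [5.0, 7.0]])
scaled = safe_min_max_transform(data, data.min(axis=0), data.max(axis=0))
```

The same guard could be placed inside the transform method of a custom scaler.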
Using already fitted scalers
from cesnet_tszoo.configs import TimeBasedConfig
# Option 1: a list of fitted scalers.
# Length of list_of_fitted_scalers must be equal to the number of time series in ts_ids.
# All scalers in list_of_fitted_scalers must be of the same type.
config = TimeBasedConfig(ts_ids=[103, 118], train_time_period=0.5, val_time_period=0.2, test_time_period=0.1,
                         test_ts_ids=[1370], features_to_take=['n_flows', 'n_packets'],
                         scale_with=list_of_fitted_scalers, create_scaler_per_time_series=True)

# Option 2: a single fitted scaler.
# one_prefitted_scaler must be just one scaler (not a list).
config = TimeBasedConfig(ts_ids=[103, 118], train_time_period=0.5, val_time_period=0.2, test_time_period=0.1,
                         test_ts_ids=[1370], features_to_take=['n_flows', 'n_packets'],
                         scale_with=one_prefitted_scaler, create_scaler_per_time_series=True)

time_based_dataset.set_dataset_config_and_initialize(config)
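The two constraints stated in the comments above (the list length matches ts_ids, and all scalers share one type) can be checked before building the config. The helper below is a hypothetical sketch and not part of cesnet-tszoo; `DummyScaler` stands in for any fitted scaler class.

```python
def check_fitted_scalers(scalers, ts_ids):
    """Raise ValueError if `scalers` cannot be paired one-to-one with `ts_ids`."""
    if len(scalers) != len(ts_ids):
        raise ValueError(
            f"Expected {len(ts_ids)} scalers, one per time series, got {len(scalers)}."
        )
    scaler_types = {type(scaler) for scaler in scalers}
    if len(scaler_types) > 1:
        names = sorted(t.__name__ for t in scaler_types)
        raise ValueError(f"All scalers must be of the same type, got {names}.")


class DummyScaler:
    """Placeholder for a fitted scaler (illustration only)."""


ts_ids = [103, 118]

# Two scalers of the same type for two time series: passes silently.
check_fitted_scalers([DummyScaler(), DummyScaler()], ts_ids)

# One scaler for two time series: rejected.
try:
    check_fitted_scalers([DummyScaler()], ts_ids)
    length_error = None
except ValueError as error:
    length_error = str(error)
```

Failing early with a clear message can be easier to debug than an error raised during dataset initialization.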
SeriesBasedCesnetDataset dataset
Note
For all configuration options and more detailed examples, refer to the Jupyter notebook series_based_using_scalers.
Relevant configuration values:
scale_with
- Defines the scaler used to transform the dataset.
partial_fit_initialized_scalers
- If True, partial fitting on the train set is performed when using an initialized scaler.
Scalers
- Scalers are implemented as classes.
- You can create your own or use a built-in one.
- The scaler is applied after default_values and fillers have taken care of missing values.
- One scaler is used for all time series.
- The scaler must implement transform.
- The scaler must implement partial_fit (unless the scaler is already fitted and partial_fit_initialized_scalers is False).
- To use a scaler, a train set must be defined (unless the scaler is already fitted and partial_fit_initialized_scalers is False).
- You can change the scaler later with update_dataset_config_and_initialize or apply_scaler.
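Because a single scaler covers all time series, it has to accumulate its statistics incrementally, which is what partial_fit is for. The sketch below is an illustration under stated assumptions, not the library's code: a hypothetical `RunningStandardScaler` merges per-batch mean and variance with the standard pairwise update, so that fitting one time series at a time gives the same statistics as fitting on all data at once.

```python
import numpy as np


class RunningStandardScaler:
    """Hypothetical standard scaler that supports incremental fitting."""

    def __init__(self):
        self.count = 0
        self.mean = None
        self.m2 = None  # running sum of squared deviations from the mean

    def partial_fit(self, data):
        batch_count = data.shape[0]
        batch_mean = data.mean(axis=0)
        batch_m2 = ((data - batch_mean) ** 2).sum(axis=0)
        if self.count == 0:
            self.count, self.mean, self.m2 = batch_count, batch_mean, batch_m2
            return
        total = self.count + batch_count
        delta = batch_mean - self.mean
        # Merge the running statistics with the statistics of the new batch.
        self.m2 = self.m2 + batch_m2 + delta**2 * self.count * batch_count / total
        self.mean = self.mean + delta * batch_count / total
        self.count = total

    def transform(self, data):
        std = np.sqrt(self.m2 / self.count)  # population standard deviation
        return (data - self.mean) / std


rng = np.random.default_rng(0)
# Three toy "time series", each with 50 time steps and 2 features.
all_series = [rng.normal(loc=i, scale=1 + i, size=(50, 2)) for i in range(3)]

scaler = RunningStandardScaler()
for series in all_series:
    scaler.partial_fit(series)

# Reference: statistics computed on all data at once.
stacked = np.vstack(all_series)
```

After the loop, `scaler.mean` and the derived standard deviation agree with `stacked.mean(axis=0)` and `stacked.std(axis=0)` up to floating-point error.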
Built-in
To see all built-in scalers, refer to Scalers.
from cesnet_tszoo.utils.enums import ScalerType
from cesnet_tszoo.configs import SeriesBasedConfig
config = SeriesBasedConfig(time_period=0.5, train_ts=500, features_to_take=["n_flows", "n_packets"],
                           scale_with=ScalerType.MIN_MAX_SCALER, nan_threshold=0.5, random_state=1500)
# Call on series-based dataset to use created config
series_based_dataset.set_dataset_config_and_initialize(config)
Or later with:
series_based_dataset.update_dataset_config_and_initialize(scale_with=ScalerType.MIN_MAX_SCALER, partial_fit_initialized_scalers="config", workers=0)
# Or
series_based_dataset.apply_scaler(scale_with=ScalerType.MIN_MAX_SCALER, partial_fit_initialized_scalers="config", workers=0)
Custom
You can create your own custom scaler. It is recommended to derive it from the Scaler base class.
For details on the Scaler base class, refer to Scaler.
import numpy as np

from cesnet_tszoo.utils.scaler import Scaler
from cesnet_tszoo.configs import SeriesBasedConfig


class CustomScaler(Scaler):
    def __init__(self):
        super().__init__()
        self.max = None
        self.min = None

    def transform(self, data):
        return (data - self.min) / (self.max - self.min)

    def fit(self, data):
        self.partial_fit(data)

    def partial_fit(self, data):
        # First call: initialize the per-feature minimum and maximum.
        if self.max is None and self.min is None:
            self.max = np.max(data, axis=0)
            self.min = np.min(data, axis=0)
            return

        # Later calls: widen the stored range with the new data.
        temp_max = np.max(data, axis=0)
        temp = np.vstack((self.max, temp_max))
        self.max = np.max(temp, axis=0)

        temp_min = np.min(data, axis=0)
        temp = np.vstack((self.min, temp_min))
        self.min = np.min(temp, axis=0)


config = SeriesBasedConfig(time_period=0.5, train_ts=500, features_to_take=["n_flows", "n_packets"],
                           scale_with=CustomScaler, nan_threshold=0.5, random_state=1500)

series_based_dataset.set_dataset_config_and_initialize(config)
Or later with:
series_based_dataset.update_dataset_config_and_initialize(scale_with=CustomScaler, partial_fit_initialized_scalers="config", workers=0)
# Or
series_based_dataset.apply_scaler(scale_with=CustomScaler, partial_fit_initialized_scalers="config", workers=0)
Using already fitted scalers
from cesnet_tszoo.configs import SeriesBasedConfig
# fitted_scaler must be just one scaler (not a list)
config = SeriesBasedConfig(time_period=0.5, val_ts=500, features_to_take=["n_flows", "n_packets"],
                           scale_with=fitted_scaler, nan_threshold=0.5, random_state=999)

series_based_dataset.set_dataset_config_and_initialize(config)
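How a pre-fitted scaler might behave under the two settings of partial_fit_initialized_scalers, as described in the configuration values: with False it is used as-is, with True it is additionally updated on the train set. This is a plain-NumPy sketch with a hypothetical `RangeScaler` class and `prepare_scaler` helper, not the cesnet-tszoo internals. Copying the scaler before updating it is a choice made for this sketch only.

```python
import copy

import numpy as np


class RangeScaler:
    """Hypothetical min-max scaler with incremental fitting."""

    def __init__(self):
        self.min = None
        self.max = None

    def partial_fit(self, data):
        batch_min, batch_max = data.min(axis=0), data.max(axis=0)
        if self.min is None:
            self.min, self.max = batch_min, batch_max
        else:
            self.min = np.minimum(self.min, batch_min)
            self.max = np.maximum(self.max, batch_max)

    def transform(self, data):
        return (data - self.min) / (self.max - self.min)


def prepare_scaler(fitted_scaler, train_data, partial_fit_initialized):
    """Return the scaler to use, optionally updated on the train data."""
    scaler = copy.deepcopy(fitted_scaler)  # sketch-only choice: leave the input untouched
    if partial_fit_initialized:
        scaler.partial_fit(train_data)
    return scaler


earlier_data = np.array([[0.0], [10.0]])  # data the scaler was fitted on beforehand
train_data = np.array([[5.0], [20.0]])    # train set of the current config

fitted = RangeScaler()
fitted.partial_fit(earlier_data)

kept = prepare_scaler(fitted, train_data, partial_fit_initialized=False)
updated = prepare_scaler(fitted, train_data, partial_fit_initialized=True)
```

Here `kept` still scales with the original range of 0 to 10, so train values above 10 map above 1, while `updated` has widened its range to 0 to 20.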