Handling missing data
This tutorial will look at some configuration options used for handling missing data.
Only time-based will be used, because all methods work almost the same way for series-based.
Note
For every configuration and more detailed examples refer to Jupyter notebook handling_missing_data
Relevant configuration values:
default_values
- Default values for missing data, applied before fillers.fill_missing_with
- Defines how to fill missing values in the dataset. Can pass FillerType enum or custom Filler type.
Default values
- Default values are set to missing values before filler is used.
- You can change used default values later with
update_dataset_config_and_initialize
orset_default_values
.
from cesnet_tszoo.configs import TimeBasedConfig
# Default values are provided from used dataset.
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
default_values="default")
# All missing values will be set as None
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
default_values=None)
# All missing values will be set with 0
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
default_values=0)
# Using list to specify default values for each used feature
# Position of values in list correspond to order of features in `features_to_take`.
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
default_values=[1, None])
# Using dictionary with key as name for used feature and value as a default value for missing data
# Dictionary must contain key and value for every feature in `features_to_take`.
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
default_values={"n_flows" : 1, "n_packets": None})
# Call on time-based dataset to use created config
time_based_dataset.set_dataset_config_and_initialize(config)
time_based_dataset.update_dataset_config_and_initialize(default_values="default", workers=0)
# Or
time_based_dataset.set_default_values(default_values="default", workers=0)
Fillers
- Fillers are implemented as classes.
- You can create your own or use built-in one.
- One filler per time series is created.
- Filler is applied after default values and usually overrides them.
- Fillers in time-based dataset can carry over values from train -> val -> test. Example is in Jupyter notebook.
- You can change used filler later with
update_dataset_config_and_initialize
orapply_filler
.
Built-in
To see all built-in fillers refer to Fillers
.
from cesnet_tszoo.utils.enums import FillerType
from cesnet_tszoo.configs import TimeBasedConfig
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
default_values=None, fill_missing_with=FillerType.FORWARD_FILLER)
# Call on time-based dataset to use created config
time_based_dataset.set_dataset_config_and_initialize(config)
Or later with:
time_based_dataset.update_dataset_config_and_initialize(fill_missing_with=FillerType.FORWARD_FILLER, workers=0)
# Or
time_based_dataset.apply_filler(fill_missing_with=FillerType.FORWARD_FILLER, workers=0)
Custom
You can create your own custom filler, which must derive from 'Filler' base class.
To check Filler base class refer to Filler
from cesnet_tszoo.utils.filler import Filler
from cesnet_tszoo.configs import TimeBasedConfig
class CustomFiller(Filler):
def fill(self, batch_values: np.ndarray, existing_indices: np.ndarray, missing_indices: np.ndarray, **kwargs):
batch_values[missing_indices] = -1
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
default_values=None, fill_missing_with=CustomFiller)
# Call on time-based dataset to use created config
time_based_dataset.set_dataset_config_and_initialize(config)
Or later with:
time_based_dataset.update_dataset_config_and_initialize(fill_missing_with=CustomFiller, workers=0)
# Or
time_based_dataset.apply_filler(CustomFiller, workers=0)