Skip to content

Handling missing data

This tutorial will look at some configuration options used for handling missing data.

Only time-based will be used, because all methods work almost the same way for series-based.

Note

For every configuration and more detailed examples refer to Jupyter notebook handling_missing_data

Relevant configuration values:

  • default_values - Default values for missing data, applied before fillers.
  • fill_missing_with - Defines how to fill missing values in the dataset. Can pass FillerType enum or custom Filler type.

Default values

  • Default values are set to missing values before filler is used.
  • You can change used default values later with update_dataset_config_and_initialize or set_default_values.
from cesnet_tszoo.configs import TimeBasedConfig

# Default values are provided from used dataset.
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values="default")

# All missing values will be set as None
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None)     

# All missing values will be set with 0
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=0) 

# Using list to specify default values for each used feature
# Position of values in list correspond to order of features in `features_to_take`.
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=[1, None])       

# Using dictionary with key as name for used feature and value as a default value for missing data
# Dictionary must contain key and value for every feature in `features_to_take`.
config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values={"n_flows" : 1, "n_packets": None})                                                                                       

# Call on time-based dataset to use created config
time_based_dataset.set_dataset_config_and_initialize(config)

time_based_dataset.update_dataset_config_and_initialize(default_values="default", workers=0)
# Or
time_based_dataset.set_default_values(default_values="default", workers=0)

Fillers

  • Fillers are implemented as classes.
    • You can create your own or use built-in one.
  • One filler per time series is created.
  • Filler is applied after default values and usually overrides them.
  • Fillers in time-based dataset can carry over values from train -> val -> test. Example is in Jupyter notebook.
  • You can change used filler later with update_dataset_config_and_initialize or apply_filler.

Built-in

To see all built-in fillers refer to Fillers.

from cesnet_tszoo.utils.enums import FillerType
from cesnet_tszoo.configs import TimeBasedConfig

config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None, fill_missing_with=FillerType.FORWARD_FILLER)                                                                                

# Call on time-based dataset to use created config
time_based_dataset.set_dataset_config_and_initialize(config)

Or later with:

time_based_dataset.update_dataset_config_and_initialize(fill_missing_with=FillerType.FORWARD_FILLER, workers=0)
# Or
time_based_dataset.apply_filler(fill_missing_with=FillerType.FORWARD_FILLER, workers=0)

Custom

You can create your own custom filler, which must derive from 'Filler' base class.

To check Filler base class refer to Filler

from cesnet_tszoo.utils.filler import Filler
from cesnet_tszoo.configs import TimeBasedConfig

class CustomFiller(Filler):
    def fill(self, batch_values: np.ndarray, existing_indices: np.ndarray, missing_indices: np.ndarray, **kwargs):
        batch_values[missing_indices] = -1

config = TimeBasedConfig(ts_ids=[1200], train_time_period=range(0, 30), test_time_period=range(30, 80), features_to_take=['n_flows', 'n_packets'],
                         default_values=None, fill_missing_with=CustomFiller)                                                                            

# Call on time-based dataset to use created config
time_based_dataset.set_dataset_config_and_initialize(config)

Or later with:

time_based_dataset.update_dataset_config_and_initialize(fill_missing_with=CustomFiller, workers=0)
# Or
time_based_dataset.apply_filler(CustomFiller, workers=0)