Config class

config.DatasetConfig

The main class for the configuration of:

  • Train, validation, test sets (dates, sizes, validation approach).
  • Application selection — either the standard closed-world setting (only known classes) or the open-world setting (known and unknown classes).
  • Data transformations. See the transforms page for more information.
  • Dataloader options like batch sizes, order of loading, or number of workers.

When initializing this class, pass a CesnetDataset instance to be configured and the desired configuration. Available options are listed under Configuration options below.
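The initialization pattern described above (a dataset passed in at construction time, with some attributes copied from it) can be illustrated with the standard library alone. This is a minimal sketch, not the library's implementation: `ToyDataset` and `ToyConfig` are hypothetical stand-ins, and the paths are made up. It only shows how an `InitVar` argument is consumed by `__post_init__` while `field(init=False)` attributes are filled in from it.

```python
from dataclasses import InitVar, dataclass, field


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a CesnetDataset instance."""
    data_root: str
    database_path: str


@dataclass
class ToyConfig:
    """Hypothetical, heavily reduced stand-in for DatasetConfig."""
    dataset: InitVar[ToyDataset]            # consumed by __post_init__, not stored
    data_root: str = field(init=False)      # filled in from the dataset
    database_path: str = field(init=False)  # filled in from the dataset
    batch_size: int = 192                   # a user-facing option with a default

    def __post_init__(self, dataset: ToyDataset) -> None:
        # Copy the dataset-derived attributes; user options are already set.
        self.data_root = dataset.data_root
        self.database_path = dataset.database_path


toy_dataset = ToyDataset(data_root="/tmp/data", database_path="/tmp/data/db.h5")
config = ToyConfig(dataset=toy_dataset, batch_size=256)
print(config.data_root, config.batch_size)
```

Because `dataset` is an `InitVar`, it is available only during initialization and is not kept as an attribute of the config object in this sketch.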

Attributes:

Name Type Description
dataset InitVar[CesnetDataset]

The dataset instance to be configured.

data_root str

Taken from the dataset instance.

database_filename str

Taken from the dataset instance.

database_path str

Taken from the dataset instance.

servicemap_path str

Taken from the dataset instance.

flowstats_features list[str]

Taken from dataset.metadata.flowstats_features.

flowstats_features_boolean list[str]

Taken from dataset.metadata.flowstats_features_boolean.

flowstats_features_phist list[str]

Taken from dataset.metadata.packet_histograms if use_packet_histograms is true, otherwise an empty list.

other_fields list[str]

Taken from dataset.metadata.other_fields if return_other_fields is true, otherwise an empty list.

sni_column Optional[str]

Database column with SNI domains, can be None for datasets without SNI domains.

Configuration options

Attributes:

Name Type Description
need_train_set bool

Use to disable the train set. Default: True

need_val_set bool

Use to disable the validation set. Default: True

need_test_set bool

Use to disable the test set. Default: True

train_period_name str

Name of the train period. See instructions.

train_dates list[str]

Dates used for creating a train set.

train_dates_weigths Optional[list[int]]

To use a non-uniform distribution of samples across train dates.

val_approach ValidationApproach

How a validation set should be created. Either split train data into train and validation or have a separate validation period. Default: SPLIT_FROM_TRAIN

train_val_split_fraction float

The fraction of validation samples when splitting from the train set. Default: 0.2

val_period_name str

Name of the validation period. See instructions.

val_dates list[str]

Dates used for creating a validation set.

test_period_name str

Name of the test period. See instructions.

test_dates list[str]

Dates used for creating a test set.

apps_selection AppSelection

How to select application classes. Default: ALL_KNOWN

apps_selection_topx int

Take top X as known.

apps_selection_background_unknown list[str]

Provide a list of background traffic classes to be used as unknown.

apps_selection_fixed_known list[str]

Provide a list of manually selected known applications.

apps_selection_fixed_unknown list[str]

Provide a list of manually selected unknown applications.

disabled_apps list[str]

List of applications to be disabled and not used at all.

min_train_samples_check MinTrainSamplesCheck

How to handle applications with not enough training samples. Default: DISABLE_APPS

min_train_samples_per_app int

Defines the threshold for not enough. Default: 100

random_state int

Fix all random processes performed during dataset initialization. Default: 420

fold_id int

To perform N-fold cross-validation, set this to 1..N. Each fold will use the same configuration but a different random seed. Default: 0

train_workers int

Number of workers for loading train data. 0 means that the data will be loaded in the main process. Default: 4

test_workers int

Number of workers for loading test data. 0 means that the data will be loaded in the main process. Default: 1

val_workers int

Number of workers for loading validation data. 0 means that the data will be loaded in the main process. Default: 1

batch_size int

Number of samples per batch. Default: 192

test_batch_size int

Number of samples per batch for loading validation and test data. Default: 2048

preload_val bool

Whether to dump the validation set with numpy.savez_compressed and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. Default: False

preload_test bool

Whether to dump the test set with numpy.savez_compressed and preload it in future runs. Default: False

train_size int | Literal['all']

Size of the train set. See instructions. Default: all

val_known_size int | Literal['all']

Size of the validation set. See instructions. Default: all

test_known_size int | Literal['all']

Size of the test set. See instructions. Default: all

val_unknown_size int | Literal['all']

Size of the unknown classes validation set. Use for evaluation in the open-world setting. Default: 0

test_unknown_size int | Literal['all']

Size of the unknown classes test set. Use for evaluation in the open-world setting. Default: 0

train_dataloader_order DataLoaderOrder

Whether to load train data in sequential or random order. Default: RANDOM

train_dataloader_seed Optional[int]

Seed for loading train data in random order. Default: None

return_other_fields bool

Whether to return auxiliary fields, such as communicating hosts, flow times, and more fields extracted from the ClientHello message. Default: False

return_tensors bool

Use for returning torch.Tensor from dataloaders. Dataframes are not available when this option is used. Default: False

use_packet_histograms bool

Whether to use packet histogram features, if available in the dataset. Default: False

use_tcp_features bool

Whether to use TCP features, if available in the dataset. Default: False

use_push_flags bool

Whether to use push flags in packet sequences, if available in the dataset. Default: False

fit_scalers_samples int | float

Used when scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. Default: 0.25

ppi_transform Optional[Callable]

Transform function for PPI sequences. See the transforms page for more information. Default: None

flowstats_transform Optional[Callable]

Transform function for flow statistics. See the transforms page for more information. Default: None

flowstats_phist_transform Optional[Callable]

Transform function for packet histograms. See the transforms page for more information. Default: None

How to configure train, validation, and test sets

There are three options for how to define train/validation/test dates.

  1. Choose a predefined time period (train_period_name, val_period_name, or test_period_name) available in dataset.time_periods and leave the list of dates (train_dates, val_dates, or test_dates) empty.
  2. Provide a list of dates and a name for the time period. The dates are checked against dataset.available_dates.
  3. Do not specify anything and use the dataset's defaults dataset.default_train_period_name and dataset.default_test_period_name.
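The three options can be summarized as a small resolution function. The sketch below is self-contained and only mirrors the logic of the options listed above; `resolve_dates` is a hypothetical helper, and the period names and dates in the example are made up for illustration.

```python
from typing import Dict, List, Tuple


def resolve_dates(
    period_name: str,
    dates: List[str],
    time_periods: Dict[str, List[str]],
    available_dates: List[str],
    default_period_name: str,
) -> Tuple[str, List[str]]:
    """Return the (period name, dates) pair selected by the three options."""
    if dates and not period_name:
        raise ValueError("a period name has to be specified when dates are set")
    if dates:
        # Option 2: explicit dates with a custom name, checked for availability.
        bad_dates = [d for d in dates if d not in available_dates]
        if bad_dates:
            raise ValueError(f"Bad dates {bad_dates}")
        return period_name, dates
    if period_name:
        # Option 1: a predefined time period, with the list of dates left empty.
        if period_name not in time_periods:
            raise ValueError(f"Unknown period name {period_name}")
        return period_name, time_periods[period_name]
    # Option 3: nothing specified, fall back to the default period.
    return default_period_name, time_periods[default_period_name]


# Made-up example data in the YYYYMMDD format.
toy_periods = {"W-44": ["20221031", "20221101"], "W-45": ["20221107"]}
toy_available = ["20221031", "20221101", "20221107"]

print(resolve_dates("W-45", [], toy_periods, toy_available, "W-44"))
print(resolve_dates("custom", ["20221101"], toy_periods, toy_available, "W-44"))
print(resolve_dates("", [], toy_periods, toy_available, "W-44"))
```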

There are two options for configuring sizes of train/validation/test sets.

  1. Select an appropriate dataset size (default is S) when creating the CesnetDataset instance and leave train_size, val_known_size, and test_known_size with their default all value. This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).
  2. Provide exact sizes in train_size, val_known_size, and test_known_size. This will create train/validation/test sets of the given sizes by taking a random subset. This is especially useful when you use the ORIG dataset size and want to control the size of experiments.
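The difference between the default all value and an exact size can be sketched as follows. This is an illustration only, assuming a seeded random subset over sample indices; `select_subset` is a hypothetical helper, and how the library behaves when the requested size exceeds the available samples is not covered by this page (the sketch simply raises).

```python
import random
from typing import List, Union


def select_subset(indices: List[int], size: Union[int, str], seed: int) -> List[int]:
    """Return all indices for "all", otherwise a seeded random subset."""
    if size == "all":
        return list(indices)
    if size > len(indices):
        # Assumption of this sketch: requesting too many samples is an error.
        raise ValueError("requested size exceeds the number of available samples")
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    return sorted(rng.sample(indices, size))


all_indices = list(range(1000))
subset_a = select_subset(all_indices, 100, seed=420)
subset_b = select_subset(all_indices, 100, seed=420)
everything = select_subset(all_indices, "all", seed=420)
print(len(subset_a), subset_a == subset_b, len(everything))
```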

Tip

The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See ValidationApproach.
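A random split of train data into train and validation parts can be sketched with the documented defaults of train_val_split_fraction (0.2) and random_state (420). This is a simplified illustration, not the library's code: `split_train_val` is a hypothetical helper, and the real split may differ in details such as stratification.

```python
import random
from typing import List, Tuple


def split_train_val(
    indices: List[int], val_fraction: float = 0.2, seed: int = 420
) -> Tuple[List[int], List[int]]:
    """Shuffle indices with a fixed seed and cut off the validation part."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


train_idx, val_idx = split_train_val(list(range(100)))
print(len(train_idx), len(val_idx))
```

With the second approach (separate validation dates), no such split is needed because validation samples come from their own time period.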

Source code in cesnet_datazoo\config.py
@dataclass(config=C)
class DatasetConfig():
    """
    The main class for the configuration of:

    - Train, validation, test sets (dates, sizes, validation approach).
    - Application selection — either the standard closed-world setting (only *known* classes) or the open-world setting (*known* and *unknown* classes).
    - Data transformations. See the [transforms][transforms] page for more information.
    - Dataloader options like batch sizes, order of loading, or number of workers.

    When initializing this class, pass a [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance to be configured and the desired configuration. Available options are [here][config.DatasetConfig--configuration-options].

    Attributes:
        dataset: The dataset instance to be configured.
        data_root: Taken from the dataset instance.
        database_filename: Taken from the dataset instance.
        database_path: Taken from the dataset instance.
        servicemap_path: Taken from the dataset instance.
        flowstats_features: Taken from `dataset.metadata.flowstats_features`.
        flowstats_features_boolean: Taken from `dataset.metadata.flowstats_features_boolean`.
        flowstats_features_phist: Taken from `dataset.metadata.packet_histograms` if `use_packet_histograms` is true, otherwise an empty list.
        other_fields: Taken from `dataset.metadata.other_fields` if `return_other_fields` is true, otherwise an empty list.
        sni_column: Database column with SNI domains, can be None for datasets without SNI domains.

    # Configuration options

    Attributes:
        need_train_set: Use to disable the train set. `Default: True`
        need_val_set: Use to disable the validation set. `Default: True`
        need_test_set: Use to disable the test set. `Default: True`
        train_period_name: Name of the train period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].
        train_dates: Dates used for creating a train set.
        train_dates_weigths: To use a non-uniform distribution of samples across train dates.
        val_approach: How a validation set should be created. Either split train data into train and validation or have a separate validation period. `Default: SPLIT_FROM_TRAIN`
        train_val_split_fraction: The fraction of validation samples when splitting from the train set. `Default: 0.2`
        val_period_name: Name of the validation period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].
        val_dates: Dates used for creating a validation set.
        test_period_name: Name of the test period. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets].
        test_dates: Dates used for creating a test set.

        apps_selection: How to select application classes. `Default: ALL_KNOWN`
        apps_selection_topx: Take top X as known.
        apps_selection_background_unknown: Provide a list of background traffic classes to be used as unknown.
        apps_selection_fixed_known: Provide a list of manually selected known applications.
        apps_selection_fixed_unknown: Provide a list of manually selected unknown applications.
        disabled_apps: List of applications to be disabled and not used at all.
        min_train_samples_check: How to handle applications with *not enough* training samples. `Default: DISABLE_APPS`
        min_train_samples_per_app: Defines the threshold for *not enough*. `Default: 100`

        random_state: Fix all random processes performed during dataset initialization. `Default: 420`
        fold_id: To perform N-fold cross-validation, set this to `1..N`. Each fold will use the same configuration but a different random seed. `Default: 0`
        train_workers: Number of workers for loading train data. `0` means that the data will be loaded in the main process. `Default: 4`
        test_workers: Number of workers for loading test data. `0` means that the data will be loaded in the main process. `Default: 1`
        val_workers: Number of workers for loading validation data. `0` means that the data will be loaded in the main process. `Default: 1`
        batch_size: Number of samples per batch. `Default: 192`
        test_batch_size: Number of samples per batch for loading validation and test data. `Default: 2048`
        preload_val: Whether to dump the validation set with `numpy.savez_compressed` and preload it in future runs. Useful when running a lot of experiments with the same dataset configuration. `Default: False`
        preload_test: Whether to dump the test set with `numpy.savez_compressed` and preload it in future runs. `Default: False`
        train_size: Size of the train set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`
        val_known_size: Size of the validation set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`
        test_known_size: Size of the test set. See [instructions][config.DatasetConfig--how-to-configure-train-validation-and-test-sets]. `Default: all`
        val_unknown_size: Size of the unknown classes validation set. Use for evaluation in the open-world setting. `Default: 0`
        test_unknown_size: Size of the unknown classes test set. Use for evaluation in the open-world setting. `Default: 0`
        train_dataloader_order: Whether to load train data in sequential or random order. `Default: RANDOM`
        train_dataloader_seed: Seed for loading train data in random order. `Default: None`

        return_other_fields: Whether to return [auxiliary fields][other-fields], such as communicating hosts, flow times, and more fields extracted from the ClientHello message. `Default: False`
        return_tensors: Use for returning `torch.Tensor` from dataloaders. Dataframes are not available when this option is used. `Default: False`
        use_packet_histograms: Whether to use packet histogram features, if available in the dataset. `Default: False`
        use_tcp_features: Whether to use TCP features, if available in the dataset. `Default: False`
        use_push_flags: Whether to use push flags in packet sequences, if available in the dataset. `Default: False`
        fit_scalers_samples: Used when scaling transformation is configured and requires fitting. Fraction of train samples used for fitting, if float. The absolute number of samples otherwise. `Default: 0.25`
        ppi_transform: Transform function for PPI sequences. See the [transforms][transforms] page for more information. `Default: None`
        flowstats_transform: Transform function for flow statistics. See the [transforms][transforms] page for more information. `Default: None`
        flowstats_phist_transform: Transform function for packet histograms. See the [transforms][transforms] page for more information. `Default: None`

    # How to configure train, validation, and test sets
    There are three options for how to define train/validation/test dates.

    1. Choose a predefined time period (`train_period_name`, `val_period_name`, or `test_period_name`) available in `dataset.time_periods` and leave the list of dates (`train_dates`, `val_dates`, or `test_dates`) empty.
    2. Provide a list of dates and a name for the time period. The dates are checked against `dataset.available_dates`.
    3. Do not specify anything and use the dataset's defaults `dataset.default_train_period_name` and `dataset.default_test_period_name`.

    There are two options for configuring sizes of train/validation/test sets.

    1. Select an appropriate dataset size (default is `S`) when creating the [`CesnetDataset`][datasets.cesnet_dataset.CesnetDataset] instance and leave `train_size`, `val_known_size`, and `test_known_size` with their default `all` value.
    This will create train/validation/test sets with all samples available in the selected dataset size (of course, depending on the selected dates and validation approach).
    2. Provide exact sizes in `train_size`, `val_known_size`, and `test_known_size`. This will create train/validation/test sets of the given sizes by taking a random subset.
    This is especially useful when you use the `ORIG` dataset size and want to control the size of experiments.

    !!! tip Validation set
        The default approach for creating a validation set is to randomly split the train data into train and validation. The second approach is to define separate validation dates. See [ValidationApproach][config.ValidationApproach].

    """
    dataset: InitVar[CesnetDataset]
    data_root: str = field(init=False)
    database_filename: str = field(init=False)
    database_path: str = field(init=False)
    servicemap_path: str = field(init=False)
    flowstats_features: list[str] = field(init=False)
    flowstats_features_boolean: list[str] = field(init=False)
    flowstats_features_phist: list[str] = field(init=False)
    other_fields: list[str] = field(init=False)

    need_train_set: bool = True
    need_val_set: bool = True
    need_test_set: bool = True
    train_period_name: str = ""
    train_dates: list[str] = field(default_factory=list)
    train_dates_weigths: Optional[list[int]] = None
    val_approach: ValidationApproach = ValidationApproach.SPLIT_FROM_TRAIN
    train_val_split_fraction: float = 0.2
    val_period_name: str = ""
    val_dates: list[str] = field(default_factory=list)
    test_period_name: str = ""
    test_dates: list[str] = field(default_factory=list)

    apps_selection: AppSelection = AppSelection.ALL_KNOWN
    apps_selection_topx: int = 0
    apps_selection_background_unknown: list[str] = field(default_factory=list)
    apps_selection_fixed_known: list[str] = field(default_factory=list)
    apps_selection_fixed_unknown: list[str] = field(default_factory=list)
    disabled_apps: list[str] = field(default_factory=list)
    min_train_samples_check: MinTrainSamplesCheck = MinTrainSamplesCheck.DISABLE_APPS
    min_train_samples_per_app: int = 100

    random_state: int = 420
    fold_id: int = 0
    train_workers: int = 4
    test_workers: int = 1
    val_workers: int = 1
    batch_size: int = 192
    test_batch_size: int = 2048
    preload_val: bool = False
    preload_test: bool = False
    train_size: int | Literal["all"] = "all"
    val_known_size: int | Literal["all"] = "all"
    test_known_size: int | Literal["all"] = "all"
    val_unknown_size: int | Literal["all"] = 0
    test_unknown_size: int | Literal["all"] = 0
    train_dataloader_order: DataLoaderOrder = DataLoaderOrder.RANDOM
    train_dataloader_seed: Optional[int] = None

    return_other_fields: bool = False
    return_tensors: bool = False
    use_packet_histograms: bool = False
    use_tcp_features: bool = False
    use_push_flags: bool = False
    fit_scalers_samples: int | float = 0.25
    ppi_transform: Optional[Callable] = None
    flowstats_transform: Optional[Callable] = None
    flowstats_phist_transform: Optional[Callable] = None

    def __post_init__(self, dataset: CesnetDataset):
        """
        Ensures valid configuration. Catches all incompatible options and raises exceptions as soon as possible.
        """
        self.data_root = dataset.data_root
        self.servicemap_path = dataset.servicemap_path
        self.database_filename = dataset.database_filename
        self.database_path = dataset.database_path

        if not self.need_train_set:
            if self.apps_selection != AppSelection.FIXED:
                raise ValueError("Application selection has to be fixed when need_train_set is false")
            if (len(self.train_dates) > 0 or self.train_period_name != ""):
                raise ValueError("train_dates and train_period_name cannot be specified when need_train_set is false")
        else:
            # Configure train dates
            if len(self.train_dates) > 0 and self.train_period_name == "":
                raise ValueError("train_period_name has to be specified when train_dates are set")
            if len(self.train_dates) == 0 and self.train_period_name != "":
                if self.train_period_name not in dataset.time_periods:
                    raise ValueError(f"Unknown train_period_name {self.train_period_name}. Use time period available in dataset.time_periods")
                self.train_dates = dataset.time_periods[self.train_period_name]
            if len(self.train_dates) == 0 and self.train_period_name == "":
                self.train_period_name = dataset.default_train_period_name
                self.train_dates = dataset.time_periods[dataset.default_train_period_name]
        # Configure test dates
        if not self.need_test_set:
            if (len(self.test_dates) > 0 or self.test_period_name != ""):
                raise ValueError("test_dates and test_period_name cannot be specified when need_test_set is false")
        else:
            if len(self.test_dates) > 0 and self.test_period_name == "":
                raise ValueError("test_period_name has to be specified when test_dates are set")
            if len(self.test_dates) == 0 and self.test_period_name != "":
                if self.test_period_name not in dataset.time_periods:
                    raise ValueError(f"Unknown test_period_name {self.test_period_name}. Use time period available in dataset.time_periods")
                self.test_dates = dataset.time_periods[self.test_period_name]
            if len(self.test_dates) == 0 and self.test_period_name == "":
                self.test_period_name = dataset.default_test_period_name
                self.test_dates = dataset.time_periods[dataset.default_test_period_name]
        # Configure val dates
        if not self.need_val_set:
            if len(self.val_dates) > 0 or self.val_period_name != "" or self.val_approach != ValidationApproach.SPLIT_FROM_TRAIN:
                raise ValueError("val_dates, val_period_name, and val_approach cannot be specified when need_val_set is false")
        else:
            if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:
                if len(self.val_dates) > 0 or self.val_period_name != "":
                    raise ValueError("val_dates and val_period_name cannot be specified when the validation approach is split-from-train")
                if not self.need_train_set:
                    raise ValueError("Cannot use the split-from-train validation approach when need_train_set is false. Either use the validation-dates approach or set need_val_set to false.")
            elif self.val_approach == ValidationApproach.VALIDATION_DATES:
                if len(self.val_dates) > 0 and self.val_period_name == "":
                    raise ValueError("val_period_name has to be specified when val_dates are set")
                if len(self.val_dates) == 0 and self.val_period_name != "":
                    if self.val_period_name not in dataset.time_periods:
                        raise ValueError(f"Unknown val_period_name {self.val_period_name}. Use time period available in dataset.time_periods")
                    self.val_dates = dataset.time_periods[self.val_period_name]
                if len(self.val_dates) == 0 and self.val_period_name == "":
                    raise ValueError("val_period_name and val_dates (or val_period_name from dataset.time_periods) have to be specified when the validation approach is validation-dates")
        # Check if train, val, and test dates are available in the dataset
        bad_train_dates = [t for t in self.train_dates if t not in dataset.available_dates]
        bad_val_dates = [t for t in self.val_dates if t not in dataset.available_dates]
        bad_test_dates = [t for t in self.test_dates if t not in dataset.available_dates]
        if len(bad_train_dates) > 0:
            raise ValueError(f"Bad train dates {bad_train_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})" \
                            + (f". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}" if dataset.metadata.missing_dates_in_collection_period else ""))
        if len(bad_val_dates) > 0:
            raise ValueError(f"Bad validation dates {bad_val_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})" \
                            + (f". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}" if dataset.metadata.missing_dates_in_collection_period else ""))
        if len(bad_test_dates) > 0:
            raise ValueError(f"Bad test dates {bad_test_dates}. Use dates available in dataset.available_dates (collection period {dataset.metadata.collection_period})" \
                            + (f". These dates are missing from the dataset collection period {dataset.metadata.missing_dates_in_collection_period}" if dataset.metadata.missing_dates_in_collection_period else ""))
        # Check time order of train, val, and test periods
        train_dates = [datetime.strptime(date_str, "%Y%m%d").date() for date_str in self.train_dates]
        test_dates = [datetime.strptime(date_str, "%Y%m%d").date() for date_str in self.test_dates]
        if len(train_dates) > 0 and len(test_dates) > 0 and min(test_dates) <= max(train_dates):
            warnings.warn(f"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.")
        if self.val_approach == ValidationApproach.VALIDATION_DATES:
            val_dates = [datetime.strptime(date_str, "%Y%m%d").date() for date_str in self.val_dates]
            if len(train_dates) > 0 and min(val_dates) <= max(train_dates):
                warnings.warn(f"Some validation dates ({min(val_dates).strftime('%Y%m%d')}) are before or equal to the last train date ({max(train_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.")
            if len(test_dates) > 0 and min(test_dates) <= max(val_dates):
                warnings.warn(f"Some test dates ({min(test_dates).strftime('%Y%m%d')}) are before or equal to the last validation date ({max(val_dates).strftime('%Y%m%d')}). This might lead to improper evaluation and should be avoided.")
        # Configure features
        self.flowstats_features = dataset.metadata.flowstats_features
        self.flowstats_features_boolean = dataset.metadata.flowstats_features_boolean
        sni_column = TLS_SNI_COLUMN if dataset.metadata.protocol == Protocol.TLS else QUIC_SNI_COLUMN
        self.sni_column = sni_column if sni_column in dataset.metadata.other_fields else None
        self.other_fields = dataset.metadata.other_fields if self.return_other_fields else []
        if self.use_packet_histograms:
            if len(dataset.metadata.packet_histograms) == 0:
                raise ValueError("This dataset does not support use_packet_histograms")
            self.flowstats_features_phist = dataset.metadata.packet_histograms
        else:
            self.flowstats_features_phist = []
            if self.flowstats_phist_transform is not None:
                raise ValueError("flowstats_phist_transform cannot be specified when use_packet_histograms is false")
        if dataset.metadata.protocol == Protocol.TLS:
            if self.use_tcp_features:
                self.flowstats_features_boolean = self.flowstats_features_boolean + SELECTED_TCP_FLAGS
            if self.use_push_flags and "PUSH_FLAG" not in dataset.metadata.ppi_features:
                raise ValueError("This TLS dataset does not support use_push_flags")
        if dataset.metadata.protocol == Protocol.QUIC:
            if self.use_tcp_features:
                raise ValueError("QUIC datasets do not support use_tcp_features")
            if self.use_push_flags:
                raise ValueError("QUIC datasets do not support use_push_flags")
        # When train_dates_weigths are used, train_size and val_known_size have to be specified
        if self.train_dates_weigths is not None:
            if not self.need_train_set:
                raise ValueError("train_dates_weigths cannot be specified when need_train_set is false")
            if len(self.train_dates_weigths) != len(self.train_dates):
                raise ValueError("train_dates_weigths has to have the same length as train_dates")
            if self.train_size == "all":
                raise ValueError("train_size cannot be 'all' when train_dates_weigths are specified")
            if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN and self.val_known_size == "all":
                raise ValueError("val_known_size cannot be 'all' when train_dates_weigths are specified and the validation approach is split-from-train")
        # App selection
        if self.apps_selection == AppSelection.ALL_KNOWN:
            self.val_unknown_size = 0
            self.test_unknown_size = 0
            if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) > 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:
                raise ValueError("apps_selection_topx, apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is all-known")
        if self.apps_selection == AppSelection.TOPX_KNOWN:
            if self.apps_selection_topx == 0:
                raise ValueError("apps_selection_topx has to be greater than 0 when application selection is top-x-known")
            if len(self.apps_selection_background_unknown) > 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:
                raise ValueError("apps_selection_background_unknown, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is top-x-known")
        if self.apps_selection == AppSelection.BACKGROUND_UNKNOWN:
            if len(self.apps_selection_background_unknown) == 0:
                raise ValueError("apps_selection_background_unknown has to be specified when application selection is background-unknown")
            bad_apps = [a for a in self.apps_selection_background_unknown if a not in dataset.available_classes]
            if len(bad_apps) > 0:
                raise ValueError(f"Bad applications in apps_selection_background_unknown {bad_apps}. Use applications available in dataset.available_classes")
            if self.apps_selection_topx != 0 or len(self.apps_selection_fixed_known) > 0 or len(self.apps_selection_fixed_unknown) > 0:
                raise ValueError("apps_selection_topx, apps_selection_fixed_known, and apps_selection_fixed_unknown cannot be specified when application selection is background-unknown")
        if self.apps_selection == AppSelection.FIXED:
            if len(self.apps_selection_fixed_known) == 0:
                raise ValueError("apps_selection_fixed_known has to be specified when application selection is fixed")
            bad_apps = [a for a in self.apps_selection_fixed_known + self.apps_selection_fixed_unknown if a not in dataset.available_classes]
            if len(bad_apps) > 0:
                raise ValueError(f"Bad applications in apps_selection_fixed_known or apps_selection_fixed_unknown {bad_apps}. Use applications available in dataset.available_classes")
            if len(self.disabled_apps) > 0:
                raise ValueError("disabled_apps cannot be specified when application selection is fixed")
            if self.min_train_samples_per_app != 0 and self.min_train_samples_per_app != 100:
                warnings.warn("min_train_samples_per_app is not used when application selection is fixed")
            if self.apps_selection_topx != 0 or len(self.apps_selection_background_unknown) > 0:
                raise ValueError("apps_selection_topx and apps_selection_background_unknown cannot be specified when application selection is fixed")
        # More asserts
        bad_disabled_apps = [a for a in self.disabled_apps if a not in dataset.available_classes]
        if len(bad_disabled_apps) > 0:
            raise ValueError(f"Bad applications in disabled_apps {bad_disabled_apps}. Use applications available in dataset.available_classes")
        if isinstance(self.fit_scalers_samples, float) and (self.fit_scalers_samples <= 0 or self.fit_scalers_samples > 1):
            raise ValueError("fit_scalers_samples has to be either float between 0 and 1 (giving the fraction of training samples used for fitting scalers) or an integer")

    def get_flowstats_features_len(self) -> int:
        """Gets the number of flow statistics features."""
        return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)

    def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -> list[str]:
        """Gets names of flow statistics features. Packet histograms are expanded into bin features."""
        phist_mapping = {
            "PHIST_SRC_SIZES": [f"PSIZE_BIN{i}" for i in range(1, PHIST_BIN_COUNT + 1)],
            "PHIST_DST_SIZES": [f"PSIZE_BIN{i}_REV" for i in range(1, PHIST_BIN_COUNT + 1)],
            "PHIST_SRC_IPT": [f"IPT_BIN{i}" for i in range(1, PHIST_BIN_COUNT + 1)],
            "PHIST_DST_IPT": [f"IPT_BIN{i}_REV" for i in range(1, PHIST_BIN_COUNT + 1)],
        }
        short_names_mapping = {
            "FLOW_ENDREASON_IDLE": "FEND_IDLE",
            "FLOW_ENDREASON_ACTIVE": "FEND_ACTIVE",
            "FLOW_ENDREASON_END": "FEND_END",
            "FLOW_ENDREASON_OTHER": "FEND_OTHER",
            "FLAG_CWR": "F_CWR",
            "FLAG_CWR_REV": "F_CWR_REV",
            "FLAG_ECE": "F_ECE",
            "FLAG_ECE_REV": "F_ECE_REV",
            "FLAG_PSH_REV": "F_PSH_REV",
            "FLAG_RST": "F_RST",
            "FLAG_RST_REV": "F_RST_REV",
            "FLAG_FIN": "F_FIN",
            "FLAG_FIN_REV": "F_FIN_REV",
        }
        feature_names = self.flowstats_features[:]
        for f in self.flowstats_features_boolean:
            if shorter_names and f in short_names_mapping:
                feature_names.append(short_names_mapping[f])
            else:
                feature_names.append(f)
        for f in self.flowstats_features_phist:
            feature_names.extend(phist_mapping[f])
        assert len(feature_names) == self.get_flowstats_features_len()
        return feature_names

    def get_ppi_feature_names(self) -> list[str]:
        """Gets the names of flattened PPI features."""
        ppi_feature_names = [f"IPT_{i}" for i in range(1, PPI_MAX_LEN + 1)] + \
                               [f"DIR_{i}" for i in range(1, PPI_MAX_LEN + 1)] + \
                               [f"SIZE_{i}" for i in range(1, PPI_MAX_LEN + 1)]
        if self.use_push_flags:
            ppi_feature_names += [f"PUSH_{i}" for i in range(1, PPI_MAX_LEN + 1)]
        return ppi_feature_names

    def get_ppi_channels(self) -> list[int]:
        """Gets the available features (channels) in PPI sequences."""
        if self.use_push_flags:
            return TCP_PPI_CHANNELS
        else:
            return UDP_PPI_CHANNELS

    def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -> list[str]:
        """
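        Gets feature names.

        Parameters:
            flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.
            shorter_names: Whether to use shortened names for boolean flow statistics features (e.g., `F_RST` instead of `FLAG_RST`).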
        """
        feature_names = self.get_ppi_feature_names() if flatten_ppi else ["PPI"]
        feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)
        return feature_names

    def _get_train_tables_paths(self) -> list[str]:
        return list(map(lambda t: f"/flows/D{t}", self.train_dates))

    def _get_val_tables_paths(self) -> list[str]:
        if self.val_approach == ValidationApproach.SPLIT_FROM_TRAIN:
            return self._get_train_tables_paths()
        return list(map(lambda t: f"/flows/D{t}", self.val_dates))

    def _get_test_tables_paths(self) -> list[str]:
        return list(map(lambda t: f"/flows/D{t}", self.test_dates))

    def _get_train_data_hash(self) -> str:
        train_data_params = self._get_train_data_params()
        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(train_data_params), sort_keys=True, default=str).encode()).hexdigest()
        params_hash = params_hash[:10]
        return params_hash

    def _get_train_data_path(self) -> str:
        params_hash = self._get_train_data_hash()
        return os.path.join(self.data_root, "train-data", f"{params_hash}_{self.random_state}", f"fold_{self.fold_id}")

    def _get_train_data_params(self) -> TrainDataParams:
        return TrainDataParams(
            datazoo_version=version("cesnet_datazoo"),
            database_filename=self.database_filename,
            train_period_name=self.train_period_name,
            train_tables_paths=self._get_train_tables_paths(),
            apps_selection=self.apps_selection,
            apps_selection_topx=self.apps_selection_topx,
            apps_selection_background_unknown=self.apps_selection_background_unknown,
            apps_selection_fixed_known=self.apps_selection_fixed_known,
            apps_selection_fixed_unknown=self.apps_selection_fixed_unknown,
            disabled_apps=self.disabled_apps,
            min_train_samples_per_app=self.min_train_samples_per_app,
            min_train_samples_check=self.min_train_samples_check,)

    def _get_val_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -> tuple[TestDataParams, str]:
        assert self.val_approach == ValidationApproach.VALIDATION_DATES
        val_data_params = TestDataParams(
            datazoo_version=version("cesnet_datazoo"),
            database_filename=self.database_filename,
            test_period_name=self.val_period_name,
            test_tables_paths=self._get_val_tables_paths(),
            known_apps=known_apps,
            unknown_apps=unknown_apps,)
        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(val_data_params), sort_keys=True).encode()).hexdigest()
        params_hash = params_hash[:10]
        val_data_path = os.path.join(self.data_root, "val-data", f"{params_hash}_{self.random_state}")
        return val_data_params, val_data_path

    def _get_test_data_params_and_path(self, known_apps: list[str], unknown_apps: list[str]) -> tuple[TestDataParams, str]:
        test_data_params = TestDataParams(
            datazoo_version=version("cesnet_datazoo"),
            database_filename=self.database_filename,
            test_period_name=self.test_period_name,
            test_tables_paths=self._get_test_tables_paths(),
            known_apps=known_apps,
            unknown_apps=unknown_apps,)
        params_hash = hashlib.sha256(json.dumps(dataclasses.asdict(test_data_params), sort_keys=True).encode()).hexdigest()
        params_hash = params_hash[:10]
        test_data_path = os.path.join(self.data_root, "test-data", f"{params_hash}_{self.random_state}")
        return test_data_params, test_data_path

    @model_validator(mode="before") # type: ignore
    @classmethod
    def check_deprecated_args(cls, values):
        kwargs = values.kwargs
        if "train_period" in kwargs:
            warnings.warn("train_period is deprecated. Use train_period_name instead.")
            kwargs["train_period_name"] = kwargs["train_period"]
        if "val_period" in kwargs:
            warnings.warn("val_period is deprecated. Use val_period_name instead.")
            kwargs["val_period_name"] = kwargs["val_period"]
        if "test_period" in kwargs:
            warnings.warn("test_period is deprecated. Use test_period_name instead.")
            kwargs["test_period_name"] = kwargs["test_period"]
        return values

    def __str__(self):
        # Temporarily patch PyYAML so the dump contains no Python tags and no aliases.
        _process_tag = yaml.emitter.Emitter.process_tag
        _ignore_aliases = yaml.Dumper.ignore_aliases
        yaml.emitter.Emitter.process_tag = lambda self, *args, **kw: None
        yaml.Dumper.ignore_aliases = lambda self, *args, **kw: True
        try:
            s = yaml.dump(dataclasses.asdict(self), sort_keys=False)
        finally:
            # Restore the original methods even if dumping fails.
            yaml.emitter.Emitter.process_tag = _process_tag
            yaml.Dumper.ignore_aliases = _ignore_aliases
        return s
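
The private path helpers above name cache directories after a hash of the data parameters, so identical configurations map to the same directory. A minimal sketch of that pattern follows; `ToyParams`, `params_hash`, `cache_path`, and the sample values are hypothetical stand-ins for illustration, not part of the library.

```python
import dataclasses
import hashlib
import json
import os


@dataclasses.dataclass
class ToyParams:
    # Hypothetical, reduced stand-in for the library's TrainDataParams.
    database_filename: str
    train_tables_paths: list[str]
    apps_selection_topx: int


def params_hash(params) -> str:
    """Serialize the dataclass deterministically and keep the first 10 hex characters."""
    payload = json.dumps(dataclasses.asdict(params), sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()[:10]


def cache_path(data_root: str, params, random_state: int, fold_id: int) -> str:
    """Build a path shaped like the one returned by the train data path helper."""
    return os.path.join(data_root, "train-data", f"{params_hash(params)}_{random_state}", f"fold_{fold_id}")


a = ToyParams("db.h5", ["/flows/D20221001"], 0)
b = ToyParams("db.h5", ["/flows/D20221001"], 0)
c = ToyParams("db.h5", ["/flows/D20221001"], 10)
print(params_hash(a) == params_hash(b))  # equal parameters give an equal hash
print(params_hash(a) == params_hash(c))  # a changed parameter gives a different hash
print(cache_path("data", a, random_state=420, fold_id=0))
```

Sorting the keys before hashing keeps the digest independent of field order, which is what makes the directory name reproducible across runs.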

Functions

get_flowstats_features_len

get_flowstats_features_len() -> int

Gets the number of flow statistics features.

Source code in cesnet_datazoo\config.py
def get_flowstats_features_len(self) -> int:
    """Gets the number of flow statistics features."""
    return len(self.flowstats_features) + len(self.flowstats_features_boolean) + PHIST_BIN_COUNT * len(self.flowstats_features_phist)

get_flowstats_feature_names_expanded

get_flowstats_feature_names_expanded(
    shorter_names: bool = False,
) -> list[str]

Gets names of flow statistics features. Packet histograms are expanded into bin features.

Source code in cesnet_datazoo\config.py
def get_flowstats_feature_names_expanded(self, shorter_names: bool = False) -> list[str]:
    """Gets names of flow statistics features. Packet histograms are expanded into bin features."""
    phist_mapping = {
        "PHIST_SRC_SIZES": [f"PSIZE_BIN{i}" for i in range(1, PHIST_BIN_COUNT + 1)],
        "PHIST_DST_SIZES": [f"PSIZE_BIN{i}_REV" for i in range(1, PHIST_BIN_COUNT + 1)],
        "PHIST_SRC_IPT": [f"IPT_BIN{i}" for i in range(1, PHIST_BIN_COUNT + 1)],
        "PHIST_DST_IPT": [f"IPT_BIN{i}_REV" for i in range(1, PHIST_BIN_COUNT + 1)],
    }
    short_names_mapping = {
        "FLOW_ENDREASON_IDLE": "FEND_IDLE",
        "FLOW_ENDREASON_ACTIVE": "FEND_ACTIVE",
        "FLOW_ENDREASON_END": "FEND_END",
        "FLOW_ENDREASON_OTHER": "FEND_OTHER",
        "FLAG_CWR": "F_CWR",
        "FLAG_CWR_REV": "F_CWR_REV",
        "FLAG_ECE": "F_ECE",
        "FLAG_ECE_REV": "F_ECE_REV",
        "FLAG_PSH_REV": "F_PSH_REV",
        "FLAG_RST": "F_RST",
        "FLAG_RST_REV": "F_RST_REV",
        "FLAG_FIN": "F_FIN",
        "FLAG_FIN_REV": "F_FIN_REV",
    }
    feature_names = self.flowstats_features[:]
    for f in self.flowstats_features_boolean:
        if shorter_names and f in short_names_mapping:
            feature_names.append(short_names_mapping[f])
        else:
            feature_names.append(f)
    for f in self.flowstats_features_phist:
        feature_names.extend(phist_mapping[f])
    assert len(feature_names) == self.get_flowstats_features_len()
    return feature_names
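
To illustrate the expansion, the sketch below applies the same logic to a small feature list outside the class. The value `PHIST_BIN_COUNT = 8` and the scalar feature names `BYTES` and `PACKETS` are assumptions made for this example; the real values come from the library constants and the dataset metadata.

```python
PHIST_BIN_COUNT = 8  # assumed bin count, for illustration only

# Subsets of the mappings shown in the source above.
PHIST_MAPPING = {
    "PHIST_SRC_SIZES": [f"PSIZE_BIN{i}" for i in range(1, PHIST_BIN_COUNT + 1)],
    "PHIST_SRC_IPT": [f"IPT_BIN{i}" for i in range(1, PHIST_BIN_COUNT + 1)],
}
SHORT_NAMES = {"FLOW_ENDREASON_IDLE": "FEND_IDLE", "FLAG_RST": "F_RST"}


def expand_names(features, boolean_features, phist_features, shorter_names=False):
    """Scalar and boolean features keep one name each; every histogram expands into its bins."""
    names = list(features)
    for f in boolean_features:
        names.append(SHORT_NAMES.get(f, f) if shorter_names else f)
    for f in phist_features:
        names.extend(PHIST_MAPPING[f])
    return names


names = expand_names(
    features=["BYTES", "PACKETS"],  # hypothetical scalar features
    boolean_features=["FLAG_RST", "FLOW_ENDREASON_IDLE"],
    phist_features=["PHIST_SRC_SIZES"],
    shorter_names=True,
)
print(names[:5])   # ['BYTES', 'PACKETS', 'F_RST', 'FEND_IDLE', 'PSIZE_BIN1']
print(len(names))  # 2 + 2 + 8 * 1 = 12
```

The resulting length matches the formula used by get_flowstats_features_len: one entry per scalar or boolean feature plus one entry per histogram bin.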

get_ppi_feature_names

get_ppi_feature_names() -> list[str]

Gets the names of flattened PPI features.

Source code in cesnet_datazoo\config.py
def get_ppi_feature_names(self) -> list[str]:
    """Gets the names of flattened PPI features."""
    ppi_feature_names = [f"IPT_{i}" for i in range(1, PPI_MAX_LEN + 1)] + \
                           [f"DIR_{i}" for i in range(1, PPI_MAX_LEN + 1)] + \
                           [f"SIZE_{i}" for i in range(1, PPI_MAX_LEN + 1)]
    if self.use_push_flags:
        ppi_feature_names += [f"PUSH_{i}" for i in range(1, PPI_MAX_LEN + 1)]
    return ppi_feature_names
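
The flattened names are grouped by channel: all inter-packet times first, then directions, then sizes. The sketch below shows how these names would line up with a flattened sequence. It assumes `PPI_MAX_LEN = 30` and a channel-first layout (one row per channel, one column per packet); both are assumptions for illustration, and the toy values are made up.

```python
PPI_MAX_LEN = 30  # assumed maximum number of packets per sequence


def ppi_feature_names(use_push_flags=False, max_len=PPI_MAX_LEN):
    """Return names grouped by channel, in the same order as the method above."""
    channels = ["IPT", "DIR", "SIZE"] + (["PUSH"] if use_push_flags else [])
    return [f"{channel}_{i}" for channel in channels for i in range(1, max_len + 1)]


names = ppi_feature_names()

# Toy sequence stored channel-first: row 0 = IPT, row 1 = DIR, row 2 = SIZE.
ppi = [[0] * PPI_MAX_LEN, [1] * PPI_MAX_LEN, [100] * PPI_MAX_LEN]
flat = [value for row in ppi for value in row]  # row-major flattening
by_name = dict(zip(names, flat))

print(len(names))         # 3 * 30 = 90
print(by_name["SIZE_1"])  # 100
```

With push flags enabled, a fourth group of `PUSH_` names is appended, giving 4 * 30 = 120 names under the same assumption.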

get_ppi_channels

get_ppi_channels() -> list[int]

Gets the available features (channels) in PPI sequences.

Source code in cesnet_datazoo\config.py
def get_ppi_channels(self) -> list[int]:
    """Gets the available features (channels) in PPI sequences."""
    if self.use_push_flags:
        return TCP_PPI_CHANNELS
    else:
        return UDP_PPI_CHANNELS

get_feature_names

get_feature_names(
    flatten_ppi: bool = False, shorter_names: bool = False
) -> list[str]

Gets feature names.

Parameters:

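Name Type Description Default
flatten_ppi bool

Whether to flatten PPI into individual feature names or keep one PPI column.

False
shorter_names bool

Whether to use shortened names for boolean flow statistics features (e.g., F_RST instead of FLAG_RST).

False

Source code in cesnet_datazoo\config.py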
def get_feature_names(self, flatten_ppi: bool = False, shorter_names: bool = False) -> list[str]:
    """
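    Gets feature names.

    Parameters:
        flatten_ppi: Whether to flatten PPI into individual feature names or keep one `PPI` column.
        shorter_names: Whether to use shortened names for boolean flow statistics features (e.g., `F_RST` instead of `FLAG_RST`).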
    """
    feature_names = self.get_ppi_feature_names() if flatten_ppi else ["PPI"]
    feature_names += self.get_flowstats_feature_names_expanded(shorter_names=shorter_names)
    return feature_names

Enums for configuration

The following enums are used for dataset configuration.

config.ValidationApproach

The validation approach defines which samples should be used for creating a validation set.

SPLIT_FROM_TRAIN class-attribute instance-attribute
SPLIT_FROM_TRAIN = 'split-from-train'

Split train data into train and validation. Scikit-learn train_test_split is used to create a random stratified validation set. The fraction of validation samples is defined in train_val_split_fraction.
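
To show what a stratified split with a fixed validation fraction means, here is a minimal standard-library sketch. It is a stand-in written for illustration, not the library's code and not scikit-learn's train_test_split; the function name `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction, seed=0):
    """Return (train_idx, val_idx) so that each class keeps about val_fraction of its samples in validation."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random selection within each class
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = ["app-a"] * 80 + ["app-b"] * 20
train_idx, val_idx = stratified_split(labels, val_fraction=0.2)
print(len(train_idx), len(val_idx))                   # 80 20
print(sum(labels[i] == "app-b" for i in val_idx))     # 4
```

Because the split is done per class, the rare class keeps the same 80/20 proportion as the frequent one instead of being under-represented by chance.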

VALIDATION_DATES class-attribute instance-attribute
VALIDATION_DATES = 'validation-dates'

Use separate validation dates to create a validation set. Validation dates need to be specified in val_dates, and the name of the validation period in val_period_name.

config.AppSelection

Applications can be divided into known and unknown classes. To use a dataset in the standard closed-world setting, use ALL_KNOWN to select all the applications as known. Use TOPX_KNOWN or BACKGROUND_UNKNOWN for the open-world setting and evaluation of out-of-distribution or open-set recognition methods. FIXED is for manual selection of known and unknown applications.

ALL_KNOWN class-attribute instance-attribute
ALL_KNOWN = 'all-known'

Use all applications as known.

TOPX_KNOWN class-attribute instance-attribute
TOPX_KNOWN = 'topx-known'

Use the X most frequent applications (those with the most samples) as known, and the rest as unknown; X is set with apps_selection_topx. Applications with the same provider are never separated, i.e., all applications of a given provider are either known or unknown.

BACKGROUND_UNKNOWN class-attribute instance-attribute
BACKGROUND_UNKNOWN = 'background-unknown'

Use the list of background traffic classes (apps_selection_background_unknown) as unknown, and the rest as known.

FIXED class-attribute instance-attribute
FIXED = 'fixed'

Manual application selection. Provide lists of known applications (apps_selection_fixed_known) and unknown applications (apps_selection_fixed_unknown).
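
The four selection modes can be sketched on a toy table of per-application sample counts. This is a simplified illustration, not the library's implementation: the function `select_apps` and the application names are hypothetical, and the rule that applications of one provider stay together is deliberately left out.

```python
from collections import Counter


def select_apps(sample_counts, mode, topx=0, background=(), fixed_known=(), fixed_unknown=()):
    """Return (known, unknown) application lists for a dict of per-application sample counts."""
    # Applications ordered from the most to the least frequent.
    by_frequency = [app for app, _ in Counter(sample_counts).most_common()]
    if mode == "all-known":
        return by_frequency, []
    if mode == "topx-known":
        return by_frequency[:topx], by_frequency[topx:]
    if mode == "background-unknown":
        known = [app for app in by_frequency if app not in background]
        unknown = [app for app in by_frequency if app in background]
        return known, unknown
    if mode == "fixed":
        return list(fixed_known), list(fixed_unknown)
    raise ValueError(f"Unknown selection mode: {mode}")


counts = {"app-a": 500, "app-b": 300, "app-c": 200, "background-x": 50}
print(select_apps(counts, "topx-known", topx=2))
# (['app-a', 'app-b'], ['app-c', 'background-x'])
print(select_apps(counts, "background-unknown", background=["background-x"]))
# (['app-a', 'app-b', 'app-c'], ['background-x'])
```

In the closed-world mode the unknown list is empty, which corresponds to the validation code setting the unknown sizes to 0 for ALL_KNOWN.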

config.MinTrainSamplesCheck

Depending on the selected train dates, some applications might not have enough samples for training (how many samples are enough depends on the selected classification model). The threshold for the minimum number of samples is set with min_train_samples_per_app, and its default value is 100. With the DISABLE_APPS approach, applications below the threshold are disabled and not used for training or testing. With the WARN_AND_EXIT approach, the script prints a warning and exits if such applications are encountered. To disable this check, set min_train_samples_per_app to 0.

WARN_AND_EXIT class-attribute instance-attribute
WARN_AND_EXIT = 'warn-and-exit'

Warn and exit if there are not enough training samples for some applications. It is up to the user to manually add these applications to disabled_apps.

DISABLE_APPS class-attribute instance-attribute
DISABLE_APPS = 'disable-apps'

Disable applications with not enough training samples.
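
A minimal sketch of the two approaches on toy sample counts follows. The function `check_min_train_samples` is hypothetical, the comparison "fewer than the threshold" is an assumption, and the sketch raises an exception where the documented behavior is to warn and exit.

```python
def check_min_train_samples(train_counts, min_samples=100, check="warn-and-exit"):
    """Return (kept_apps, disabled_apps) for a dict of per-application train sample counts."""
    if min_samples == 0:  # a threshold of 0 disables the check
        return sorted(train_counts), []
    # Assumption: an application fails the check when it has fewer samples than the threshold.
    too_small = sorted(app for app, n in train_counts.items() if n < min_samples)
    if too_small and check == "warn-and-exit":
        # The documented behavior is to warn and exit; raising keeps this sketch testable.
        raise RuntimeError(f"Not enough train samples for: {too_small}")
    kept = sorted(app for app in train_counts if app not in too_small)
    return kept, too_small


counts = {"app-a": 500, "app-b": 99, "app-c": 100}
print(check_min_train_samples(counts, check="disable-apps"))
# (['app-a', 'app-c'], ['app-b'])
print(check_min_train_samples(counts, min_samples=0))
# (['app-a', 'app-b', 'app-c'], [])
```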

config.DataLoaderOrder

Validation and test sets are always loaded in sequential order, meaning in the order of dates and time. For the train set, however, it is sometimes necessary to iterate in random order (for example, when training a neural network). Use RANDOM if your classification model requires it and SEQUENTIAL otherwise. This setting affects only train_dataloader; the dataframe returned by get_train_df is always created in sequential order.

RANDOM class-attribute instance-attribute
RANDOM = 'random'

Iterate train data in random order.

SEQUENTIAL class-attribute instance-attribute
SEQUENTIAL = 'sequential'

Iterate train data in sequential (datetime) order.
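
The difference between the two orders can be shown on sample indices. This is only a sketch with a hypothetical helper `iteration_order`; it assumes the samples are already stored in datetime order and says nothing about how the library's dataloader batches them.

```python
import random


def iteration_order(num_samples, order, seed=0):
    """Return the order in which sample indices would be visited."""
    indices = list(range(num_samples))  # assumed to be sorted by date and time
    if order == "random":
        random.Random(seed).shuffle(indices)  # seeded shuffle for reproducibility
    elif order != "sequential":
        raise ValueError(f"Unknown order: {order}")
    return indices


print(iteration_order(5, "sequential"))  # [0, 1, 2, 3, 4]
shuffled = iteration_order(5, "random")
print(sorted(shuffled))                  # [0, 1, 2, 3, 4], same samples in a different order
```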