Adding dataset

Follow these three steps to contribute a dataset to the repository.

Step 1: Prepare Your Dataset Files

Your dataset must follow the directory structure below. Replace placeholder names (e.g. <dataset_name>, <aggregation_1>, <subset_1> ...) with your actual names - use underscores instead of spaces (e.g. my_dataset, 5_minutes, vpn ...).

<dataset_name>
├─ <subset_1>
│  ├─ <source_type_1>
│  │  ├─ identifiers.csv
│  │  ├─ <aggregation_1>
│  │  │  ├─ times.csv
│  │  │  ├─ main_data
│  │  │  │  ├─ ts_info.yaml
│  │  │  │  ├─ <folder_id_1>
│  │  │  │  │  ├─ <ts_id_1>
│  │  │  │  │  │  ├─ data_1.csv
│  │  │  │  │  │  ├─ data_2.csv
│  │  │  │  │  │  ├─ ...
│  │  │  │  │  │  ├─ data_V.csv
│  │  │  │  │  │  ├─ matrix_<feature_1>_1.txt
│  │  │  │  │  │  ├─ matrix_<feature_1>_2.txt
│  │  │  │  │  │  ├─ ...
│  │  │  │  │  │  ├─ matrix_<feature_1>_E.txt
│  │  │  │  │  │  ├─ matrix_<feature_2>_1.txt
│  │  │  │  │  │  ├─ ...
│  │  │  │  │  │  └─ matrix_<feature_T>_1.txt
│  │  │  │  │  ├─ <ts_id_2>
│  │  │  │  │  ├─ ...
│  │  │  │  │  └─ <ts_id_P>
│  │  │  │  ├─ <folder_id_2>
│  │  │  │  ├─ ...
│  │  │  │  └─ <folder_id_K>
│  │  │  └─ additional_data
│  │  │     ├─ <additional_data_1>
│  │  │     │  ├─ data.csv
│  │  │     │  ├─ data_info.yaml
│  │  │     │  ├─ matrix_<feature_1>.txt
│  │  │     │  ├─ matrix_<feature_2>.txt
│  │  │     │  ├─ ...
│  │  │     │  └─ matrix_<feature_T>.txt
│  │  │     ├─ <additional_data_2>
│  │  │     ├─ ...
│  │  │     └─ <additional_data_M>
│  │  ├─ <aggregation_2>
│  │  ├─ ...
│  │  └─ <aggregation_N>
│  ├─ <source_type_2>
│  ├─ ...
│  └─ <source_type_P>
├─ <subset_2>
├─ ...
└─ <subset_R>

Naming convention: Folder names for subsets, source types, and aggregation types are used for internal HDF5 naming. Do not use spaces - separate words with underscores (e.g. holidays_and_weekdays, 5_minutes).

File Specifications

`identifiers.csv`

Maps each time series to its folder. Required columns:

Column	Type	Description
`ts_id`	integer	Unique time series ID
`folder_id`	integer	Folder the time series lives in

IDs must match those used in main_data and additional_data.

`times.csv`

Maps time step IDs to UTC timestamps. Required columns:

Column	Type	Description
`id_time`	integer	Unique time step ID
`time`	integer or null	UTC Unix epoch timestamp. Use `None` if unavailable (e.g. `5, None`)

IDs must match those used in main_data and additional_data.

`main_data` vs `additional_data`

Files in main_data must be of time series nature
Files in additional_data can be of any nature
Folders <folder_id_1> in main_data are meant to help when there is too much time series to have in one folder
- Always use them, even when you do not need them

`<ts_id>` (one per time series, inside `main_data`)

Contains data for a single time series. The folder name must match the ts_id from identifiers.csv.

First column in data csv file must be id_time, referencing IDs from times.csv.
Gaps in time are permitted.
If the file is too large, split it across multiple numbered files (data_1.csv, data_2.csv, ...) in ascending time order (lower number = earlier times).

`data_<num>.csv` or `data.csv`

Main files for holding data
For main_data files always use the numbered version even though no splitting was needed
Always include a header row with column names.
No spaces in column names - use underscores (e.g. id_time, not id time).

`ts_info.yaml`

Specifies the data type for each column in main_data CSV files, in the same order as the columns appear. Example:

id_time:
  type: uint32
KPI:
  type: uint32

`data_info.yaml`

Same structure as ts_info.yaml, but describes the columns of additional_data CSV files.

Supported Data Types

Type	Variants / Notes
`float`	`float16`, `float32`, `float64`
`int`	`int16`, `int32`, `int64`
`uint`	`uint16`, `uint32`, `uint64`
`str`	Must specify max length: `strN` (e.g. `str64`)
`bool`	-
`time`	Unix epoch UTC

Matrix Features

Some columns reference external matrix files. These must be declared in both the CSV and the YAML as a pair of entries - one for the row index ID, and one describing the matrix itself.

Naming convention: - The index column in the CSV must be named id_matrix_<feature>. - Corresponding matrix files must be named matrix_<feature>_<num>.txt.

YAML declaration (add to both ts_info.yaml and data_info.yaml as applicable):

id_matrix_<feature_1>:
  type: uint32

matrix_<feature_1>:
  matrix:
    columns: 12
    rows: 12
  type: float64

Rules:

The CSV (data.csv or time series data file) must include the id_matrix_<feature> index column for each matrix feature.
Matrix files can also be split, following the same logic as for normal data.
Matrices in their files are expected to be flattened by rows, where each value in row is separated by comma (i.e. one row == rows * columns)

Step 2: Create Documentation

Create two Markdown files describing your dataset:

Overview entry - a short summary for inclusion in the datasets overview page. Use ./docs/datasets_overview.md as a reference for style and format.
Standalone dataset page - a full description of your dataset. Use ./docs/cesnet_timeseries24.md as a reference for style and format.

Step 3: Open an Issue

Create a GitHub issue that includes:

Any other relevant information about the dataset.
A download link for the dataset files.
Links to both Markdown documentation files created in Step 2.

Adding dataset

Step 1: Prepare Your Dataset Files

File Specifications

identifiers.csv

times.csv

main_data vs additional_data

<ts_id> (one per time series, inside main_data)

data_<num>.csv or data.csv

ts_info.yaml

data_info.yaml