Adding a Dataset
Follow these three steps to contribute a dataset to the repository.
Step 1: Prepare Your Dataset Files
Your dataset must follow the directory structure below. Replace placeholder names (e.g. <dataset_name>, <aggregation_1>, <subset_1> ...) with your actual names - use underscores instead of spaces (e.g. my_dataset, 5_minutes, vpn ...).
<dataset_name>
├─ <subset_1>
│ ├─ <source_type_1>
│ │ ├─ identifiers.csv
│ │ ├─ <aggregation_1>
│ │ │ ├─ times.csv
│ │ │ ├─ main_data
│ │ │ │ ├─ ts_info.yaml
│ │ │ │ ├─ <folder_id_1>
│ │ │ │ │ ├─ <ts_id_1>
│ │ │ │ │ │ ├─ data_1.csv
│ │ │ │ │ │ ├─ data_2.csv
│ │ │ │ │ │ ├─ ...
│ │ │ │ │ │ ├─ data_V.csv
│ │ │ │ │ │ ├─ matrix_<feature_1>_1.txt
│ │ │ │ │ │ ├─ matrix_<feature_1>_2.txt
│ │ │ │ │ │ ├─ ...
│ │ │ │ │ │ ├─ matrix_<feature_1>_E.txt
│ │ │ │ │ │ ├─ matrix_<feature_2>_1.txt
│ │ │ │ │ │ ├─ ...
│ │ │ │ │ │ └─ matrix_<feature_T>_1.txt
│ │ │ │ │ ├─ <ts_id_2>
│ │ │ │ │ ├─ ...
│ │ │ │ │ └─ <ts_id_P>
│ │ │ │ ├─ <folder_id_2>
│ │ │ │ ├─ ...
│ │ │ │ └─ <folder_id_K>
│ │ │ └─ additional_data
│ │ │   ├─ <additional_data_1>
│ │ │   │ ├─ data.csv
│ │ │   │ ├─ data_info.yaml
│ │ │   │ ├─ matrix_<feature_1>.txt
│ │ │   │ ├─ matrix_<feature_2>.txt
│ │ │   │ ├─ ...
│ │ │   │ └─ matrix_<feature_T>.txt
│ │ │   ├─ <additional_data_2>
│ │ │   ├─ ...
│ │ │   └─ <additional_data_M>
│ │ ├─ <aggregation_2>
│ │ ├─ ...
│ │ └─ <aggregation_N>
│ ├─ <source_type_2>
│ ├─ ...
│ └─ <source_type_P>
├─ <subset_2>
├─ ...
└─ <subset_R>
Naming convention: Folder names for subsets, source types, and aggregation types are used for internal HDF5 naming. Do not use spaces - separate words with underscores (e.g. holidays_and_weekdays, 5_minutes).
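The naming rule can be checked mechanically before uploading. The following is a minimal sketch, assuming only the no-spaces rule stated above; `is_valid_name` and `normalize_name` are hypothetical helpers, not part of the repository.

```python
def is_valid_name(name: str) -> bool:
    """Return True if a folder name is non-empty and contains no whitespace."""
    return bool(name) and not any(ch.isspace() for ch in name)


def normalize_name(name: str) -> str:
    """Replace each run of whitespace with a single underscore."""
    return "_".join(name.split())


if __name__ == "__main__":
    for candidate in ["5_minutes", "holidays and weekdays"]:
        if not is_valid_name(candidate):
            print(f"{candidate!r} -> {normalize_name(candidate)!r}")
```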
File Specifications
identifiers.csv
Maps each time series to its folder. Required columns:
| Column | Type | Description |
|---|---|---|
| ts_id | integer | Unique time series ID |
| folder_id | integer | Folder the time series lives in |
IDs must match those used in main_data and additional_data.
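As an illustration, identifiers.csv could be produced with Python's standard csv module. This is a sketch under assumptions: `write_identifiers` is a hypothetical helper, and the mapping from ts_id to folder_id is supplied by you.

```python
import csv


def write_identifiers(path, ts_to_folder):
    """Write identifiers.csv with the required ts_id and folder_id columns.

    ts_to_folder maps each integer ts_id to the integer folder_id it lives in.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ts_id", "folder_id"])  # header row
        for ts_id, folder_id in sorted(ts_to_folder.items()):
            writer.writerow([ts_id, folder_id])
```

For example, `write_identifiers("identifiers.csv", {1: 0, 2: 0, 3: 1})` would place time series 1 and 2 in folder 0 and time series 3 in folder 1.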
times.csv
Maps time step IDs to UTC timestamps. Required columns:
| Column | Type | Description |
|---|---|---|
| id_time | integer | Unique time step ID |
| time | integer or null | UTC Unix epoch timestamp. Use None if unavailable (e.g. 5, None) |
IDs must match those used in main_data and additional_data.
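A sketch of how times.csv might be generated follows. It assumes timestamps are in whole seconds and that id_time starts at 0; neither is stated above, so adjust as needed. `to_epoch` and `write_times` are hypothetical helpers.

```python
import csv
from datetime import datetime, timezone


def to_epoch(dt):
    """Convert a datetime to integer Unix epoch seconds (assumed unit).

    Naive datetimes are treated as UTC; None is passed through unchanged.
    """
    if dt is None:
        return None
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def write_times(path, timestamps):
    """Write times.csv; unavailable timestamps are written as the text None."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id_time", "time"])
        for id_time, dt in enumerate(timestamps):  # ids start at 0 (assumption)
            epoch = to_epoch(dt)
            # csv.writer would emit an empty field for None, so write it explicitly.
            writer.writerow([id_time, "None" if epoch is None else epoch])
```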
main_data vs additional_data
- Files in main_data must be of time series nature.
- Files in additional_data can be of any nature.
- Folders <folder_id_1> in main_data are meant to help when there are too many time series to keep in one folder.
  - Always use them, even when you do not need them.
<ts_id> (one per time series, inside main_data)
Contains data for a single time series. The folder name must match the ts_id from identifiers.csv.
- The first column in the data CSV file must be id_time, referencing IDs from times.csv.
- Gaps in time are permitted.
- If the file is too large, split it across multiple numbered files (data_1.csv, data_2.csv, ...) in ascending time order (lower number = earlier times).
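The splitting rule above can be sketched as follows. `write_split` and its `max_rows` parameter are hypothetical, and repeating the header row in every part is an assumption rather than a documented requirement. Rows are expected to be sorted by id_time already.

```python
import csv
from pathlib import Path


def write_split(ts_dir, header, rows, max_rows):
    """Write rows into data_1.csv, data_2.csv, ... inside ts_dir.

    rows must already be in ascending id_time order, so that lower file
    numbers hold earlier times. Returns the list of written paths.
    """
    ts_dir = Path(ts_dir)
    ts_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    # max(..., 1) ensures one (header-only) file is written for empty input.
    for start in range(0, max(len(rows), 1), max_rows):
        path = ts_dir / f"data_{start // max_rows + 1}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)  # header repeated per part (assumption)
            writer.writerows(rows[start:start + max_rows])
        paths.append(path)
    return paths
```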
data_<num>.csv or data.csv
- These are the main files for holding data.
- For main_data files, always use the numbered version, even when no splitting is needed.
- Always include a header row with column names.
- No spaces in column names - use underscores (e.g. id_time, not id time).
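The header rules can be checked with a few lines of Python. This is only a sketch: `header_problems` is a hypothetical helper, and the `require_id_time` flag reflects the assumption that the id_time-first rule applies to main_data files only.

```python
import csv


def header_problems(csv_path, require_id_time=True):
    """Return a list of human-readable problems found in a CSV header row."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f), None)
    if not header:
        return ["missing header row"]
    problems = []
    if require_id_time and header[0] != "id_time":
        problems.append("first column must be id_time")
    for name in header:
        if not name or any(ch.isspace() for ch in name):
            problems.append(f"invalid column name: {name!r}")
    return problems
```

An empty list means the header passed these checks. The function cannot tell a missing header from a badly named one, so a file that starts directly with data is reported as having a wrong first column.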
ts_info.yaml
Specifies the data type for each column in main_data CSV files, in the same order as the columns appear. Example:
id_time:
  type: uint32
KPI:
  type: uint32
data_info.yaml
Same structure as ts_info.yaml, but describes the columns of additional_data CSV files.
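For simple (non-matrix) columns, both YAML files can be generated from an ordered mapping of column names to types. The sketch below writes the text by hand to avoid a dependency on a YAML library; `render_info_yaml` is a hypothetical helper, and the two-space indentation is an assumption.

```python
def render_info_yaml(column_types):
    """Render a column -> type mapping as ts_info.yaml / data_info.yaml text.

    The mapping's insertion order must match the CSV column order.
    """
    lines = []
    for column, dtype in column_types.items():
        lines.append(f"{column}:")
        lines.append(f"  type: {dtype}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_info_yaml({"id_time": "uint32", "KPI": "uint32"}), end="")
```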
Supported Data Types
| Type | Variants / Notes |
|---|---|
| float | float16, float32, float64 |
| int | int16, int32, int64 |
| uint | uint16, uint32, uint64 |
| str | Must specify max length: strN (e.g. str64) |
| bool | - |
| time | Unix epoch UTC |
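A type string can be checked against this table before submitting. The sketch below assumes that only the listed variants are accepted (so a bare float, int, uint, or str is rejected); that reading of the table is an assumption, and `is_supported_type` is a hypothetical helper.

```python
import re

# Fixed-name types taken from the table above.
_FIXED_TYPES = {
    "float16", "float32", "float64",
    "int16", "int32", "int64",
    "uint16", "uint32", "uint64",
    "bool", "time",
}
# strN with a positive maximum length, e.g. str64.
_STR_TYPE = re.compile(r"str[1-9][0-9]*")


def is_supported_type(name: str) -> bool:
    """Return True if name is one of the listed types or a valid strN."""
    return name in _FIXED_TYPES or _STR_TYPE.fullmatch(name) is not None
```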
Matrix Features
Some columns reference external matrix files. These must be declared in both the CSV and the YAML as a pair of entries - one for the row index ID, and one describing the matrix itself.
Naming convention:
- The index column in the CSV must be named id_matrix_<feature>.
- Corresponding matrix files must be named matrix_<feature>_<num>.txt.
YAML declaration (add to both ts_info.yaml and data_info.yaml as applicable):
id_matrix_<feature_1>:
type: uint32
matrix_<feature_1>:
matrix:
columns: 12
rows: 12
type: float64
Rules:
- The CSV (data.csv or time series data file) must include the id_matrix_<feature> index column for each matrix feature.
- Matrix files can also be split, following the same logic as for normal data.
- Matrices are stored flattened by rows: each line of a matrix file holds one matrix, with values separated by commas (i.e. one line contains rows * columns values).
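Row-wise flattening can be illustrated as follows. This is a minimal sketch with hypothetical helpers `flatten_matrix` and `parse_matrix_line`; it assumes one matrix per line and numeric values, and is not the loader used by the repository.

```python
def flatten_matrix(matrix):
    """Flatten a 2-D list row by row into one comma-separated line."""
    return ",".join(str(value) for row in matrix for value in row)


def parse_matrix_line(line, rows, columns):
    """Rebuild a rows x columns matrix of floats from one flattened line."""
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != rows * columns:
        raise ValueError(
            f"expected {rows * columns} values, found {len(values)}"
        )
    return [values[r * columns:(r + 1) * columns] for r in range(rows)]
```

With the 12 x 12 declaration shown in the YAML example, each line would therefore hold 144 values.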
Step 2: Create Documentation
Create two Markdown files describing your dataset:
- Overview entry - a short summary for inclusion in the datasets overview page. Use ./docs/datasets_overview.md as a reference for style and format.
- Standalone dataset page - a full description of your dataset. Use ./docs/cesnet_timeseries24.md as a reference for style and format.
Step 3: Open an Issue
Create a GitHub issue that includes:
- Any other relevant information about the dataset.
- A download link for the dataset files.
- Links to both Markdown documentation files created in Step 2.