
Features

This page provides a description of individual data features in the datasets. Features available in each dataset are listed on the dataset metadata page.

PPI sequence

A per-packet information (PPI) sequence is a 2D matrix describing the first 30 packets of a flow. For flows shorter than 30 packets, the PPI sequence is padded with zeros. Set use_push_flags to include PUSH flags in PPI sequences, if available in the dataset.

Name Description
SIZE Size of the transport payload
IPT Inter-packet time in milliseconds. The IPT of the first packet is set to zero
DIR Direction of the packet encoded as ±1
PUSH_FLAG Whether the push flag was set in the TCP packet
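The zero-padding described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `build_ppi` is a hypothetical helper, and the row order (SIZE, IPT, DIR) simply follows the table and may not match the layout stored in the datasets.

```python
import numpy as np

PPI_MAX_LEN = 30  # a PPI sequence describes the first 30 packets of a flow


def build_ppi(sizes, ipts_ms, dirs, max_len=PPI_MAX_LEN):
    """Stack per-packet values into a zero-padded (3, max_len) matrix.

    Hypothetical helper; the row order (SIZE, IPT, DIR) is an assumption
    made for illustration only.
    """
    ppi = np.zeros((3, max_len), dtype=np.float64)
    n = min(len(sizes), max_len)  # packets beyond the first 30 are dropped
    ppi[0, :n] = sizes[:n]        # SIZE: transport payload size
    ppi[1, :n] = ipts_ms[:n]      # IPT: milliseconds, zero for the first packet
    ppi[2, :n] = dirs[:n]         # DIR: +1 or -1
    return ppi


# A three-packet flow: the remaining 27 columns stay zero.
ppi = build_ppi(sizes=[517, 1400, 93], ipts_ms=[0, 12, 3], dirs=[1, -1, 1])
```

Because DIR is always ±1 for real packets, a zero in the DIR row identifies a padded position.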

Flow statistics

Flow statistics are standard features describing the entire flow, with the exception of PPI_ features, which relate to the PPI sequence of the given flow. _REV features correspond to the reverse (server to client) direction.

Name Description
DURATION Duration of the flow in seconds
BYTES Number of transmitted bytes from client to server
BYTES_REV Number of transmitted bytes from server to client
PACKETS Number of packets transmitted from client to server
PACKETS_REV Number of packets transmitted from server to client
PPI_LEN Number of packets in the PPI sequence
PPI_DURATION Duration of the PPI sequence in seconds
PPI_ROUNDTRIPS Number of roundtrips in the PPI sequence
FLOW_ENDREASON_IDLE Flow was terminated because it was idle
FLOW_ENDREASON_ACTIVE Flow was terminated because it reached the active timeout
FLOW_ENDREASON_OTHER Flow was terminated for other reasons
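To illustrate how the PPI_ features relate to a PPI sequence, the sketch below derives PPI_LEN and PPI_DURATION from per-packet values. It is an assumption about how such values could be computed, not a description of how the datasets were produced: `ppi_stats` is a hypothetical helper, and it assumes that padding is trailing and that padded positions have DIR equal to zero.

```python
import numpy as np


def ppi_stats(ipts_ms, dirs):
    """Return (PPI_LEN, PPI_DURATION in seconds) for one zero-padded sequence.

    Hypothetical helper. Assumes trailing zero padding and that every real
    packet has DIR equal to +1 or -1.
    """
    ipts_ms = np.asarray(ipts_ms, dtype=float)
    dirs = np.asarray(dirs)
    ppi_len = int(np.count_nonzero(dirs))  # real packets have non-zero DIR
    # Inter-packet times are in milliseconds; the first one is zero.
    ppi_duration = float(ipts_ms[:ppi_len].sum()) / 1000.0
    return ppi_len, ppi_duration


ppi_len, ppi_duration = ppi_stats(ipts_ms=[0, 12, 3, 0, 0], dirs=[1, -1, 1, 0, 0])
```

For this example the sequence holds three packets spanning 15 ms, so the duration is 0.015 seconds.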

Packet histograms

Packet histograms contain binned counts of packet sizes and inter-packet times. There are 8 bins on a logarithmic scale, with the intervals 0–15, 16–31, 32–63, 64–127, 128–255, 256–511, 512–1024, and >1024. The units are milliseconds for inter-packet times and bytes for packet sizes. The histograms are built from all packets of the flow, unlike PPI sequences, which describe only the first 30 packets. Set use_packet_histograms to use packet histogram features, if available in the dataset.

Name Description
PSIZE_BIN{x} Packet sizes histogram x-th bin for the forward direction
PSIZE_BIN{x}_REV Packet sizes histogram x-th bin for the reverse direction
IPT_BIN{x} Inter-packet times histogram x-th bin for the forward direction
IPT_BIN{x}_REV Inter-packet times histogram x-th bin for the reverse direction

On the dataset metadata page, packet histogram features are called PHIST_SRC_SIZES, PHIST_DST_SIZES, PHIST_SRC_IPT, PHIST_DST_IPT. Those are the names of database columns that are flattened to the _BIN{x} features.
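The binning into the eight intervals listed above can be sketched with NumPy. This is an illustration under stated assumptions, not the exporter's code: `histogram_bins` and `BIN_EDGES` are hypothetical names, values are assumed to be non-negative integers, and the bins are indexed from zero here, which may differ from the x used in the _BIN{x} feature names.

```python
import numpy as np

# Lower edges of bins 1..7. The last edge is 1025 because the documented
# intervals are 512-1024 and >1024 (integer values assumed).
BIN_EDGES = np.array([16, 32, 64, 128, 256, 512, 1025])


def histogram_bins(values):
    """Count values (packet sizes in bytes or IPTs in ms) per bin."""
    idx = np.digitize(np.asarray(values), BIN_EDGES)  # bin index 0..7 per value
    return np.bincount(idx, minlength=8)


counts = histogram_bins([0, 15, 16, 100, 600, 1024, 1025, 1500])
```

In this example, 0 and 15 fall into the first bin, 600 and 1024 into the seventh, and 1025 and 1500 into the last one.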

TCP features

Datasets with TLS over TCP traffic contain features indicating the presence of individual TCP flags in the flow. Set use_tcp_features to use a subset of flags defined in cesnet_datazoo.constants.SELECTED_TCP_FLAGS.

Name Description
FLAG_{F} Whether F flag was present in the forward (client to server) direction
FLAG_{F}_REV Whether F flag was present in the reverse (server to client) direction

Other fields

Datasets contain auxiliary information about samples, such as communicating hosts, flow times, and additional fields extracted from the ClientHello message. The dataset metadata page lists the fields available in individual datasets. Set return_other_fields to include those fields in returned dataframes. See using dataloaders for how other fields are handled in dataloaders.

Name Description
ID Per-dataset unique flow identifier
TIME_FIRST Timestamp of the first packet
TIME_LAST Timestamp of the last packet
SRC_IP Source IP address
DST_IP Destination IP address
DST_ASN Destination Autonomous System number
SRC_PORT Source port
DST_PORT Destination port
PROTOCOL Transport protocol
TLS_SNI / QUIC_SNI Server Name Indication domain
TLS_JA3 JA3 fingerprint
QUIC_VERSION QUIC protocol version
QUIC_USER_AGENT User agent string if available in the QUIC Initial Packet

Details about packet histograms and PPI

Due to implementation differences between the packet sequence (pstats.cpp) and packet histogram (phist.cpp) plugins of the ipfixprobe exporter, the number of packets in PPI sequences and histograms can differ, even for flows shorter than 30 packets. The differences are summarized in the following table, which applies to TLS over TCP datasets.

TLS over TCP datasets | Packet histograms | PPI sequence | PACKETS and PACKETS_REV
Zero-length packets (without L4 payload, e.g. ACKs) | Not included | Not included | Included
Retransmissions (and out-of-order packets) | Included | Not included* | Included
Computed from | Entire flow | First 30 packets | Entire flow

*The implementation for the detection of TCP retransmissions and out-of-order packets is far from perfect. Packets with a non-increasing SEQ number are skipped.
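The counting rules in the table can be illustrated with a simplified, single-direction model. This is not the ipfixprobe implementation: `Packet` and `count_packets` are hypothetical names, and the retransmission check only mirrors the "non-increasing SEQ number" rule stated above.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int          # TCP sequence number
    payload_len: int  # L4 payload length in bytes


def count_packets(packets, ppi_max_len=30):
    """Count how many packets each feature group would see (simplified)."""
    total = len(packets)  # PACKETS counts everything, including bare ACKs
    with_payload = [p for p in packets if p.payload_len > 0]
    in_histograms = len(with_payload)  # zero-length excluded, retransmissions kept

    in_ppi = 0
    last_seq = None
    for p in with_payload:
        if last_seq is not None and p.seq <= last_seq:
            continue  # non-increasing SEQ: treated as retransmission, skipped
        last_seq = p.seq
        in_ppi += 1
        if in_ppi == ppi_max_len:
            break  # PPI covers only the first 30 packets
    return {"PACKETS": total, "histograms": in_histograms, "PPI": in_ppi}


flow = [
    Packet(seq=1, payload_len=0),    # zero-length packet
    Packet(seq=1, payload_len=100),
    Packet(seq=101, payload_len=200),
    Packet(seq=1, payload_len=100),  # retransmission
    Packet(seq=301, payload_len=0),  # bare ACK
    Packet(seq=301, payload_len=50),
]
result = count_packets(flow)
```

For this six-packet flow, the three counts differ even though the flow is far shorter than 30 packets: 6 in PACKETS, 4 in the histograms, and 3 in the PPI sequence.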

For QUIC, there is no detection of retransmissions or out-of-order packets, and QUIC acknowledgment packets are included in both packet sequences and packet histograms.