"Is our Ground-Truth for Traffic Classification Reliable?" Dataset
This dataset is derived from the papers:
Valentín Carela-Español, Tomasz Bujlow, and Pere Barlet-Ros: "Is Our Ground-Truth for Traffic Classification Reliable?", In Proc. of the Passive and Active Measurements Conference (PAM'14), Los Angeles, CA, USA, March 2014. [pdf] [doi]
Tomasz Bujlow, Valentín Carela-Español, and Pere Barlet-Ros: "Comparison of Deep Packet Inspection (DPI) tools for traffic classification", Technical Report, UPC-DAC-RR-CBA-2013-3, June 2013. [pdf]
ABSTRACT
The validation of the different proposals in the traffic classification literature is a controversial issue. Usually, these works base their results on a ground-truth built from private datasets and labeled by techniques of unknown reliability. This makes the validation and comparison with other solutions an extremely difficult task. This paper aims to be a first step towards addressing the validation and trustworthiness problem of network traffic classifiers. We perform a comparison between 6 well-known DPI-based techniques, which are frequently used in the literature for ground-truth generation. In order to evaluate these tools we have carefully built a labeled dataset of more than 500 000 flows, which contains traffic from popular applications. Our results present PACE, a commercial tool, as the most reliable solution for ground-truth generation. However, among the open-source tools available, NDPI and especially Libprotoident also achieve very high precision, while other, more frequently used tools (e.g., L7-Filter) are not reliable enough and should not be used for ground-truth generation in their current form.
DATASET
The dataset used in the paper "Is Our Ground-Truth for Traffic Classification Reliable?" consists of 1 262 022 flows captured over 66 days, between February 25, 2013 and May 1, 2013, which account for 35.69 GB of pure packet data. The dataset was built artificially so that it could be published with full packet payloads. To make it as representative as possible, we manually simulated different human behaviours for each application studied. The selected applications are shown below:
- Web browsers: based on w3schools statistics: Chrome and Firefox (W7, XP, LX), Internet Explorer (W7, XP).
- BitTorrent clients: based on CNET ranking: uTorrent and BitTorrent (W7, XP), Frostwire and Vuze (W7, XP, LX)
- eDonkey clients: based on CNET ranking: eMule (W7, XP), aMule (LX)
- FTP clients: based on CNET ranking: FileZilla (W7, XP, LX), SmartFTP Client (W7, XP), CuteFTP (W7, XP), WinSCP (W7, XP)
- Remote Desktop servers: built-in (W7, XP), xrdp (LX)
- SSH servers: sshd (LX)
- Background traffic: DNS and NTP (W7, XP, LX), NETBIOS (W7, XP)
The dataset consists of three pcap traces, one for each OS used (LX: Linux, W7: Windows 7, XP: Windows XP), and three INFO files, one for each pcap trace. Each line in the INFO file corresponds to a flow in the pcap trace and is described as follows:
flow_id + "#" + start_time + "#" + end_time + "#" + local_ip + "#" + remote_ip + "#" + local_port + "#" + remote_port + "#" + transport_protocol + "#" + operating_system + "#" + process_name + "#" + HTTP Url + "#" + HTTP Referer + "#" + HTTP Content-type + "#"
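The record layout above can be read with a few lines of Python. The following is a minimal sketch, not the authors' tooling: the names `FlowRecord` and `parse_info_line` are hypothetical, the example line is invented for illustration (including the `TCP` and `W7` field values and the numeric timestamps), and the parser assumes that no field contains a literal "#" and that ports are decimal integers. Timestamps are kept as strings because their exact format is not specified here.

```python
from typing import NamedTuple


class FlowRecord(NamedTuple):
    """One line of an INFO file (field names follow the format above)."""
    flow_id: str
    start_time: str          # kept as text: the timestamp format is not assumed
    end_time: str
    local_ip: str
    remote_ip: str
    local_port: int
    remote_port: int
    transport_protocol: str
    operating_system: str
    process_name: str        # empty when VBS could not establish the socket
    http_url: str
    http_referer: str
    http_content_type: str


def parse_info_line(line: str) -> FlowRecord:
    """Split a '#'-separated INFO line into a FlowRecord.

    Assumption: fields never contain '#'. A line that does not yield
    exactly 13 fields raises ValueError instead of being guessed at.
    """
    parts = line.rstrip("\r\n").split("#")
    # The format ends with a trailing '#', which yields one extra empty part.
    if len(parts) == 14 and parts[-1] == "":
        parts.pop()
    if len(parts) != 13:
        raise ValueError(f"expected 13 fields, got {len(parts)}")
    return FlowRecord(
        *parts[:5],
        int(parts[5]),
        int(parts[6]),
        *parts[7:],
    )


# Hypothetical example line, invented for illustration only.
example = ("17#1361788800#1361788805#10.0.0.5#93.184.216.34#49152#80#"
           "TCP#W7#chrome.exe#http://example.com/#http://example.com/start#"
           "text/html#")
record = parse_info_line(example)
print(record.process_name, record.http_content_type)
```

Raising on an unexpected field count is a deliberate choice in this sketch: a URL containing "#" would otherwise shift the later fields silently.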
The process name was present for 520 993 flows (41.28 % of all flows), which account for 32.33 GB (90.59 %) of the data volume. Additionally, 14 445 flows (1.14 % of all flows), accounting for 0.28 GB (0.78 %) of the data volume, could be identified based on the HTTP content-type field extracted from the packets. Therefore, we were able to establish the ground truth for 535 438 flows (42.43 % of all flows), accounting for 32.61 GB (91.37 %) of the data volume. The remaining flows are unlabeled because of their short lifetime (below 1 s), which made VBS, our ground-truth generator, unable to reliably establish the corresponding sockets. Only the successfully labeled flows are taken into account during the evaluation of the classifiers. However, all flows are included in the publicly available traces. This preserves data integrity and allows classifiers that rely on the coexistence of different flows to work properly. We isolated several application classes based on the information stored in the database (e.g., application labels, HTTP content-type field). The classes, together with the number of flows and the data volume, are shown in the following table:
Application   |  #Flows | #Megabytes
Edonkey       | 176 581 |   2 823.88
BitTorrent    |  62 845 |   2 621.37
FTP           |     876 |   3 089.06
DNS           |   6 600 |       1.74
NTP           |  27 786 |       4.03
RDP           | 132 907 |  13 218.47
NETBIOS       |   9 445 |       5.17
SSH           |  26 219 |      91.80
Browser HTTP  |  46 669 |   5 757.32
Browser RTMP  |     427 |   5 907.15
Unclassified  | 771 667 |   3 026.57
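As a quick consistency check, the per-class figures in the table can be summed and compared with the totals quoted for the whole dataset (1 262 022 flows and 35.69 GB). The sketch below simply transcribes the table; the only assumption is the conversion 1 GB = 1024 MB, which is not stated in the description but is the one under which the figures agree.

```python
# Per-class figures transcribed from the table: (flows, megabytes).
TABLE = {
    "Edonkey":      (176_581, 2_823.88),
    "BitTorrent":   (62_845, 2_621.37),
    "FTP":          (876, 3_089.06),
    "DNS":          (6_600, 1.74),
    "NTP":          (27_786, 4.03),
    "RDP":          (132_907, 13_218.47),
    "NETBIOS":      (9_445, 5.17),
    "SSH":          (26_219, 91.80),
    "Browser HTTP": (46_669, 5_757.32),
    "Browser RTMP": (427, 5_907.15),
    "Unclassified": (771_667, 3_026.57),
}

total_flows = sum(flows for flows, _ in TABLE.values())
total_mb = sum(mb for _, mb in TABLE.values())
total_gb = total_mb / 1024  # assumption: 1 GB = 1024 MB

print(total_flows)          # 1262022
print(f"{total_gb:.2f}")    # 35.69
```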
For a more detailed description of the dataset, we refer the reader to our paper and technical report cited above.
GROUND-TRUTH METHODOLOGY
To collect and accurately label the flows, we adapted the Volunteer-Based System (VBS) developed at Aalborg University. The task of VBS is to collect information about Internet traffic flows (i.e., start time of the flow, number of packets contained in the flow, local and remote IP addresses, local and remote ports, transport layer protocol) together with detailed information about each packet (i.e., direction, size, TCP flags, and timestamp relative to the previous packet in the flow). For each flow, the system also collects the process name associated with that flow. The process name is obtained from the system sockets, which lets us reliably identify the application that generated each flow. Additionally, the system collects some information about the HTTP content type (e.g., text/html, video/x-flv). The captured information is transmitted to the VBS server, which stores the data in a MySQL database. The source code was published under a GPL license.
The modified version of the VBS client captures full Ethernet frames for each packet and extracts the HTTP URL and Referer fields. We added a module called pcapBuilder, which is responsible for dumping the packets from the database to PCAP files. At the same time, INFO files are generated to provide detailed information about each flow, which allows us to assign each packet from the PCAP file to an individual flow.
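One way a user of the traces might assign packets to flows with the INFO records is sketched below. This is an illustrative approach, not the pcapBuilder implementation: the `Flow`, `Packet`, `build_index` and `assign_flow` names are hypothetical, packets are assumed to have already been decoded into plain tuples by some PCAP reader, and flow start and end times are assumed to be numeric and expressed in the same clock as the packet timestamps. A packet is matched to a flow when its addresses and ports agree in either direction and its timestamp falls within the flow's lifetime.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Flow(NamedTuple):
    flow_id: str
    start: float      # assumption: numeric, same clock as Packet.ts
    end: float
    local_ip: str
    remote_ip: str
    local_port: int
    remote_port: int
    proto: str


class Packet(NamedTuple):
    ts: float
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str


Key = Tuple[str, int, str, int, str]


def build_index(flows: Iterable[Flow]) -> Dict[Key, List[Flow]]:
    """Group flows by (local_ip, local_port, remote_ip, remote_port, proto)."""
    index: Dict[Key, List[Flow]] = defaultdict(list)
    for f in flows:
        key = (f.local_ip, f.local_port, f.remote_ip, f.remote_port, f.proto)
        index[key].append(f)
    return index


def assign_flow(pkt: Packet, index: Dict[Key, List[Flow]]) -> Optional[str]:
    """Return the flow_id whose endpoints and time span cover the packet."""
    outbound = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.proto)
    inbound = (pkt.dst_ip, pkt.dst_port, pkt.src_ip, pkt.src_port, pkt.proto)
    for key in (outbound, inbound):
        for f in index.get(key, ()):
            if f.start <= pkt.ts <= f.end:
                return f.flow_id
    return None


# Hypothetical data: two flows reuse the same endpoints at different times.
flows = [
    Flow("a", 0.0, 10.0, "10.0.0.5", "192.0.2.1", 49152, 80, "TCP"),
    Flow("b", 20.0, 30.0, "10.0.0.5", "192.0.2.1", 49152, 80, "TCP"),
]
index = build_index(flows)
reply = Packet(25.0, "192.0.2.1", "10.0.0.5", 80, 49152, "TCP")
print(assign_flow(reply, index))
```

The time-span check matters because the same address and port pair can be reused by several flows over a capture lasting many days; matching on the endpoints alone would merge them.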
REQUESTING THE TRACES
If you are interested in any of these labeled traces, send an email to: