Broadband Communications Systems and Architectures Research Group

CBA Website Archive (Until 2023)


Traffic Classification

The identification of applications in network traffic has become a prolific research topic in recent years. Traffic classification is crucial for classic network management tasks, such as traffic engineering and capacity planning. Traditional techniques that rely on transport-level protocol ports are no longer reliable, owing to the ever-changing nature of Internet traffic and applications and to the techniques that applications use to avoid detection (e.g., encryption, obfuscation). As a consequence, researchers have proposed a wide range of traffic classification solutions. However, although some proposals achieve high accuracy, the problem is far from being completely solved. The lack of shared tools and reference data makes it very difficult to compare and validate the proposed techniques, which in turn hinders a proper assessment of what has been achieved so far in this field.

Our group is involved in many research projects in the traffic classification field. Our research covers many aspects of this field, but we have special expertise in the following topics:

  • Machine Learning-based techniques
  • Deep Packet Inspection-based techniques
  • Traffic classification with sampled traffic (e.g., Sampled NetFlow)
  • Traffic classification in highly demanding networks (e.g., backbone networks)
  • Stream Machine Learning-based techniques
  • Ground-truth techniques 

 

DATASETS

Probably the biggest obstacle to comparing and validating the different techniques proposed for network traffic classification is the lack of publicly available datasets. Mainly because of privacy issues, researchers and practitioners are not allowed to share their datasets with the research community. In order to address, or at least mitigate, this problem, our group usually publishes the datasets used in its work. The publicly available datasets related to our work are described below. Of special note is the "Is our Ground-Truth for Traffic Classification Reliable?" dataset, which provides a set of reliably labeled pcap traces with full payload.

 

"Analysis of the impact of sampling on NetFlow traffic classification" Dataset

 This dataset is derived from the paper: 

Valentín Carela-Español, Pere Barlet-Ros, Albert Cabellos-Aparicio, and Josep Solé-Pareta: "Analysis of the impact of sampling on NetFlow traffic classification", Computer Networks 55 (2011), pp. 1083-1099. [pdf] [doi]

 

ABSTRACT

The traffic classification problem has recently attracted the interest of both network operators and researchers, given the limitations of traditional techniques when applied to current Internet traffic. Several machine learning (ML) methods have been proposed in the literature as a promising solution to this problem. However, very few can be applied to NetFlow data, while fewer works have analyzed their performance under traffic sampling. In this paper, we address the traffic classification problem with Sampled NetFlow, which is a widely extended protocol among network operators, but scarcely investigated by the research community. In particular, we adapt one of the most popular ML methods to operate with NetFlow data and analyze the impact of traffic sampling on its performance.

Our results show that our ML method is able to obtain similar accuracy to previous packet-based methods, but using only the limited information reported by NetFlow. Conversely, our results indicate that the accuracy of standard ML techniques degrades drastically with sampling. In order to reduce this impact, we propose an automatic ML process that does not rely on any human intervention and significantly improves the classification accuracy in the presence of traffic sampling.

 

DATASET

The evaluation dataset used in the paper "Analysis of the impact of sampling on NetFlow traffic classification" consists of seven traces collected at the Gigabit access link of the Universitat Politècnica de Catalunya (UPC), which connects about 25 faculties and 40 departments (geographically distributed in 10 campuses) to the Internet through the Spanish Research and Education network (RedIRIS). 

 

Name       Flows        Date        Time (duration)
UPC-I      2 985 098    11-12-08    10:00 (15 min.)
UPC-II     3 369 105    11-12-08    12:00 (15 min.)
UPC-III    3 474 603    12-12-08    16:00 (15 min.)
UPC-IV     3 020 114    12-12-08    18:30 (15 min.)
UPC-V      7 146 336    21-12-08    16:00 (1 h.)
UPC-VI     9 718 077    22-12-08    12:30 (1 h.)
UPC-VII    5 510 999    10-03-09    03:00 (1 h.)

 

The labeled traces are available as plain text files in a format similar to the NetFlow v5 flow-print output, without IP address information and with the corresponding application label obtained by L7-Filter.

 

Pr SrcP DstP Pkts Octets StartTime EndTime Active B/Pk Ts Fl Application
06 50 114f 2 3000 0901.00:59:15.924 0901.00:59:17.924 2.000 1500 00 10 skypetoskype
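A line in this format can be split into typed fields as sketched below. This is a minimal sketch under assumptions that the description above does not state explicitly: the columns are taken to be whitespace-separated; Pr, SrcP, DstP, Ts and Fl are read as hexadecimal (as the value 114f in the sample line suggests); Ts and Fl are assumed to be the type-of-service byte and the TCP flags; and the names FlowRecord and parse_flow_line are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FlowRecord:
    protocol: int          # Pr, assumed hexadecimal (06 -> 6, i.e. TCP)
    src_port: int          # SrcP, assumed hexadecimal
    dst_port: int          # DstP, assumed hexadecimal
    packets: int           # Pkts, decimal
    octets: int            # Octets, decimal
    start_time: str        # kept as text; the timestamp layout is not specified
    end_time: str
    active_seconds: float  # Active
    bytes_per_packet: int  # B/Pk, decimal
    tos: int               # Ts, assumed to be the ToS byte in hexadecimal
    tcp_flags: int         # Fl, assumed to be the TCP flags in hexadecimal
    application: str       # label assigned by L7-Filter


def parse_flow_line(line: str) -> FlowRecord:
    """Parse one labeled flow line (not the header line)."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    pr, srcp, dstp, pkts, octets, start, end, active, bpp, ts, fl, app = fields
    return FlowRecord(
        protocol=int(pr, 16),
        src_port=int(srcp, 16),
        dst_port=int(dstp, 16),
        packets=int(pkts),
        octets=int(octets),
        start_time=start,
        end_time=end,
        active_seconds=float(active),
        bytes_per_packet=int(bpp),
        tos=int(ts, 16),
        tcp_flags=int(fl, 16),
        application=app,
    )


record = parse_flow_line(
    "06 50 114f 2 3000 0901.00:59:15.924 0901.00:59:17.924 "
    "2.000 1500 00 10 skypetoskype"
)
print(record.src_port, record.dst_port, record.application)
```

Under the hexadecimal reading, the sample line describes a TCP flow from port 80 to port 4431. Packet and byte counts are parsed as decimal, which agrees with the sample (3000 octets over 2 packets gives 1500 bytes per packet).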

 

GROUND-TRUTH METHODOLOGY

In order to reduce the inaccuracy of L7-Filter we use three rules:

  • We apply the patterns in a priority order that depends on the degree of overmatching of each pattern (e.g., skypeout patterns are in the last positions of the rule list).
  • We do not label those packets that do not agree with the rules given by the pattern creators (e.g., packets detected as NTP with a size other than 48 bytes are not labeled).
  • In the case of multiple matches, we label the flow with the application with the highest priority, based on the quality of each pattern as reported in the L7-Filter documentation. If the quality of the patterns is equal, the label with the most occurrences is chosen.
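The third rule can be illustrated with a short sketch. It is not the code used to label the traces: the function name resolve_label and the numeric quality scores are hypothetical, standing in for the pattern quality ratings given in the L7-Filter documentation.

```python
from collections import Counter
from typing import Dict, List, Optional


def resolve_label(matches: List[str], quality: Dict[str, int]) -> Optional[str]:
    """Pick one application label for a flow that matched several patterns.

    matches: labels matched within the flow, one entry per match.
    quality: hypothetical numeric pattern quality (higher is better).
    """
    if not matches:
        return None
    counts = Counter(matches)
    # Compare by pattern quality first; break ties by number of occurrences.
    return max(counts, key=lambda app: (quality.get(app, 0), counts[app]))


# Hypothetical quality scores, for illustration only.
quality = {"http": 3, "bittorrent": 3, "skypeout": 1}

print(resolve_label(["skypeout", "skypeout", "http"], quality))
print(resolve_label(["http", "bittorrent", "bittorrent"], quality))
```

In the first call the higher-quality pattern wins even though it matched fewer times; in the second call the qualities are equal, so the label with more occurrences is chosen.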

We also perform a sanitization process in order to remove incorrect or incomplete flows that may confuse or bias the training phase. The sanitization process removes those TCP flows that are not properly formed (e.g., without TCP establishment or termination, and flows with packet loss or with out-of-order packets) from the training set. However, no sanitization process is applied to UDP traffic.
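As an illustration of what such a check might look like, the sketch below tests one direction of a TCP flow for connection establishment, termination, and gap-free in-order delivery. It is a simplified assumption about how the criteria could be implemented, not the procedure actually used: the Packet type, the function name is_well_formed, and the choice to treat a FIN or RST as termination are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

FIN, SYN, RST = 0x01, 0x02, 0x04  # standard TCP flag bits


@dataclass
class Packet:
    seq: int          # TCP sequence number
    payload_len: int  # TCP payload length in bytes
    flags: int        # TCP flags byte


def is_well_formed(packets: List[Packet]) -> bool:
    """Check one direction of a TCP flow, with packets in capture order."""
    if not packets:
        return False
    # Establishment: the first packet seen in this direction carries SYN.
    if not packets[0].flags & SYN:
        return False
    # Termination: some packet carries FIN or RST.
    if not any(p.flags & (FIN | RST) for p in packets):
        return False
    # No loss, retransmission or reordering: each packet starts exactly
    # where the previous one ended.
    expected = packets[0].seq
    for p in packets:
        if p.seq != expected:
            return False
        # SYN and FIN each consume one sequence number.
        consumed = p.payload_len + int(bool(p.flags & SYN)) + int(bool(p.flags & FIN))
        expected = (p.seq + consumed) % 2**32
    return True


complete = [Packet(1000, 0, SYN), Packet(1001, 100, 0x18), Packet(1101, 0, 0x11)]
reordered = [Packet(1000, 0, SYN), Packet(1101, 0, 0x11), Packet(1001, 100, 0x18)]
print(is_well_formed(complete), is_well_formed(reordered))
```

A flow would be kept in the training set only if both of its directions pass the check.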

  

TRACE REQUESTS

If you are interested in any of these labeled traces, send an email to: monitoring email

  

"Is our Ground-Truth for Traffic Classification Reliable?" Dataset

 This dataset is derived from the papers: 

Valentín Carela-Español, Tomasz Bujlow, and Pere Barlet-Ros: "Is Our Ground-Truth for Traffic Classification Reliable?", In Proc. of the Passive and Active Measurements Conference (PAM'14), Los Angeles, CA, USA, March 2014. [pdf] [doi]

Tomasz Bujlow, Valentín Carela-Español, and Pere Barlet-Ros: "Comparison of Deep Packet Inspection (DPI) tools for traffic classification", Technical Report, UPC-DAC-RR-CBA-2013-3, June 2013. [pdf]

 

ABSTRACT

The validation of the different proposals in the traffic classification literature is a controversial issue. Usually, these works base their results on a ground-truth built from private datasets and labeled by techniques of unknown reliability. This makes the validation and comparison with other solutions an extremely difficult task. This paper aims to be a first step towards addressing the validation and trustworthiness problem of network traffic classifiers. We perform a comparison between 6 well-known DPI-based techniques, which are frequently used in the literature for ground-truth generation. In order to evaluate these tools we have carefully built a labeled dataset of more than 500 000 flows, which contains traffic from popular applications. Our results present PACE, a commercial tool, as the most reliable solution for ground-truth generation. However, among the open-source tools available, NDPI and especially Libprotoident also achieve very high precision, while other, more frequently used tools (e.g., L7-Filter) are not reliable enough and should not be used for ground-truth generation in their current form.

  

DATASET

The dataset used in the paper "Is Our Ground-Truth for Traffic Classification Reliable?" consists of 1 262 022 flows captured over 66 days, between February 25, 2013 and May 1, 2013, which account for 35.69 GB of pure packet data. The dataset was built artificially so that it could be published with full packet payload. However, we manually simulated different human behaviours for each application studied in order to make it as representative as possible. The selected applications are shown below:

  • Web browsers: based on w3schools statistics: Chrome and Firefox (W7, XP, LX), Internet Explorer (W7, XP)
  • BitTorrent clients: based on CNET ranking: uTorrent and BitTorrent (W7, XP), Frostwire and Vuze (W7, XP, LX)
  • eDonkey clients: based on CNET ranking: eMule (W7, XP), aMule (LX)
  • FTP clients: based on CNET ranking: FileZilla (W7, XP, LX), SmartFTP Client (W7, XP), CuteFTP (W7, XP), WinSCP (W7, XP)
  • Remote Desktop servers: built-in (W7, XP), xrdp (LX)
  • SSH servers: sshd (LX)
  • Background traffic: DNS and NTP (W7, XP, LX), NETBIOS (W7, XP) 

The dataset consists of three pcap traces, one for each OS used (LX: Linux, W7: Windows 7, XP: Windows XP), and three INFO files, one for each pcap trace. Each line in the INFO file corresponds to a flow in the pcap trace and is described as follows:

 flow_id + "#" + start_time + "#" + end_time + "#" + local_ip + "#" + remote_ip + "#" + local_port + "#" + remote_port + "#" + transport_protocol + "#" + operating_system + "#" + process_name + "#" + HTTP Url + "#" + HTTP Referer + "#" + HTTP Content-type + "#"
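A line in this layout can be read by splitting on the "#" separator, as in the minimal sketch below. The function name parse_info_line and the dictionary keys are hypothetical, all values are kept as strings because the encoding of each field is not specified here, and the sample line is made up for illustration and does not come from the dataset. The sketch also assumes that no field contains a "#" character itself.

```python
from typing import Dict

INFO_FIELDS = [
    "flow_id", "start_time", "end_time", "local_ip", "remote_ip",
    "local_port", "remote_port", "transport_protocol", "operating_system",
    "process_name", "http_url", "http_referer", "http_content_type",
]


def parse_info_line(line: str) -> Dict[str, str]:
    """Split one INFO line into its 13 named fields."""
    parts = line.rstrip("\r\n").split("#")
    # The line ends with a separator, so the last element is empty.
    if parts and parts[-1].strip() == "":
        parts.pop()
    if len(parts) != len(INFO_FIELDS):
        raise ValueError(f"expected {len(INFO_FIELDS)} fields, got {len(parts)}")
    return dict(zip(INFO_FIELDS, parts))


# Made-up sample line with empty HTTP fields (illustration only).
sample = "#".join(
    ["1", "1361800000", "1361800010", "10.0.0.1", "192.0.2.1",
     "50000", "22", "TCP", "LX", "sshd", "", "", ""]
) + "#"

flow = parse_info_line(sample)
print(flow["process_name"], flow["remote_port"])
```

Empty fields, such as the HTTP fields of a non-HTTP flow, are preserved as empty strings, so the field count stays at 13.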

The process name was present for 520 993 flows (41.28 % of all the flows), which account for 32.33 GB (90.59 %) of the data volume. Additionally, 14 445 flows (1.14 % of all the flows), accounting for 0.28 GB (0.78 %) of the data volume, could be identified based on the HTTP content-type field extracted from the packets. Therefore, we were able to successfully establish the ground truth for 535 438 flows (42.43 % of all the flows), accounting for 32.61 GB (91.37 %) of the data volume. The remaining flows are unlabeled due to their short lifetime (below 1 s), which made VBS, our ground-truth generator, unable to reliably establish the corresponding sockets. Only the successfully classified flows are taken into account during the evaluation of the classifiers. However, all the flows are included in the publicly available traces. This ensures data integrity and the proper operation of the classifiers, which may rely on the coexistence of different flows. We isolated several application classes based on the information stored in the database (e.g., application labels, HTTP content-type field). The classes, together with the number of flows and the data volume, are shown in the following table:
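The percentages in this paragraph follow directly from the flow counts and data volumes it quotes, and the short check below recomputes them. The helper name share is ours, and rounding to two decimals is an assumption about how the figures were reported.

```python
TOTAL_FLOWS = 1_262_022
TOTAL_GB = 35.69


def share(part: float, total: float) -> float:
    """Percentage of total, rounded to two decimals."""
    return round(100 * part / total, 2)


# Flows labeled through the process name.
print(share(520_993, TOTAL_FLOWS), share(32.33, TOTAL_GB))  # 41.28 90.59
# Flows labeled through the HTTP content-type field.
print(share(14_445, TOTAL_FLOWS), share(0.28, TOTAL_GB))    # 1.14 0.78
# All labeled flows: the two groups add up to the reported totals.
labeled_flows = 520_993 + 14_445
labeled_gb = round(32.33 + 0.28, 2)
print(labeled_flows, labeled_gb)                            # 535438 32.61
print(share(labeled_flows, TOTAL_FLOWS), share(labeled_gb, TOTAL_GB))  # 42.43 91.37
```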

 

Application       Flows    Megabytes
Edonkey         176 581     2 823.88
BitTorrent       62 845     2 621.37
FTP                 876     3 089.06
DNS               6 600         1.74
NTP              27 786         4.03
RDP             132 907    13 218.47
NETBIOS           9 445         5.17
SSH              26 219        91.80
Browser HTTP     46 669     5 757.32
Browser RTMP        427     5 907.15
Unclassified    771 667     3 026.57

 

For a more detailed description of the dataset, we refer the reader to the paper and technical report cited above.

  

GROUND-TRUTH METHODOLOGY

To collect and accurately label the flows, we adapted the Volunteer-Based System (VBS) developed at Aalborg University. The task of VBS is to collect information about Internet traffic flows (i.e., start time of the flow, number of packets contained in the flow, local and remote IP addresses, local and remote ports, transport layer protocol) together with detailed information about each packet (i.e., direction, size, TCP flags, and timestamp relative to the previous packet in the flow). For each flow, the system also collects the process name associated with that flow. The process name is obtained from the system sockets. This way, we can reliably determine the application associated with each flow. Additionally, the system collects some information about the HTTP content type (e.g., text/html, video/x-flv). The captured information is transmitted to the VBS server, which stores the data in a MySQL database. The source code was published under a GPL license. The modified version of the VBS client captures full Ethernet frames for each packet and extracts the HTTP URL and Referer fields. We added a module called pcapBuilder, which is responsible for dumping the packets from the database to PCAP files. At the same time, INFO files are generated to provide detailed information about each flow, which allows us to assign each packet from the PCAP file to an individual flow.
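One possible way to use the INFO files to assign packets to flows is sketched below. It is only an assumption about how the matching could be done, not a description of pcapBuilder or VBS: the Flow and Packet types and the function names are hypothetical, timestamps are assumed to be numeric and in the same units in both files, and a packet is matched to a flow when its endpoints and protocol agree (in either direction) and its timestamp falls within the flow's start and end times.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Flow(NamedTuple):
    flow_id: str
    start_time: float
    end_time: float
    local_ip: str
    remote_ip: str
    local_port: int
    remote_port: int
    protocol: str


class Packet(NamedTuple):
    timestamp: float
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


Key = Tuple[Tuple[str, int], Tuple[str, int], str]


def endpoint_key(ip_a: str, port_a: int, ip_b: str, port_b: int, protocol: str) -> Key:
    """Direction-independent key: both endpoints in sorted order."""
    first, second = sorted([(ip_a, port_a), (ip_b, port_b)])
    return (first, second, protocol)


def build_index(flows: Iterable[Flow]) -> Dict[Key, List[Flow]]:
    """Group flows by their endpoint key for fast lookup."""
    index: Dict[Key, List[Flow]] = defaultdict(list)
    for f in flows:
        key = endpoint_key(f.local_ip, f.local_port, f.remote_ip, f.remote_port, f.protocol)
        index[key].append(f)
    return index


def find_flow(packet: Packet, index: Dict[Key, List[Flow]]) -> Optional[str]:
    """Return the id of the flow that contains the packet, or None."""
    key = endpoint_key(packet.src_ip, packet.src_port,
                       packet.dst_ip, packet.dst_port, packet.protocol)
    for f in index.get(key, []):
        if f.start_time <= packet.timestamp <= f.end_time:
            return f.flow_id
    return None


# Made-up flow and packets, for illustration only.
index = build_index([Flow("1", 10.0, 20.0, "10.0.0.1", "192.0.2.1", 50000, 22, "TCP")])
reply = Packet(12.5, "192.0.2.1", "10.0.0.1", 22, 50000, "TCP")
late = Packet(25.0, "10.0.0.1", "192.0.2.1", 50000, 22, "TCP")
print(find_flow(reply, index), find_flow(late, index))
```

The time-window test matters because the same pair of endpoints can be reused by several flows over the capture period.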

 

TRACE REQUESTS

If you are interested in any of these labeled traces, send an email to: monitoring email

 

PUBLICATIONS

The complete list of publications related to this group can be found here.

 

 

Comunicacions de Banda Ampla © 2009