Difference between revisions of "Traces"

From SimpleWiki
m (Reverted edits by Idiliod (talk) to last revision by Pras)
Personal cloud storage is becoming more and more popular, with Dropbox certainly being the best-known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?
+
From this location you can download several traces, including anonymized packet headers (tcpdump/libpcap), Netflow version 5 data, a labeled dataset for intrusion detection, and Dropbox traffic traces. More information on the data collection and on the anonymization procedures can be found below. When using these traces, please refer to the [[Acceptable Use policy]].
  
We have been doing research on the usage of Dropbox ([http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf see our results here]). In this experiment, we collected basic statistics of what files are stored in Dropbox folders.
 
  
== Datasets ==
 
  
Download our datasets from here:
+
 
 +
= Dropbox Traces =
 +
 
 +
=== [[Dropbox_Traces | Dropbox Traffic Traces]]  ===
 +
 
 +
You can download from this page the flow data used in the following paper:
 +
 
 +
* [http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf '''Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012''']
 +
 
 +
Check [[Dropbox_Traces | here]] for more details. Several scripts used to process the data are also available [[Dropbox_Traces#Sample_Scripts | here]].
 +
 
 +
==== First Data Capture ====
 +
 
 +
These datasets were captured from March 24, 2012 to May 5, 2012.
 +
 
 +
{| class="wikitable" style="text-align: center; width: 400px; height: 100px;"
 +
|-
 +
! scope="col" | Name
 +
! scope="col" | File Size
 +
! scope="col" | Flows
 +
! scope="col" | Devices
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/campus1_dataset1.log.gz Campus 1]
 +
|  21MB || 167,189 || 283
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/campus2_dropbox.log.gz Campus 2]
 +
|  262M || 1,902,824 || 6,609
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/home1_dropbox.log.gz Home 1]
 +
|  181M || 1,438,369 || 3,350
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/home2_dropbox.log.gz Home 2]
 +
|  82M || 693,086 || 1,313
 +
|}
 +
 
 +
==== Second Data Capture ====
 +
 
 +
This dataset was captured from June 01, 2012 to July 31, 2012.
  
 
{| class="wikitable" style="text-align: center; width: 400px; height: 40px;"
 
{| class="wikitable" style="text-align: center; width: 400px; height: 40px;"
 
! scope="col" | Name
 
! scope="col" | Name
 
! scope="col" | File Size
 
! scope="col" | File Size
! scope="col" | Volunteers
+
! scope="col" | Flows
 +
! scope="col" | Devices
 
|-
 
|-
! scope="row" | [http://traces.simpleweb.org/dropbox/crawler/dropbox_crawler.tar.gz Crawler Dataset]
+
! scope="row" | [http://traces.simpleweb.org/dropbox/campus1_dataset2.log.gz Campus 1]
| 219M || 333
+
| 32M || 264,131 || 270
 
|}
 
|}
  
Some results derived from these data can be found in [http://eprints.eemcs.utwente.nl/24136/01/2013_drago_thesis.pdf here].
 
  
In particular, the figures presented in Sect. 5.3 of the linked document are obtained using the
+
= Labeled Dataset for Intrusion Detection =
scripts available in the folder "scripts" inside the tarball.
+
=== [http://traces.simpleweb.org/traces/netflow/netflow2/ Labeled Dataset for Intrusion Detection] ===
 +
 
 +
In this scenario, a honeypot (running in a virtual machine) ran for 6 days, from Tuesday 23 September 2008 12:40:00 GMT to Monday 29 September 2008 22:40:00 GMT. The honeypot was hosted on the University of Twente network and directly connected to the Internet. The monitoring window covers both working days and weekend days. The data collection resulted in a 24 GB dump file containing 155.2 million packets. Processing the dumped data and logs, collected over a period of 6 days, resulted in 14.2M flows and 7.6M alerts. More information on the labeling procedure can be found [[Labeled_Dataset_for_Intrusion_Detection | here]].
 +
 
 +
 
 +
= Pcap Traces =
 +
 
 +
These datasets are a collection of anonymized packet headers (tcpdump/libpcap) and NetFlow data collected from various locations in the Netherlands. More information on the data collection and anonymization procedures can be found [http://traces.simpleweb.org/traces/TCP-IP/background/simpleweb.pdf here]. Below is a short description of the scenarios in which the datasets were collected.
 +
 
 +
 
 +
=== [http://traces.simpleweb.org/traces/TCP-IP/location1/ Trace 1 - Packet Headers] ===
  
== How does our data collection work? ==
+
In scenario 1, we measured the 300 Mbit/s Ethernet link (a trunk of 3 x 100 Mbit/s) that connects the residential network of a university to the university's core network. About 2000 students are connected to the residential network, each with a 100 Mbit/s Ethernet access link. The residential network itself consists of 100 and 300 Mbit/s links to the various switches, depending on the aggregation level. The measured link has an average load of about 60%. The measurements were taken in July 2002.
  
* It scans Dropbox folders
+
=== [http://traces.simpleweb.org/traces/TCP-IP/location2/ Trace 2 - Packet Headers] ===
* Calculates basic statistics
 
* Shows what has been collected for approval
 
* Sends the statistics to us
 
  
== What has been logged? ==
+
In the second scenario, we measured the 1 Gbit/s Ethernet link connecting a research institute to the Dutch academic and research network. About 200 researchers and support staff work at this institute. They all have a 100 Mbit/s access link, and the core network of the institute consists of 1 Gbit/s links. The measured link is only mildly loaded, usually around 1%. The measurements are from May to August 2003.
  
For each file/folder in a Dropbox, the program collects:
+
=== [http://traces.simpleweb.org/traces/TCP-IP/location3/ Trace 3 - Packet Headers] ===
<pre>
 
* Size in bytes
 
* Last modification time
 
* Mime type of the file
 
* File extension
 
* MD5 Hash of both initial and final 8 kbytes of the file
 
* MD5 Hash of the file name/path
 
</pre>
 
  
The program also sends to us:
+
This dataset was collected at a large college. Its 1 Gbit/s link to the Dutch academic and research network (the link that was measured) carries traffic for over 1000 concurrent students and staff during busy hours. The access link speed on this network is, in general, 100 Mbit/s. The average load on the 1 Gbit/s link is usually around 10-15%. The measurements were taken from September to December 2003.
<pre>
 
* MD5 Hash of Dropbox configuration files (or MAC address if we cannot read the former)
 
* MD5 Hash of the path of your Dropbox home folder
 
* Your IP address and operating system version
 
* Error logs, in case something goes wrong during the data collection
 
</pre>
 
  
Collected information is sent via plain HTTP to a centralized collection server.
+
=== [http://traces.simpleweb.org/traces/TCP-IP/location4/ Trace 4 - Packet Headers] ===
  
== Client source code ==
+
In scenario 4, we monitored the 1 Gbit/s aggregated uplink of an ADSL access network. A couple of hundred ADSL customers, mostly student dorms, are connected to this access network. Access link speeds vary from 256 kbit/s (down and up) to 8 Mbit/s (down) and 1 Mbit/s (up). The average load on the aggregated uplink is around 150 Mbit/s. The measurements are from February to July 2004.
  
Download the source code by clicking [http://www.simpleweb.org/dropbox/source_python.zip here] for the native versions (you will need Python 2.7 and [http://www.pyinstaller.org/ PyInstaller] for building these versions), or [http://www.simpleweb.org/dropbox/source_java.zip here] for the Java version.
+
=== [http://traces.simpleweb.org/traces/TCP-IP/location5/ Trace 5 - Packet Headers] ===
  
== More information ==
+
The Packet Headers 5 dataset was collected at a hosting provider, i.e., a commercial party that offers floor and rack space to clients who want to connect, for example, their WWW servers to the Internet. At this hosting provider, these servers are connected (in most cases) at 100 Mbit/s to the provider's core network. The bandwidth capacity level of this hosting provider's uplink (the link we measured) is around 50 Mbit/s. The measurements are from December 2003 to February 2004.
  
The datasets in this page are used in the following publications:
+
=== [http://traces.simpleweb.org/traces/TCP-IP/location6/ Trace 6 - Packet Headers] ===
  
  @phdthesis{drago_understanding_2013,
+
In scenario 6, we measured a 100 Mbit/s Ethernet link connecting an educational organization to the Internet. This is a relatively small organization with around 35 employees and a little over 100 students working and studying at this site (the organization's headquarters). All workstations at this location (100 in total) have a 100 Mbit/s LAN connection. The core network consists of a 1 Gbit/s connection. The recordings took place between the external optical fiber modem and the first firewall. The measured link was only mildly loaded during this period. The measurements are from May to June 2007.
          author      = {Idilio Drago},
 
          title        = {Understanding and Monitoring Cloud Services},
 
          school      = {University of Twente},
 
          url          = {<nowiki>\url{http://eprints.eemcs.utwente.nl/24136/</nowiki>}},
 
          year        = {2013},
 
  },
 
  
  @inproceedings{drago_caracterizacao_2013,
 
          author      = {Idilio Drago and Alex Borges Vieira and Ana Paula Couto da Silva},
 
          title        = {Caracteriza{\c c}{\~a}o dos Arquivos Armazenados no Dropbox},
 
          booktitle    = {Anais do Workshop de Redes {P2P}, Din{\^a}micas, Sociais e Orientadas a Conte{\'u}do},
 
          series      = <nowiki>{{WP2P+}}</nowiki>,
 
          pages        = {109--114},
 
          year        = {2013},
 
  },
 
  
 +
= Netflow Traces =
  
== More information ==
+
=== [http://traces.simpleweb.org/traces/netflow/netflow1/ Trace 7 - Netflow Data] ===
  
* More information about our work can be found in this paper:
+
The Netflow version 5 data was recorded at the access router connecting a university to its ISP. It contains flow information about most of the university's incoming and outgoing traffic and some internal traffic as well. The traces cover a period of two working days, namely between Wednesday August 1st 2007, 00:00 and Thursday August 2nd 2007, 23:59. The university has a /16 network providing connectivity to the employees and the students in its buildings and on the campus. The university is connected to its ISP through a 10 Gbps optical link with an average load of 650 Mbps and peaks up to 1.0 Gbps.
  
[http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf '''Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012''']
 
  
* [[Dropbox Traces|This page]] has more information about the data we used in our research so far.
+
= Software =
  
 +
Some analysis software is described in [http://traces.simpleweb.org/traces/TCP-IP/background/m2c-D16.pdf this PDF document] and can be downloaded from [[Analysis Software|here]].
  
== External Links ==
+
The source code of the application used to anonymize the Netflow data, which is based on the AnonTool API, can be found [http://traces.simpleweb.org/traces/TCP-IP/background/tools/anon_nflow.tar.gz here].
  
These institutes are involved in this research:
+
= Other Trace Sources =
* [http://www.utwente.nl/ewi/dacs/ DACS - University of Twente] - Contact: Idilio Drago - idilio.drago@polito.it
+
* [http://www.caida.org/data/overview/ CAIDA]
* [http://www.ufjf.br/portal/ Universidade Federal de Juiz de Fora] Contact: Alex Vieira - alex.borges@ufjf.edu.br
+
* [http://www.ist-mome.org/database/MeasurementData/?cmd=datatypes MOME]
* [http://www.tlc-networks.polito.it/ Telecommunication Networks Group - Politecnico di Torino] - Marco Mellia - mellia@tlc.polito.it
+
* [http://wand.cs.waikato.ac.nz/wits/ Waikato Internet Traffic Storage project]
 +
* [https://data-repository.ripe.net/ RIPE]
 +
* [http://totem.info.ucl.ac.be/dataset.html TOTEM project]
 +
* [http://ita.ee.lbl.gov/ Internet Traffic Archive]
 +
* [http://tracer.csl.sony.co.jp/mawi/ MAWI]
 +
* [http://traces.cs.umass.edu/ UMass Trace Repository]
 +
* [http://crawdad.cs.dartmouth.edu/ CRAWDAD]
 +
* [http://www.predict.org/ PREDICT]

Revision as of 10:28, 9 May 2014

From this location you can download several traces, including anonymized packet headers (tcpdump/libpcap), Netflow version 5 data, a labeled dataset for intrusion detection, and Dropbox traffic traces. More information on the data collection and on the anonymization procedures can be found below. When using these traces, please refer to the Acceptable Use policy.



Dropbox Traces

Dropbox Traffic Traces

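You can download from this page the flow data used in the following paper:

Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012 (http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf)

See the Dropbox Traces page for more details. Several scripts used to process the data are also available in the Sample Scripts section of that page.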

First Data Capture

These datasets were captured from March 24, 2012 to May 5, 2012.

Name        File Size    Flows        Devices
Campus 1    21 MB        167,189      283
Campus 2    262 MB       1,902,824    6,609
Home 1      181 MB       1,438,369    3,350
Home 2      82 MB        693,086      1,313

Second Data Capture

This dataset was captured from June 01, 2012 to July 31, 2012.

Name        File Size    Flows        Devices
Campus 1    32 MB        264,131      270
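
The Dropbox trace files are distributed as gzip-compressed logs (.log.gz). The following is a minimal sketch of streaming such a file without unpacking it to disk. It assumes one record per line and that header or comment lines start with "#"; both are assumptions about the file layout, not something stated on this page, so the result need not match the Flows column above. The function name count_records is hypothetical.

```python
import gzip
import io


def count_records(source):
    """Count data lines in a gzip-compressed text log.

    `source` may be a file path or a binary file object.
    Assumption: one record per line, '#' marks header/comment lines.
    """
    count = 0
    with gzip.open(source, mode="rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.strip() and not line.startswith("#"):
                count += 1
    return count


# Synthetic, in-memory example (not real trace data).
sample = gzip.compress(b"#header\nflow a\nflow b\n\nflow c\n")
demo_count = count_records(io.BytesIO(sample))
```

For a downloaded file, the same function can be called with the local path, for example count_records("campus1_dataset1.log.gz").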


Labeled Dataset for Intrusion Detection

Labeled Dataset for Intrusion Detection

In this scenario, a honeypot (running in a virtual machine) ran for 6 days, from Tuesday 23 September 2008 12:40:00 GMT to Monday 29 September 2008 22:40:00 GMT. The honeypot was hosted on the University of Twente network and directly connected to the Internet. The monitoring window covers both working days and weekend days. The data collection resulted in a 24 GB dump file containing 155.2 million packets. Processing the dumped data and logs, collected over a period of 6 days, resulted in 14.2M flows and 7.6M alerts. More information on the labeling procedure can be found on the Labeled Dataset for Intrusion Detection page.


Pcap Traces

These datasets are a collection of anonymized packet headers (tcpdump/libpcap) and NetFlow data collected from various locations in the Netherlands. More information on the data collection and anonymization procedures can be found at http://traces.simpleweb.org/traces/TCP-IP/background/simpleweb.pdf. Below is a short description of the scenarios in which the datasets were collected.
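
As a quick sanity check on a downloaded packet-header trace, the 24-byte global header of a classic libpcap capture file can be read with the standard library alone. This is a minimal sketch: it assumes the traces use the classic pcap file layout with microsecond timestamps (the nanosecond variant and pcapng are not handled), and the function name parse_pcap_global_header is hypothetical.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4          # classic pcap, microsecond timestamps
PCAP_MAGIC_SWAPPED = 0xD4C3B2A1  # same magic, written with the other byte order


def parse_pcap_global_header(data):
    """Parse the 24-byte global header of a classic pcap file."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    magic = struct.unpack("<I", data[:4])[0]
    if magic == PCAP_MAGIC:
        endian = "<"   # file was written little-endian
    elif magic == PCAP_MAGIC_SWAPPED:
        endian = ">"   # file was written big-endian
    else:
        raise ValueError("not a classic pcap file")
    major, minor, thiszone, sigfigs, snaplen, linktype = struct.unpack(
        endian + "HHiIII", data[4:24]
    )
    return {
        "version": (major, minor),
        "snaplen": snaplen,    # maximum number of bytes stored per packet
        "linktype": linktype,  # 1 means Ethernet
    }


# Synthetic header for illustration (not taken from the traces).
demo_header = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 96, 1)
demo_info = parse_pcap_global_header(demo_header)
```

With a real file, pass the first 24 bytes read in binary mode. A small snaplen is what one would expect from header-only captures.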


Trace 1 - Packet Headers

In scenario 1, we measured the 300 Mbit/s Ethernet link (a trunk of 3 x 100 Mbit/s) that connects the residential network of a university to the university's core network. About 2000 students are connected to the residential network, each with a 100 Mbit/s Ethernet access link. The residential network itself consists of 100 and 300 Mbit/s links to the various switches, depending on the aggregation level. The measured link has an average load of about 60%. The measurements were taken in July 2002.

Trace 2 - Packet Headers

In the second scenario, we measured the 1 Gbit/s Ethernet link connecting a research institute to the Dutch academic and research network. About 200 researchers and support staff work at this institute. They all have a 100 Mbit/s access link, and the core network of the institute consists of 1 Gbit/s links. The measured link is only mildly loaded, usually around 1%. The measurements are from May to August 2003.

Trace 3 - Packet Headers

This dataset was collected at a large college. Its 1 Gbit/s link to the Dutch academic and research network (the link that was measured) carries traffic for over 1000 concurrent students and staff during busy hours. The access link speed on this network is, in general, 100 Mbit/s. The average load on the 1 Gbit/s link is usually around 10-15%. The measurements were taken from September to December 2003.

Trace 4 - Packet Headers

In scenario 4, we monitored the 1 Gbit/s aggregated uplink of an ADSL access network. A couple of hundred ADSL customers, mostly student dorms, are connected to this access network. Access link speeds vary from 256 kbit/s (down and up) to 8 Mbit/s (down) and 1 Mbit/s (up). The average load on the aggregated uplink is around 150 Mbit/s. The measurements are from February to July 2004.

Trace 5 - Packet Headers

The Packet Headers 5 dataset was collected at a hosting provider, i.e., a commercial party that offers floor and rack space to clients who want to connect, for example, their WWW servers to the Internet. At this hosting provider, these servers are connected (in most cases) at 100 Mbit/s to the provider's core network. The bandwidth capacity level of this hosting provider's uplink (the link we measured) is around 50 Mbit/s. The measurements are from December 2003 to February 2004.

Trace 6 - Packet Headers

In scenario 6, we measured a 100 Mbit/s Ethernet link connecting an educational organization to the Internet. This is a relatively small organization with around 35 employees and a little over 100 students working and studying at this site (the organization's headquarters). All workstations at this location (100 in total) have a 100 Mbit/s LAN connection. The core network consists of a 1 Gbit/s connection. The recordings took place between the external optical fiber modem and the first firewall. The measured link was only mildly loaded during this period. The measurements are from May to June 2007.


Netflow Traces

Trace 7 - Netflow Data

The Netflow version 5 data was recorded at the access router connecting a university to its ISP. It contains flow information about most of the university's incoming and outgoing traffic and some internal traffic as well. The traces cover a period of two working days, namely between Wednesday August 1st 2007, 00:00 and Thursday August 2nd 2007, 23:59. The university has a /16 network providing connectivity to the employees and the students in its buildings and on the campus. The university is connected to its ISP through a 10 Gbps optical link with an average load of 650 Mbps and peaks up to 1.0 Gbps.
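
For readers unfamiliar with Netflow version 5, each export datagram consists of a 24-byte header followed by up to 30 fixed-size 48-byte flow records. The sketch below decodes one such datagram with the standard library. It assumes raw export datagrams as input; the on-disk format of the published trace files is not stated on this page and may differ (for example, a collector-specific format). The function name parse_netflow_v5 and the dictionary keys are hypothetical.

```python
import ipaddress
import struct

# Netflow v5 header: version, count, sys_uptime, unix_secs, unix_nsecs,
# flow_sequence, engine_type, engine_id, sampling_interval (24 bytes).
HEADER = struct.Struct("!HHIIIIBBH")
# Netflow v5 record: srcaddr, dstaddr, nexthop, input, output, dPkts, dOctets,
# first, last, srcport, dstport, pad, tcp_flags, prot, tos, src_as, dst_as,
# src_mask, dst_mask, pad (48 bytes).
RECORD = struct.Struct("!4s4s4sHHIIIIHHxBBBHHBBxx")


def parse_netflow_v5(datagram):
    """Return the flow records of one Netflow v5 export datagram as dicts."""
    if len(datagram) < HEADER.size:
        raise ValueError("datagram shorter than the v5 header")
    version, count = HEADER.unpack_from(datagram, 0)[:2]
    if version != 5:
        raise ValueError("not a Netflow v5 datagram")
    if len(datagram) < HEADER.size + count * RECORD.size:
        raise ValueError("datagram truncated")
    flows = []
    for i in range(count):
        f = RECORD.unpack_from(datagram, HEADER.size + i * RECORD.size)
        flows.append({
            "src": str(ipaddress.IPv4Address(f[0])),
            "dst": str(ipaddress.IPv4Address(f[1])),
            "packets": f[5],
            "bytes": f[6],
            "sport": f[9],
            "dport": f[10],
            "tcp_flags": f[11],
            "proto": f[12],
        })
    return flows


# Synthetic datagram with a single record (values are illustrative only).
demo_datagram = HEADER.pack(5, 1, 0, 0, 0, 0, 0, 0, 0) + RECORD.pack(
    bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]), bytes(4),
    1, 2, 3, 1500, 0, 1000, 12345, 80, 0x18, 6, 0, 0, 0, 16, 16,
)
demo_flows = parse_netflow_v5(demo_datagram)
```

Because the published data is anonymized, the decoded addresses would not be the original ones.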


Software

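Some analysis software is described in a PDF document (http://traces.simpleweb.org/traces/TCP-IP/background/m2c-D16.pdf) and can be downloaded from the Analysis Software page.

The source code of the application used to anonymize the Netflow data, which is based on the AnonTool API, can be found at http://traces.simpleweb.org/traces/TCP-IP/background/tools/anon_nflow.tar.gz.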

Other Trace Sources