Difference between revisions of "Dropbox Traces"

From SimpleWiki
 
(57 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
You can download from this page the flow data used in the following paper:
 
You can download from this page the flow data used in the following paper:
  
* '''Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. In: Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012'''
+
* [http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf '''Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012''']
  
As described in the paper, the data was captured at 4 vantage points in 2 European countries. Most of the data were collected from March 24, 2012 to May 5, 2012. A second dataset was collected in Campus 1 in June and July 2012 to complement the analysis.  
+
As described in the paper, the data was captured at 4 vantage points in 2 European countries. The first 4 files were collected from March 24, 2012 to May 5, 2012. A second dataset was collected in Campus 1 in June and July 2012 to complement the analysis.
  
All data was captured using Tstat: An open source monitoring tool developed at [http://www.tlc-networks.polito.it/ Politecnico di Torino]. Tstat exports flow data containing more than 100 metrics. The source code of Tstat can be obtained from [http://tstat.tlc.polito.it here].
+
The data was captured using Tstat: An open source monitoring tool developed at [http://www.tlc-networks.polito.it/ Politecnico di Torino]. Tstat exports flow data containing more than 100 metrics. The source code of Tstat can be obtained from [http://tstat.tlc.polito.it here]. More information about the DN-Hunter version of Tstat, needed for some experiments, can be found [http://www.tlc-networks.polito.it/oldsite/mellia/papers/DN-HunterImc12.pdf here]. Note that all IP addresses in the datasets are anonymized.
  
 
== Traces ==
 
== Traces ==
  
=== First data capture ===
+
=== First Data Capture ===
* [http://traces.simpleweb.org/dropbox/campus1_dataset1.log.gz Campus 1]
 
* Campus 2 (soon)
 
* Home 1 (soon)
 
* Home 2 (soon)
 
  
=== Second data capture ===
+
These datasets were captured from March 24, 2012 to May 5, 2012.
* [http://traces.simpleweb.org/dropbox/campus1_dataset2.log.gz Campus 1]
+
 
 +
{| class="wikitable" style="text-align: center; width: 400px; height: 100px;"
 +
|-
 +
! scope="col" | Name
 +
! scope="col" | File Size
 +
! scope="col" | Flows
 +
! scope="col" | Devices
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/campus1_dataset1.log.gz Campus 1]
 +
|  21MB || 167,189 || 283
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/campus2_dropbox.log.gz Campus 2]
 +
|  262M || 1,902,824 || 6,609
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/home1_dropbox.log.gz Home 1]
 +
|  181M || 1,438,369 || 3,350
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/home2_dropbox.log.gz Home 2]
 +
|  82M || 693,086 || 1,313
 +
|}
 +
 
 +
=== Second Data Capture ===
 +
 
 +
This dataset was captured from June 01, 2012 to July 31, 2012.
 +
 
 +
{| class="wikitable" style="text-align: center; width: 400px; height: 40px;"
 +
|-
 +
! scope="col" | Name
 +
! scope="col" | File Size
 +
! scope="col" | Flows
 +
! scope="col" | Devices
 +
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/campus1_dataset2.log.gz Campus 1]
 +
| 32M || 264,131 || 270
 +
|}
  
 
== Acceptable Use Policy ==
 
== Acceptable Use Policy ==
Line 24: Line 54:
 
* If noticing vulnerabilities in the anonymization procedure the user is kindly asked to inform the repository administrators.
 
* If noticing vulnerabilities in the anonymization procedure the user is kindly asked to inform the repository administrators.
  
* When writing a paper using this data, we ask the user to cite:
+
* When writing a paper using this data, please cite:
  
 
  @inproceedings{drago2012_dropbox,
 
  @inproceedings{drago2012_dropbox,
Line 31: Line 61:
 
   booktitle    = {Proceedings of the 12th ACM SIGCOMM Conference on Internet Measurement},
 
   booktitle    = {Proceedings of the 12th ACM SIGCOMM Conference on Internet Measurement},
 
   series        = {IMC'12},
 
   series        = {IMC'12},
   pages        = {},
+
   pages        = {481-494},
 
   year          = {2012}
 
   year          = {2012}
 
  }
 
  }
Line 37: Line 67:
 
== Format ==
 
== Format ==
  
All files are in a format similar to the [http://tstat.polito.it/measure.shtml#log_tcp_complete log_tcp_complete] saved by Tstat.
+
All files are in a format '''similar''' to the [http://tstat.polito.it/measure.shtml#log_tcp_complete log_tcp_complete] saved by Tstat.
  
 
The following columns are found in these traces:
 
The following columns are found in these traces:
Line 44: Line 74:
 
# C2S # S2C # Short description      # Unit  # Long description            #
 
# C2S # S2C # Short description      # Unit  # Long description            #
 
############################################################################
 
############################################################################
#  1  # 45  # Client/Server IP addr  # -    # IP addresses of the client/server
+
#  1  # 45  # Client/Server IP addr  # -    # Anonymized IP addresses of the client/server
 
#  2  # 46  # Client/Server TCP port # -    # TCP port addresses for the client/server
 
#  2  # 46  # Client/Server TCP port # -    # TCP port addresses for the client/server
 
#  3  # 47  # packets                # -    # total number of packets observed from the client/server
 
#  3  # 47  # packets                # -    # total number of packets observed from the client/server
Line 104: Line 134:
 
# 101      # Connection type        # -    # Bitmask stating the connection type as identified by TCPL7 inspection engine (see protocol.h)
 
# 101      # Connection type        # -    # Bitmask stating the connection type as identified by TCPL7 inspection engine (see protocol.h)
 
############################################################################
 
############################################################################
# 102      # P2P type              # -    # Type of P2P protocol, as identified by the IPP2P engine (see ipp2p_tstat.h)
+
</pre>
# 103      # P2P subtype            # -    # P2P protocol message type, as identified by the IPP2P engine (see ipp2p_tstat.c)
+
 
# 104      # ED2K Data              # -    # For P2P ED2K flows, the number of data messages
+
Note that the last columns of the current [http://tstat.polito.it/measure.shtml#log_tcp_complete log_tcp_complete] of Tstat are not included. Specifically for this analysis, the following extra columns were added:
# 105      # ED2K Signaling        # -    # For P2P ED2K flows, the number of signaling (not data) messages
+
 
# 106      # ED2K C2S              # -    # For P2P ED2K flows, the number of client<->server messages
+
<pre>
# 107      # ED2K C2C              # -    # For P2P ED2K flows, the number of client<->client messages
 
# 108      # ED2K Chat              # -    # For P2P ED2K flows, the number of chat messages
 
 
############################################################################
 
############################################################################
# 109       # HTTP type              # -    # For HTTP flows, the identified Web2.0 content (see the http_content enum in struct.h)
+
# 102       # C2S messages          # -    # PSH-separated "messages" C2S
 +
# 103      # S2C messages          # -    # PSH-separated "messages" S2C
 +
# 104      # DB host_int            # -    # Anonymized Dropbox device ID
 +
# 105      # DB service            # -    # Dropbox service inferred from the FQDN requested by the user or from server IP addresses
 
############################################################################
 
############################################################################
 
</pre>
 
</pre>
  
Specifically for this analysis, the following extra columns were added:  
+
Columns 102 and 103 were added some weeks after the data capture started in some vantage points. The columns are filled with a "-" for the period in which the value was not yet captured.
 +
 
 +
Column 105 has a string referring to a Dropbox service. Check Sec. 2 and Tab. 1 in the paper for details. The scripts below (e.g. traffic_share.py) also contain more information on how those values were interpreted in the paper. Flows marked as "Unknown" were identified as related to Dropbox, but the destination service was unclear.
 +
 
 +
== Sample Scripts ==
 +
 
 +
* [http://traces.simpleweb.org/dropbox/scripts_v2.tar.gz Download the scripts (8.5K) for plotting the figures of the paper]
 +
 
 +
The scripts are written in bash, awk or python. Gnuplot is required, and each figure is created by a separate bash script. For example, after unpacking the files and downloading the data, the following command creates Fig.05 - the output eps will be in a sub-directory called 'figs':
 +
 
 
<pre>
 
<pre>
############################################################################
+
$ ./fig05_ts_contacted_servers.sh campus1_dataset1.log.gz campus2_dropbox.log.gz home1_dropbox.log.gz home2_dropbox.log.gz
# 110      # C2S messages          # -    #
 
# 111      # S2C messages          # -    #
 
# 112      # DB host_int            # -    #
 
# 113      # DB service            # -    #
 
############################################################################
 
 
</pre>
 
</pre>
 +
 +
== List of server IPs ==
 +
 +
Dropbox server IPs are public. To enforce privacy, we had to anonymize both client and server IPs in our datasets, using distinct methods. We however release the [http://www.simpleweb.org/w/images/2/27/Ips.zip list of Dropbox server IPs] that can be obtained from the DNS.
 +
 +
== External Links ==
 +
 +
* [http://www-net.cs.umass.edu/imc2012/ Conference Website]
 +
* [http://www-net.cs.umass.edu/imc2012/program.htm Conference Program]

Latest revision as of 08:05, 23 October 2013

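This page provides the flow data used in the following paper:

  • Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012 (http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf)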

As described in the paper, the data was captured at 4 vantage points in 2 European countries. The first 4 files were collected from March 24, 2012 to May 5, 2012. A second dataset was collected in Campus 1 in June and July 2012 to complement the analysis.

The data was captured using Tstat, an open-source monitoring tool developed at Politecnico di Torino (http://www.tlc-networks.polito.it/). Tstat exports flow data containing more than 100 metrics. The source code of Tstat can be obtained from http://tstat.tlc.polito.it. More information about the DN-Hunter version of Tstat, needed for some experiments, can be found at http://www.tlc-networks.polito.it/oldsite/mellia/papers/DN-HunterImc12.pdf. Note that all IP addresses in the datasets are anonymized.

Traces

First Data Capture

These datasets were captured from March 24, 2012 to May 5, 2012.

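Name      File                     File Size  Flows      Devices
Campus 1  campus1_dataset1.log.gz  21 MB      167,189    283
Campus 2  campus2_dropbox.log.gz   262 MB     1,902,824  6,609
Home 1    home1_dropbox.log.gz     181 MB     1,438,369  3,350
Home 2    home2_dropbox.log.gz     82 MB      693,086    1,313

Each file can be downloaded from http://traces.simpleweb.org/dropbox/ followed by the file name.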

Second Data Capture

This dataset was captured from June 01, 2012 to July 31, 2012.

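Name      File                     File Size  Flows      Devices
Campus 1  campus1_dataset2.log.gz  32 MB      264,131    270

The file can be downloaded from http://traces.simpleweb.org/dropbox/campus1_dataset2.log.gz.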

Acceptable Use Policy

  • The user must not attempt to reverse engineer the anonymization procedure used to protect the data.
  • Users who notice vulnerabilities in the anonymization procedure are kindly asked to inform the repository administrators.
  • When writing a paper using this data, please cite:
@inproceedings{drago2012_dropbox,
  author        = {Idilio Drago and Marco Mellia and Maurizio M. Munaf\`{o} and Anna Sperotto and Ramin Sadre and Aiko Pras},
  title         = {{I}nside {D}ropbox: {U}nderstanding {P}ersonal {C}loud {S}torage {S}ervices},
  booktitle     = {Proceedings of the 12th ACM SIGCOMM Conference on Internet Measurement},
  series        = {IMC'12},
  pages         = {481-494},
  year          = {2012}
}

Format

All files are in a format similar to the log_tcp_complete format saved by Tstat (http://tstat.polito.it/measure.shtml#log_tcp_complete).

The following columns are found in these traces:

############################################################################
# C2S # S2C # Short description      # Unit  # Long description            #
############################################################################
#  1  # 45  # Client/Server IP addr  # -     # Anonymized IP addresses of the client/server
#  2  # 46  # Client/Server TCP port # -     # TCP port addresses for the client/server
#  3  # 47  # packets                # -     # total number of packets observed from the client/server
#  4  # 48  # RST sent               # 0/1   # 0 = no RST segment has been sent by the client/server
#  5  # 49  # ACK sent               # -     # number of segments with the ACK field set to 1
#  6  # 50  # PURE ACK sent          # -     # number of segments with ACK field set to 1 and no data
#  7  # 51  # unique bytes           # bytes # number of bytes sent in the payload
#  8  # 52  # data pkts              # -     # number of segments with payload
#  9  # 53  # data bytes             # bytes # number of bytes transmitted in the payload, including retransmissions
# 10  # 54  # rexmit pkts            # -     # number of retransmitted segments
# 11  # 55  # rexmit bytes           # bytes # number of retransmitted bytes
# 12  # 56  # out seq pkts           # -     # number of segments observed out of sequence
# 13  # 57  # SYN count              # -     # number of SYN segments observed (including rtx)
# 14  # 58  # FIN count              # -     # number of FIN segments observed (including rtx)
# 15  # 59  # RFC1323 ws             # 0/1   # Window scale option sent
# 16  # 60  # RFC1323 ts             # 0/1   # Timestamp option sent
# 17  # 61  # window scale           # -     # Scaling values negotiated [scale factor]
# 18  # 62  # SACK req               # 0/1   # SACK option set
# 19  # 63  # SACK sent              # -     # number of SACK messages sent
# 20  # 64  # MSS                    # bytes # MSS declared
# 21  # 65  # max seg size           # bytes # Maximum segment size observed
# 22  # 66  # min seg size           # bytes # Minimum segment size observed
# 23  # 67  # win max                # bytes # Maximum receiver window announced (already scaled by the window scale factor)
# 24  # 68  # win min                # bytes # Minimum receiver window announced (already scaled by the window scale factor)
# 25  # 69  # win zero               # -     # Total number of segments declaring zero as receiver window
# 26  # 70  # cwin max               # bytes # Maximum in-flight-size (see Tstat docs)
# 27  # 71  # cwin min               # bytes # Minimum in-flight-size
# 28  # 72  # initial cwin           # bytes # First in-flight size, or total number of unack-ed bytes sent before receiving the first ACK segment
# 29  # 73  # Average rtt            # ms    # Average RTT computed measuring the time elapsed between the data segment and the corresponding ACK
# 30  # 74  # rtt min                # ms    # Minimum RTT observed during connection lifetime
# 31  # 75  # rtt max                # ms    # Maximum RTT observed during connection lifetime
# 32  # 76  # Stdev rtt              # ms    # Standard deviation of the RTT
# 33  # 77  # rtt count              # -     # Number of valid RTT observations
# 34  # 78  # ttl_min                # -     # Minimum Time To Live
# 35  # 79  # ttl_max                # -     # Maximum Time To Live
# 36  # 80  # rtx RTO                # -     # Number of retransmitted segments due to timeout expiration
# 37  # 81  # rtx FR                 # -     # Number of retransmitted segments due to Fast Retransmit (three dup-ack)
# 38  # 82  # reordering             # -     # Number of packet reorderings observed
# 39  # 83  # net dup                # -     # Number of network duplicates observed
# 40  # 84  # unknown                # -     # Number of segments not in sequence or duplicate which are not classified as specific events
# 41  # 85  # flow control           # -     # Number of retransmitted segments to probe the receiver window
# 42  # 86  # unnece rtx RTO         # -     # Number of unnecessary transmissions following a timeout expiration
# 43  # 87  # unnece rtx FR          # -     # Number of unnecessary transmissions following a fast retransmit
# 44  # 88  # != SYN seqno           # 0/1   # 1 = retransmitted SYN segments have different initial seqno
############################################################################
# 89        # Completion time        # ms    # Flow duration from the first packet to the last packet
# 90        # First time             # ms    # Flow first packet since first segment ever
# 91        # Last time              # ms    # Flow last segment since first segment ever
# 92        # C first payload        # ms    # Client first segment with payload since the first flow segment
# 93        # S first payload        # ms    # Server first segment with payload since the first flow segment
# 94        # C last payload         # ms    # Client last segment with payload since the first flow segment
# 95        # S last payload         # ms    # Server last segment with payload since the first flow segment
# 96        # C first ack            # ms    # Client first ACK segment (without SYN) since the first flow segment
# 97        # S first ack            # ms    # Server first ACK segment (without SYN) since the first flow segment
# 98        # First time abs         # ms    # Flow first packet absolute time (epoch)
# 99        # C Internal             # 0/1   # 1 = client has internal IP, 0 = client has external IP
# 100       # S Internal             # 0/1   # 1 = server has internal IP, 0 = server has external IP
############################################################################
# 101       # Connection type        # -     # Bitmask stating the connection type as identified by TCPL7 inspection engine (see protocol.h)
############################################################################
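The column numbers in the table above are 1-indexed, so they need an offset when used as list indices. The following is a minimal sketch of that lookup, assuming that each flow record is a single line of whitespace-separated fields (the separator is not stated on this page). The helper names `field` and `summarize_flow` are hypothetical and are not part of Tstat or of the released scripts.

```python
from datetime import datetime, timezone

# 1-indexed column numbers taken from the table above.
C2S_UNIQUE_BYTES = 7
S2C_UNIQUE_BYTES = 51
COMPLETION_TIME_MS = 89
FIRST_TIME_ABS_MS = 98


def field(columns, number):
    """Return the field at a 1-indexed column number from the table."""
    return columns[number - 1]


def summarize_flow(line):
    """Extract start time, duration and byte counts from one flow record.

    Assumption: fields are separated by whitespace.
    """
    columns = line.split()
    # Column 98 is an epoch timestamp in milliseconds.
    start = datetime.fromtimestamp(
        float(field(columns, FIRST_TIME_ABS_MS)) / 1000.0, tz=timezone.utc
    )
    return {
        "start_utc": start,
        # Column 89 is the flow duration in milliseconds.
        "duration_s": float(field(columns, COMPLETION_TIME_MS)) / 1000.0,
        "c2s_bytes": int(field(columns, C2S_UNIQUE_BYTES)),
        "s2c_bytes": int(field(columns, S2C_UNIQUE_BYTES)),
    }
```

Keeping the column numbers as named constants makes it easier to check the code against the table when the format differs from the current Tstat log_tcp_complete.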

Note that the last columns of the current log_tcp_complete of Tstat are not included. Specifically for this analysis, the following extra columns were added:

############################################################################
# 102       # C2S messages           # -     # PSH-separated "messages" C2S
# 103       # S2C messages           # -     # PSH-separated "messages" S2C
# 104       # DB host_int            # -     # Anonymized Dropbox device ID
# 105       # DB service             # -     # Dropbox service inferred from the FQDN requested by the user or from server IP addresses
############################################################################

Columns 102 and 103 were added some weeks after the data capture started at some vantage points. These columns contain a "-" for the period in which the values were not yet captured.

Column 105 contains a string referring to a Dropbox service. See Sec. 2 and Tab. 1 in the paper for details. The scripts below (e.g. traffic_share.py) also contain more information on how those values were interpreted in the paper. Flows marked as "Unknown" were identified as related to Dropbox, but the destination service was unclear.
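As an illustration of how the extra columns might be consumed, the sketch below sums unique payload bytes (columns 7 and 51) and C2S messages (column 102) per service label (column 105), treating "-" in column 102 as "not captured". It is not the authors' traffic_share.py and does not reproduce the paper's methodology. It assumes whitespace-separated fields, service labels without embedded spaces, and that any line starting with "#" is a header; the function names are hypothetical.

```python
from collections import defaultdict

# 1-indexed column numbers from the tables above.
C2S_UNIQUE_BYTES = 7
S2C_UNIQUE_BYTES = 51
C2S_MESSAGES = 102
DB_SERVICE = 105


def parse_optional_int(value):
    """Return None for the "-" placeholder, otherwise the integer value."""
    return None if value == "-" else int(value)


def summarize_by_service(lines):
    """Sum unique payload bytes and C2S messages per Dropbox service label."""
    totals = defaultdict(lambda: {"bytes": 0, "c2s_messages": 0})
    for line in lines:
        # Assumption: header lines, if present, start with "#".
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split()
        if len(columns) < DB_SERVICE:
            continue  # skip truncated records
        entry = totals[columns[DB_SERVICE - 1]]
        entry["bytes"] += int(columns[C2S_UNIQUE_BYTES - 1])
        entry["bytes"] += int(columns[S2C_UNIQUE_BYTES - 1])
        messages = parse_optional_int(columns[C2S_MESSAGES - 1])
        if messages is not None:  # "-" means the value was not captured
            entry["c2s_messages"] += messages
    return {service: dict(entry) for service, entry in totals.items()}
```

Flows whose message count is "-" still contribute to the byte totals, so the message sums cover only the part of the capture in which columns 102 and 103 were recorded.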

Sample Scripts

The scripts for plotting the figures of the paper (8.5K) can be downloaded from http://traces.simpleweb.org/dropbox/scripts_v2.tar.gz. They are written in bash, awk or python. Gnuplot is required, and each figure is created by a separate bash script. For example, after unpacking the files and downloading the data, the following command creates Fig. 5; the output EPS file will be in a sub-directory called 'figs':

$ ./fig05_ts_contacted_servers.sh campus1_dataset1.log.gz campus2_dropbox.log.gz home1_dropbox.log.gz home2_dropbox.log.gz
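For analyses outside the provided scripts, the compressed traces can be read without unpacking them first. The sketch below streams a `.log.gz` file line by line with Python's standard `gzip` module. It assumes one flow record per line and skips lines starting with "#" in case a header is present; both points are assumptions, and `iter_flow_lines` and `count_flows` are hypothetical helper names.

```python
import gzip


def iter_flow_lines(path):
    """Yield non-empty, non-header lines from a gzip-compressed trace."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Assumption: header lines, if any, start with "#".
            if line and not line.startswith("#"):
                yield line


def count_flows(path):
    """Count flow records, assuming one record per line."""
    return sum(1 for _ in iter_flow_lines(path))
```

Calling `count_flows("campus1_dataset1.log.gz")` on a downloaded file gives a quick check against the Flows column of the tables above, provided the one-record-per-line assumption holds.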

List of server IPs

Dropbox server IPs are public. To protect privacy, we had to anonymize both client and server IPs in our datasets, using distinct methods. However, we release the list of Dropbox server IPs that can be obtained from the DNS (http://www.simpleweb.org/w/images/2/27/Ips.zip).

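External Links

  • Conference Website: http://www-net.cs.umass.edu/imc2012/
  • Conference Program: http://www-net.cs.umass.edu/imc2012/program.htm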