Difference between revisions of "Dropbox Traces"
Revision as of 09:10, 7 September 2012
You can download from this page the flow data used in the following paper:
- Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. In: Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012
As described in the paper, the data were captured at 4 vantage points in 2 European countries. Most of the data were collected from March 24, 2012 to May 5, 2012. A second dataset was collected at Campus 1 in June and July 2012 to complement the analysis.
All data were captured using Tstat, an open-source monitoring tool developed at Politecnico di Torino. Tstat exports flow data containing more than 100 metrics. The source code of Tstat can be obtained from http://tstat.tlc.polito.it.
Note that ***all IP addresses*** are anonymized.
Traces
First data capture
- Campus 1
- Campus 2 (soon)
- Home 1 (soon)
- Home 2 (soon)
Second data capture
- Campus 1
Acceptable Use Policy
- The user must not attempt to reverse engineer the anonymization procedure used to protect the data.
- A user who notices vulnerabilities in the anonymization procedure is kindly asked to inform the repository administrators.
- When writing a paper using this data, we ask the user to cite:
@inproceedings{drago2012_dropbox,
  author    = {Idilio Drago and Marco Mellia and Maurizio M. Munaf\`{o} and Anna Sperotto and Ramin Sadre and Aiko Pras},
  title     = {{I}nside {D}ropbox: {U}nderstanding {P}ersonal {C}loud {S}torage {S}ervices},
  booktitle = {Proceedings of the 12th ACM SIGCOMM Conference on Internet Measurement},
  series    = {IMC'12},
  pages     = {},
  year      = {2012}
}
Format
All files are in a format similar to the log_tcp_complete saved by Tstat.
The following columns are found in these traces:
############################################################################
# C2S # S2C # Short description      # Unit  # Long description            #
############################################################################
# 1   # 45  # Client/Server IP addr  # -     # IP addresses of the client/server
# 2   # 46  # Client/Server TCP port # -     # TCP port addresses for the client/server
# 3   # 47  # packets                # -     # total number of packets observed from the client/server
# 4   # 48  # RST sent               # 0/1   # 0 = no RST segment has been sent by the client/server
# 5   # 49  # ACK sent               # -     # number of segments with the ACK field set to 1
# 6   # 50  # PURE ACK sent          # -     # number of segments with ACK field set to 1 and no data
# 7   # 51  # unique bytes           # bytes # number of bytes sent in the payload
# 8   # 52  # data pkts              # -     # number of segments with payload
# 9   # 53  # data bytes             # bytes # number of bytes transmitted in the payload, including retransmissions
# 10  # 54  # rexmit pkts            # -     # number of retransmitted segments
# 11  # 55  # rexmit bytes           # bytes # number of retransmitted bytes
# 12  # 56  # out seq pkts           # -     # number of segments observed out of sequence
# 13  # 57  # SYN count              # -     # number of SYN segments observed (including rtx)
# 14  # 58  # FIN count              # -     # number of FIN segments observed (including rtx)
# 15  # 59  # RFC1323 ws             # 0/1   # Window scale option sent
# 16  # 60  # RFC1323 ts             # 0/1   # Timestamp option sent
# 17  # 61  # window scale           # -     # Scaling values negotiated [scale factor]
# 18  # 62  # SACK req               # 0/1   # SACK option set
# 19  # 63  # SACK sent              # -     # number of SACK messages sent
# 20  # 64  # MSS                    # bytes # MSS declared
# 21  # 65  # max seg size           # bytes # Maximum segment size observed
# 22  # 66  # min seg size           # bytes # Minimum segment size observed
# 23  # 67  # win max                # bytes # Maximum receiver window announced (already scaled by the window scale factor)
# 24  # 68  # win min                # bytes # Minimum receiver window announced (already scaled by the window scale factor)
# 25  # 69  # win zero               # -     # Total number of segments declaring zero as receiver window
# 26  # 70  # cwin max               # bytes # Maximum in-flight-size (see Tstat docs)
# 27  # 71  # cwin min               # bytes # Minimum in-flight-size
# 28  # 72  # initial cwin           # bytes # First in-flight size, or total number of unack-ed bytes sent before receiving the first ACK segment
# 29  # 73  # Average rtt            # ms    # Average RTT computed measuring the time elapsed between the data segment and the corresponding ACK
# 30  # 74  # rtt min                # ms    # Minimum RTT observed during connection lifetime
# 31  # 75  # rtt max                # ms    # Maximum RTT observed during connection lifetime
# 32  # 76  # Stdev rtt              # ms    # Standard deviation of the RTT
# 33  # 77  # rtt count              # -     # Number of valid RTT observations
# 34  # 78  # ttl_min                # -     # Minimum Time To Live
# 35  # 79  # ttl_max                # -     # Maximum Time To Live
# 36  # 80  # rtx RTO                # -     # Number of retransmitted segments due to timeout expiration
# 37  # 81  # rtx FR                 # -     # Number of retransmitted segments due to Fast Retransmit (three dup-ack)
# 38  # 82  # reordering             # -     # Number of packet reordering observed
# 39  # 83  # net dup                # -     # Number of network duplicates observed
# 40  # 84  # unknown                # -     # Number of segments not in sequence or duplicate which are not classified as specific events
# 41  # 85  # flow control           # -     # Number of retransmitted segments to probe the receiver window
# 42  # 86  # unnece rtx RTO         # -     # Number of unnecessary transmissions following a timeout expiration
# 43  # 87  # unnece rtx FR          # -     # Number of unnecessary transmissions following a fast retransmit
# 44  # 88  # != SYN seqno           # 0/1   # 1 = retransmitted SYN segments have different initial seqno
############################################################################
# 89        # Completion time        # ms    # Flow duration since first packet to last packet
# 90        # First time             # ms    # Flow first packet since first segment ever
# 91        # Last time              # ms    # Flow last segment since first segment ever
# 92        # C first payload        # ms    # Client first segment with payload since the first flow segment
# 93        # S first payload        # ms    # Server first segment with payload since the first flow segment
# 94        # C last payload         # ms    # Client last segment with payload since the first flow segment
# 95        # S last payload         # ms    # Server last segment with payload since the first flow segment
# 96        # C first ack            # ms    # Client first ACK segment (without SYN) since the first flow segment
# 97        # S first ack            # ms    # Server first ACK segment (without SYN) since the first flow segment
# 98        # First time abs         # ms    # Flow first packet absolute time (epoch)
# 99        # C Internal             # 0/1   # 1 = client has internal IP, 0 = client has external IP
# 100       # S Internal             # 0/1   # 1 = server has internal IP, 0 = server has external IP
############################################################################
# 101       # Connection type        # -     # Bitmask stating the connection type as identified by TCPL7 inspection engine (see protocol.h)
############################################################################
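The column numbering above can be turned into a small lookup helper. The following is a minimal Python sketch, not part of Tstat. It assumes that the fields of each record are separated by whitespace and that columns are numbered from 1, as in the table. All names (S2C_OFFSET, C2S_COLUMNS, column, direction_value, flow_start) are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone

# Offset between a C2S column and its S2C counterpart (1 -> 45, 44 -> 88).
S2C_OFFSET = 44

# A few 1-based C2S column numbers copied from the table above.
C2S_COLUMNS = {"ip": 1, "port": 2, "packets": 3, "unique_bytes": 7, "data_bytes": 9}

FIRST_TIME_ABS = 98  # flow start, in milliseconds since the epoch


def column(fields, number):
    """Return the field at a 1-based column number."""
    return fields[number - 1]


def direction_value(fields, name, direction="c2s"):
    """Return a per-direction metric by name for 'c2s' or 's2c'."""
    number = C2S_COLUMNS[name]
    if direction == "s2c":
        number += S2C_OFFSET
    elif direction != "c2s":
        raise ValueError("direction must be 'c2s' or 's2c'")
    return column(fields, number)


def flow_start(fields):
    """Convert column 98 (epoch milliseconds) to a UTC datetime."""
    millis = float(column(fields, FIRST_TIME_ABS))
    return datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc)
```

With fields = line.split() for one record, direction_value(fields, "unique_bytes", "s2c") reads column 51, the server-to-client counterpart of column 7.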
Note that the last columns differ from those in the current stable Tstat version. The following extra columns were added specifically for this analysis:
############################################################################
# 102       # C2S messages           # -     # PSH-separated "messages" C2S
# 103       # S2C messages           # -     # PSH-separated "messages" S2C
# 104       # DB host_int            # -     # Anonymized Dropbox device ID
# 105       # DB service             # -     # Dropbox service inferred from the FQDN requested by the user
############################################################################
Note that columns 102 and 103 were added some weeks after the data capture started at some vantage points. These columns contain a "-" for the period in which the value was not yet captured.
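Because columns 102 and 103 may hold "-" instead of a value, code reading these files has to treat them as optional. The following is a minimal Python sketch under two assumptions: fields are whitespace-separated, and the message columns hold integer counts when present. The helper names are hypothetical.

```python
def parse_optional_int(value):
    """Return None for the "-" placeholder, otherwise the integer value."""
    return None if value == "-" else int(value)


def parse_extra_columns(line):
    """Extract the extra columns 102-105 from one flow record."""
    fields = line.split()
    if len(fields) < 105:
        raise ValueError(f"expected at least 105 columns, got {len(fields)}")
    return {
        # Columns 102 and 103: assumed to be integer counts, or "-" if not captured.
        "c2s_messages": parse_optional_int(fields[101]),
        "s2c_messages": parse_optional_int(fields[102]),
        # Columns 104 and 105 are kept as raw strings.
        "db_host_int": fields[103],
        "db_service": fields[104],
    }
```

Returning None rather than 0 keeps "not yet captured" distinct from a flow with no messages, which matters when averaging over the whole capture period.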