Difference between revisions of "Dropbox Crawler"

From SimpleWiki
 
(20 intermediate revisions by 2 users not shown)
Line 1: Line 1:
__NOTOC__
 
 
Personal cloud storage is becoming more and more popular, with Dropbox certainly being the best-known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?
 
Personal cloud storage is becoming more and more popular, with Dropbox certainly being the best-known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?
  
We have been doing research on the usage of Dropbox ([http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf see our results here]). As a next step, we need to know what type of files people store in the service. This would allow us to understand the impact of some technologies on the system performance and on network traffic, among other things.
+
In this experiment, we collected basic statistics about the files stored in Dropbox folders.
  
We are looking for volunteers to provide us basic statistics (see below) of what files are stored in their Dropbox folders.
+
== Datasets ==
  
== Be part of the crowd - Click on the logos to download our client ==
+
Download our datasets:
  
{| border="0"
+
{| class="wikitable" style="text-align: center; width: 400px; height: 40px;"
| align="center" width="200px" | [[Image:windows.png|link=http://www.simpleweb.org/dropbox/dropbox_crawler_win32.exe]]
 
| align="center" width="200px" | [[Image:mac.png|link=http://www.simpleweb.org/dropbox/dropbox_crawler_osx.dmg]]
 
| align="center" width="200px" | [[Image:linux.png|link=http://www.simpleweb.org/dropbox/dropbox_crawler_linux32.zip]]
 
 
|-
 
|-
| align="center" width="200px" | [http://www.simpleweb.org/dropbox/dropbox_crawler_win32.exe Window - 8.2M]
+
! scope="col" | Name
| align="center" width="200px" | [http://www.simpleweb.org/dropbox/dropbox_crawler_osx.dmg Mac OS X - 34M]
+
! scope="col" | File Size
| align="center" width="200px" | [http://www.simpleweb.org/dropbox/dropbox_crawler_linux32.zip Linux 32bits - 7.6M]
+
! scope="col" | Volunteers
|-
 
| align="center" width="200px" |
 
| align="center" width="200px" |
 
| align="center" width="200px" | [http://www.simpleweb.org/dropbox/dropbox_crawler_linux64.zip Linux 64bits - 8.4M]
 
 
|-
 
|-
 +
! scope="row" | [http://traces.simpleweb.org/dropbox/crawler/dropbox_crawler.tar.gz Crawler Dataset]
 +
| 219M || 333
 
|}
 
|}
  
=== How to run it ===
+
Some results derived from these data can be found [http://eprints.eemcs.utwente.nl/24136/01/2013_drago_thesis.pdf here].
 
 
* Download the application by clicking on the logo of your operating system
 
* Decompress the client (only Linux and OS X)
 
* Double click on the file to run it
 
 
 
If you have OS X Mountain Lion, you may need to right-click on the application after decompressing it, select "Open", and confirm that you want to run the application.
 
 
 
=== Java Version ===
 
 
 
If you have difficulties with the native versions, you can try the Java version. For running it you need the [http://www.java.com/en/download/index.jsp Java Runtime Environment 6+] in your computer. This version does not support manual proxy configuration. If you are behind a proxy, try to use the native versions, or contact us.  
 
 
{| border="0"
 
| align="center" width="200px" | [[Image:java.png|link=http://www.simpleweb.org/dropbox/dropbox_crawler_java.jar]]
 
|-
 
| align="center" width="200px" | [http://www.simpleweb.org/dropbox/dropbox_crawler_java.jar Java (requires JRE) - 270K]
 
|-
 
|}
 
  
== What our application will do? ==
+
In particular, the figures presented in Sect. 5.3 of the linked document are obtained using the
 +
scripts available in the folder "scripts" inside the tarball.
  
* Scan your Dropbox folder
+
== How does our data collection work? ==
* Calculate basic statistics
 
* '''Show you what has been collected for your approval'''
 
* Send the statistics to us
 
  
The application has been designed to be as simple as possible. In case you have any difficult, please contact us.
+
* It scans Dropbox folders
 +
* Calculates basic statistics
 +
* Shows what has been collected for approval
 +
* Sends the statistics to us
  
== What will be logged? ==
+
== What has been logged? ==
  
For each file/folder in your Dropbox, the program will collect:
+
For each file/folder in a Dropbox, the program collects:
 
<pre>
 
<pre>
 
* Size in bytes
 
* Size in bytes
Line 60: Line 38:
 
* File extension
 
* File extension
 
* MD5 Hash of both initial and final 8 kbytes of the file
 
* MD5 Hash of both initial and final 8 kbytes of the file
* MD5 Hash of the file name
+
* MD5 Hash of the file name/path
 
</pre>
 
</pre>
  
The program will also send to us:
+
The program also sends us:
 
<pre>
 
<pre>
 
* MD5 Hash of Dropbox configuration files (or MAC address if we cannot read the former)
 
* MD5 Hash of Dropbox configuration files (or MAC address if we cannot read the former)
Line 71: Line 49:
 
</pre>
 
</pre>
  
Collected information is sent via plain HTTP (let Wireshark be with you!) to a centralized collection server.
+
Collected information is sent via plain HTTP to a centralized collection server.
  
== How will we use this information? ==
+
== Client source code ==
  
Collected data, postprocessing scripts, and all results will be submitted to publication and made freely available in this website. Thus, anyone will be able to use our data sources for further researches.
+
Download the source code by clicking [http://www.simpleweb.org/dropbox/source_python.zip here] for the native versions (you will need Python 2.7 and [http://www.pyinstaller.org/ PyInstaller] for building these versions), or [http://www.simpleweb.org/dropbox/source_java.zip here] for the Java version.
  
We will, however, take extra actions to ensure that no sensitive information will be in these datasets. Note that the only information that could potentially reveal your identity is your IP address, which we will '''anonymize'''. All other statistics cannot be related to the person owning the files.
+
== More information ==
 
 
== What this program will NOT do? ==
 
 
 
* Copy any file or folder out of your computer
 
* Copy any other information than what is listed above
 
* Install or store anything in your computer
 
* ...
 
  
You can also take a look on the source code if you have any doubts about the program, recompile it on your own (and improve it :))
+
The dataset in this page is used in the following publications:
  
== Client source code ==
+
  @phdthesis{drago_understanding_2013,
 +
          author      = {Idilio Drago},
 +
          title        = {Understanding and Monitoring Cloud Services},
 +
          school      = {University of Twente},
 +
          url          = {<nowiki>\url{http://eprints.eemcs.utwente.nl/24136/</nowiki>}},
 +
          year        = {2013},
 +
  },
  
Download the source code by clicking [http://www.simpleweb.org/dropbox/source_python.zip here] for the native versions (you will need Python 2.7 and [http://www.pyinstaller.org/ PyInstaller] for building these versions), or [http://www.simpleweb.org/dropbox/source_java.zip here] for the Java version.
+
  @inproceedings{drago_caracterizacao_2013,
 +
          author      = {Idilio Drago and Alex Borges Vieira and Ana Paula Couto da Silva},
 +
          title        = {Caracteriza{\c c}{\~a}o dos Arquivos Armazenados no Dropbox},
 +
          booktitle    = {Anais do Workshop de Redes {P2P}, Din{\^a}micas, Sociais e Orientadas a Conte{\'u}do},
 +
          series      = <nowiki>{{WP2P+}}</nowiki>,
 +
          pages        = {109--114},
 +
          year        = {2013},
 +
  },
  
== More information ==
+
More information about our previous work can be found in these papers:
  
* You can find more information about our work on this paper:
+
* [http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf '''Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012''']
  
[http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf '''Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012''']
+
* [http://eprints.eemcs.utwente.nl/23674/01/cloud_storage.pdf '''Drago, I. and Bocchi, E. and Mellia, M. and Slatman, H. and Pras, A. (2013) Benchmarking personal cloud storage. In: Proceedings of the 13th ACM Internet Measurement Conference, IMC 2013, 23-25 Oct 2013, Barcelona, Spain. pp. 205-212.''']
  
* [[Dropbox Traces|This page]] has more information about the data we used in our research so far.
+
* [[Dropbox Traces|This page]] and [[Cloud benchmarks | this page]] have more traces we used in other papers.
  
 
== External Links ==
 
== External Links ==
  
These institutes are running this research:
+
These institutes are involved in this research:
* [http://www.utwente.nl/ewi/dacs/ DACS - University of Twente]
+
* [http://www.utwente.nl/ewi/dacs/ DACS - University of Twente] - Contact: Idilio Drago - idilio.drago@polito.it
* [http://www.ufjf.br/portal/ Universidade Federal de Juiz de Fora]
+
* [http://www.ufjf.br/portal/ Universidade Federal de Juiz de Fora] - Contact: Alex Vieira - alex.borges@ufjf.edu.br
* [http://www.tlc-networks.polito.it/ Telecommunication Networks Group - Politecnico di Torino]
+
* [http://www.tlc-networks.polito.it/ Telecommunication Networks Group - Politecnico di Torino] - Contact: Marco Mellia - mellia@tlc.polito.it

Latest revision as of 10:44, 9 May 2014

Personal cloud storage is becoming more and more popular, with Dropbox certainly being the best-known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?

In this experiment, we collected basic statistics about the files stored in Dropbox folders.

Datasets

Download our datasets:

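Name              File Size   Volunteers
Crawler Dataset   219M        333

The Crawler Dataset is available at http://traces.simpleweb.org/dropbox/crawler/dropbox_crawler.tar.gz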

Some results derived from these data can be found in the thesis at http://eprints.eemcs.utwente.nl/24136/01/2013_drago_thesis.pdf.

In particular, the figures presented in Sect. 5.3 of the linked document were obtained using the scripts in the "scripts" folder inside the tarball.

How does our data collection work?

  • It scans Dropbox folders
  • Calculates basic statistics
  • Shows what has been collected for approval
  • Sends the statistics to us
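
The four steps above can be sketched roughly as follows. This is only an illustration, not the code of our client: the function names (scan, summarize, run) and the approve/send callbacks are hypothetical, and the record is reduced to two fields to keep the example short.

```python
import os
from typing import Callable, Iterator


def scan(root: str) -> Iterator[dict]:
    """Step 1: walk the folder and yield one small record per file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                yield {
                    "ext": os.path.splitext(name)[1].lower(),
                    "size": os.path.getsize(path),
                }
            except OSError:
                continue  # file vanished or is unreadable; skip it


def summarize(records: list) -> dict:
    """Step 2: aggregate basic statistics over all records."""
    return {"files": len(records), "bytes": sum(r["size"] for r in records)}


def run(root: str,
        approve: Callable[[dict, list], bool],
        send: Callable[[list], None]) -> bool:
    """Run all four steps; return True only if the data was sent."""
    records = list(scan(root))
    summary = summarize(records)
    # Step 3: nothing is sent unless the approval callback returns True.
    if not approve(summary, records):
        return False
    send(records)  # Step 4
    return True
```

Keeping the approval step as a callback placed before the send call means the upload code is simply never reached when the volunteer declines.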

What has been logged?

For each file/folder in a Dropbox folder, the program collects:

* Size in bytes
* Last modification time
* Mime type of the file
* File extension
* MD5 Hash of both initial and final 8 kbytes of the file
* MD5 Hash of the file name/path
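
As a minimal sketch, a record with the fields listed above could be built like this for one regular file. Several details are assumptions made for the example and are not taken from the client: "8 kbytes" is read as 8192 bytes, the initial and final blocks are hashed separately, the MIME type is guessed from the file name, and the field names are hypothetical.

```python
import hashlib
import mimetypes
import os

CHUNK = 8 * 1024  # assumption: "8 kbytes" means 8192 bytes


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def file_record(path: str) -> dict:
    """Collect the per-file statistics for one regular file."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        head = fh.read(CHUNK)
        # Jump to the last CHUNK bytes (or to the start for short files).
        fh.seek(max(st.st_size - CHUNK, 0))
        tail = fh.read(CHUNK)
    return {
        "size": st.st_size,                        # size in bytes
        "mtime": int(st.st_mtime),                 # last modification time
        "mime": mimetypes.guess_type(path)[0],     # None if unknown
        "ext": os.path.splitext(path)[1].lower(),  # file extension
        "head_md5": md5_hex(head),                 # initial 8 kbytes
        "tail_md5": md5_hex(tail),                 # final 8 kbytes
        "path_md5": md5_hex(os.fsencode(path)),    # hashed name/path
    }
```

For files shorter than 8192 bytes, both blocks cover the whole file, so the two hashes are equal.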

The program also sends us:

* MD5 Hash of Dropbox configuration files (or MAC address if we cannot read the former)
* MD5 Hash of the path of your Dropbox home folder
* Your IP address and operating system version
* Error logs, in case something goes wrong during the data collection
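
The first item in this list, a hash of the Dropbox configuration files with the MAC address as a fallback, could look roughly like the sketch below. Which configuration files the client reads is not stated on this page, so the caller passes the paths in; the function name client_id and the returned tuple are hypothetical.

```python
import hashlib
import uuid


def client_id(config_paths):
    """Return ("config_md5", hex) or, if no file is readable, ("mac", address)."""
    digest = hashlib.md5()
    found = False
    for path in config_paths:
        try:
            with open(path, "rb") as fh:
                digest.update(fh.read())
            found = True
        except OSError:
            continue  # missing or unreadable file; try the next one
    if found:
        return ("config_md5", digest.hexdigest())
    # uuid.getnode() returns the hardware address as a 48-bit integer,
    # or a random 48-bit value if no address can be determined.
    return ("mac", "%012x" % uuid.getnode())
```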

Collected information is sent via plain HTTP to a centralized collection server.
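
A plain HTTP upload of this kind can be sketched with the Python standard library as shown below. The endpoint URL and the JSON payload layout are placeholders invented for the example; the actual server address and wire format of our client are not described on this page.

```python
import json
import urllib.request

# Hypothetical endpoint; note the "http://" scheme (no TLS).
COLLECTOR_URL = "http://collector.example.org/upload"


def build_upload(records, url=COLLECTOR_URL):
    """Build (but do not send) a POST request carrying the records as JSON."""
    body = json.dumps({"records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending would then be a single call, for example:
#   urllib.request.urlopen(build_upload(records), timeout=30)
```

Because the transfer is not encrypted, the payload can be inspected with any packet capture tool while it is being sent.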

Client source code

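Download the source code for the native versions at http://www.simpleweb.org/dropbox/source_python.zip (you will need Python 2.7 and PyInstaller, http://www.pyinstaller.org/, to build these versions), or for the Java version at http://www.simpleweb.org/dropbox/source_java.zip.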

More information

The dataset in this page is used in the following publications:

 @phdthesis{drago_understanding_2013,
         author       = {Idilio Drago},
         title        = {Understanding and Monitoring Cloud Services},
         school       = {University of Twente},
         url          = {\url{http://eprints.eemcs.utwente.nl/24136/}},
         year         = {2013},
 },
 @inproceedings{drago_caracterizacao_2013,
         author       = {Idilio Drago and Alex Borges Vieira and Ana Paula Couto da Silva},
         title        = {Caracteriza{\c c}{\~a}o dos Arquivos Armazenados no Dropbox},
         booktitle    = {Anais do Workshop de Redes {P2P}, Din{\^a}micas, Sociais e Orientadas a Conte{\'u}do},
         series       = {{WP2P+}},
         pages        = {109--114},
         year         = {2013},
 },

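More information about our previous work can be found in these papers:

  • Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012. http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf
  • Drago, I. and Bocchi, E. and Mellia, M. and Slatman, H. and Pras, A. (2013) Benchmarking personal cloud storage. In: Proceedings of the 13th ACM Internet Measurement Conference, IMC 2013, 23-25 Oct 2013, Barcelona, Spain. pp. 205-212. http://eprints.eemcs.utwente.nl/23674/01/cloud_storage.pdf
  • The "Dropbox Traces" and "Cloud benchmarks" pages have more traces we used in other papers.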

External Links

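These institutes are involved in this research:

  • DACS - University of Twente (http://www.utwente.nl/ewi/dacs/) - Contact: Idilio Drago - idilio.drago@polito.it
  • Universidade Federal de Juiz de Fora (http://www.ufjf.br/portal/) - Contact: Alex Vieira - alex.borges@ufjf.edu.br
  • Telecommunication Networks Group - Politecnico di Torino (http://www.tlc-networks.polito.it/) - Contact: Marco Mellia - mellia@tlc.polito.it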