Difference between revisions of "Dropbox Crawler"

From SimpleWiki
Line 1: Line 1:
__NOTOC__
+
Personal cloud storage is becoming more and more popular, with Dropbox certainly being the best known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?
== We are currently analyzing the results of the experiment. The data will be available to the public soon. ==
+
 
 +
We have been doing research on the usage of Dropbox ([http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf see our results here]). In this experiment, we collected basic statistics of what files are stored in Dropbox folders.
 +
 
 +
== Datasets ==
  
Personal cloud storage is becoming more and more popular, with Dropbox certainly being the best known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?
+
Download our datasets from here:
  
We have been doing research on the usage of Dropbox ([http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf see our results here]). As a next step, we need to know what type of files people store in the service. This would allow us to understand the impact of some technologies on the system performance and on network traffic, among other things.
 
  
In this experiment we collect basic statistics (see below) of what files are stored in Dropbox folders.
 
  
== What does our application do? ==
+
== How does our data collection work? ==
  
* Scan Dropbox folders
+
* It scans Dropbox folders
* Calculate basic statistics
+
* Calculates basic statistics
* Show you what has been collected for approval
+
* Shows what has been collected for approval
* Send the statistics to us
+
* Sends the statistics to us
  
== What is logged? ==
+
== What has been logged? ==
  
 
For each file/folder in a Dropbox, the program collects:
 
Line 37: Line 38:
 
Collected information is sent via plain HTTP to a centralized collection server.
 
  
== How will we use this information? ==
 
  
Collected data, postprocessing scripts, and all results will be submitted for publication and made freely available on this website, so anyone will be able to use our data for further research.
 
  
We will, however, take extra actions to ensure that no sensitive information will be in these datasets. Note that the only information that could potentially reveal identity is the IP addresses, which we '''anonymize'''. All other statistics cannot be related to the person owning the files.
+
== What did our program NOT do? ==
 
 
== What will this program NOT do? ==
 
  
 
* Copy any file or folder out of computers
 
 
* Copy any other information than what is listed above
 
* Install or store anything in your computer
+
* Install or store anything  
  
 
We also release the source code of our program. Recompile it on your own -- and improve it :)
 
Line 54: Line 51:
  
 
Download the source code by clicking [http://www.simpleweb.org/dropbox/source_python.zip here] for the native versions (you will need Python 2.7 and [http://www.pyinstaller.org/ PyInstaller] for building these versions), or [http://www.simpleweb.org/dropbox/source_java.zip here] for the Java version.
 
 +
 +
== More information ==
 +
 +
The datasets in this page are used in the following publications:
 +
 +
@phdthesis{drago_understanding_2013,
 +
        author      = {Idilio Drago},
 +
        title        = {Understanding and Monitoring Cloud Services},
 +
        school      = {University of Twente},
 +
        url          = {\url{http://eprints.eemcs.utwente.nl/24136/}},
 +
        year        = {2013},
 +
},
 +
 +
@inproceedings{drago_caracterizacao_2013,
 +
        author      = {Idilio Drago and Alex Borges Vieira and Ana Paula Couto da Silva},
 +
        title        = {Caracteriza{\c c}{\~a}o dos Arquivos Armazenados no Dropbox},
 +
        booktitle    = {Anais do Workshop de Redes {P2P}, Din{\^a}micas, Sociais e Orientadas a Conte{\'u}do},
 +
        series      = {{WP2P+}},
 +
        pages        = {109--114},
 +
        year        = {2013},
 +
},
 +
  
 
== More information ==
 
Line 62: Line 81:
  
 
* [[Dropbox Traces|This page]] has more information about the data we used in our research so far.
 
 +
  
 
== External Links ==
 
  
These institutes are running this research:
+
These institutes are involved in this research:
* [http://www.utwente.nl/ewi/dacs/ DACS - University of Twente] - Contact: Idilio Drago - i.drago@utwente.nl
+
* [http://www.utwente.nl/ewi/dacs/ DACS - University of Twente] - Contact: Idilio Drago - idilio.drago@polito.it
 
* [http://www.ufjf.br/portal/ Universidade Federal de Juiz de Fora] Contact: Alex Vieira - alex.borges@ufjf.edu.br
 
 
* [http://www.tlc-networks.polito.it/ Telecommunication Networks Group - Politecnico di Torino] - Marco Mellia - mellia@tlc.polito.it
 

Revision as of 09:14, 9 May 2014

Personal cloud storage is becoming more and more popular, with Dropbox certainly being the best known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?

We have been doing research on the usage of Dropbox (see our results at http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf). In this experiment, we collected basic statistics about the files stored in Dropbox folders.

Datasets

Download our datasets from here:


How does our data collection work?

  • Scans Dropbox folders
  • Calculates basic statistics
  • Shows what has been collected for approval
  • Sends the statistics to us
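
The first two steps can be illustrated with a minimal sketch. It is not the released client: it assumes Python 3 (the released native client targets Python 2.7), and the names scan_folder and format_summary are hypothetical. It walks a folder, aggregates a few simple statistics, and formats a summary that could be shown to the user for approval before anything is sent.

```python
import os
from collections import Counter


def scan_folder(root: str) -> dict:
    """Walk `root` and aggregate simple statistics about the files in it."""
    total_bytes = 0
    file_count = 0
    by_extension = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # unreadable or vanished file: skip it
            file_count += 1
            total_bytes += size
            # Count files per lower-cased extension ("" when there is none).
            by_extension[os.path.splitext(name)[1].lower()] += 1
    return {
        "files": file_count,
        "bytes": total_bytes,
        "extensions": dict(by_extension),
    }


def format_summary(stats: dict) -> str:
    """Render the aggregated statistics as text for the user to review."""
    lines = [f"{stats['files']} files, {stats['bytes']} bytes"]
    for ext, count in sorted(stats["extensions"].items()):
        lines.append(f"  {ext or '(none)'}: {count}")
    return "\n".join(lines)
```

Calling format_summary(scan_folder(path)) on a Dropbox folder gives a short report; the approval and upload steps are deliberately left out of this sketch.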

What has been logged?

For each file/folder in a Dropbox, the program collects:

* Size in bytes
* Last modification time
* MIME type of the file
* File extension
* MD5 hashes of the first and the last 8 kbytes of the file
* MD5 hash of the file name/path
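
As a rough illustration of the fields listed above, the following sketch builds one record per file. It is not taken from the released source code: it assumes Python 3, reads "8 kbytes" as 8 * 1024 bytes, guesses the MIME type from the file name, and uses a hypothetical function name (file_record) and hypothetical field names.

```python
import hashlib
import mimetypes
import os

CHUNK = 8 * 1024  # assumption: "8 kbytes" means 8 * 1024 bytes


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of `data`."""
    return hashlib.md5(data).hexdigest()


def file_record(path: str) -> dict:
    """Collect the per-file statistics for a single file."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        head = fh.read(CHUNK)
        # Jump to the last CHUNK bytes; for small files this is the start,
        # so head and tail cover the same bytes.
        fh.seek(max(st.st_size - CHUNK, 0))
        tail = fh.read(CHUNK)
    mime, _encoding = mimetypes.guess_type(path)  # None if unknown
    return {
        "size": st.st_size,
        "mtime": int(st.st_mtime),
        "mime": mime,
        "extension": os.path.splitext(path)[1].lower(),
        "md5_head": md5_hex(head),
        "md5_tail": md5_hex(tail),
        # Only a hash of the path is kept, never the path itself.
        "md5_path": md5_hex(os.fsencode(path)),
    }
```

Hashing only the first and last blocks keeps the scan cheap on large files, at the cost of not identifying files that differ only in the middle.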

The program also sends us:

* MD5 hash of the Dropbox configuration files (or the MAC address if we cannot read them)
* MD5 hash of the path of your Dropbox home folder
* Your IP address and operating system version
* Error logs, in case something goes wrong during the data collection

Collected information is sent via plain HTTP to a centralized collection server.


What did our program NOT do?

  • Copy any file or folder off your computer
  • Copy any information other than what is listed above
  • Install or store anything

We also release the source code of our program. Recompile it on your own -- and improve it :)

Client source code

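Download the source code from http://www.simpleweb.org/dropbox/source_python.zip for the native versions (you will need Python 2.7 and PyInstaller, http://www.pyinstaller.org/, to build them), or from http://www.simpleweb.org/dropbox/source_java.zip for the Java version.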

More information

The datasets on this page are used in the following publications:

@phdthesis{drago_understanding_2013,

       author       = {Idilio Drago},
       title        = {Understanding and Monitoring Cloud Services},
       school       = {University of Twente},
       url          = {\url{http://eprints.eemcs.utwente.nl/24136/}},
       year         = {2013},

},

@inproceedings{drago_caracterizacao_2013,

       author       = {Idilio Drago and Alex Borges Vieira and Ana Paula Couto da Silva},
       title        = {Caracteriza{\c c}{\~a}o dos Arquivos Armazenados no Dropbox},
       booktitle    = {Anais do Workshop de Redes {P2P}, Din{\^a}micas, Sociais e Orientadas a Conte{\'u}do},
       series       = {{WP2P+}},
       pages        = {109--114},
       year         = {2013},

},


More information

  • More information about our work can be found in this paper:

Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012

  • The Dropbox Traces page has more information about the data we have used in our research so far.


External Links

These institutes are involved in this research: