Dropbox Crawler
Latest revision as of 09:44, 9 May 2014
Personal cloud storage is becoming more and more popular, with Dropbox certainly being the best-known example. It generates a huge amount of Internet traffic, but how does it work? How is it used? What are the possible improvements?
In this experiment, we collected basic statistics about the files stored in Dropbox folders.
Datasets
Download our datasets:
Name | File Size | Volunteers |
---|---|---|
Crawler Dataset (http://traces.simpleweb.org/dropbox/crawler/dropbox_crawler.tar.gz) | 219M | 333 |
Some results derived from these data can be found in the thesis at http://eprints.eemcs.utwente.nl/24136/01/2013_drago_thesis.pdf.
In particular, the figures presented in Sect. 5.3 of that document were produced with the scripts in the "scripts" folder inside the tarball.
How does our data collection work?
- It scans the Dropbox folders
- It calculates basic statistics
- It shows what has been collected, for the volunteer's approval
- It sends the statistics to us
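The four steps above can be sketched as a small pipeline. This is only an illustration in Python 3, not the code of our client: the function names `scan` and `run`, the choice of aggregates (file count, total bytes, count per extension), and the `approve` and `send` callbacks are all assumptions made for the example.

```python
import os
from collections import Counter
from typing import Callable


def scan(root: str) -> dict:
    """Walk a folder tree and compute simple aggregate statistics."""
    count, total, by_ext = 0, 0, Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            count += 1
            total += size
            by_ext[os.path.splitext(name)[1].lower()] += 1
    return {"files": count, "bytes": total, "by_extension": dict(by_ext)}


def run(root: str,
        approve: Callable[[dict], bool],
        send: Callable[[dict], None]) -> bool:
    """Scan, show the result for approval, and send only if approved."""
    stats = scan(root)
    if not approve(stats):  # hypothetical hook: e.g. print and ask y/n
        return False
    send(stats)  # hypothetical hook: e.g. an HTTP upload
    return True
```

The point of the structure is that nothing leaves the machine unless `approve` returns `True`.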
What has been logged?
For each file or folder in a Dropbox folder, the program collects:
- Size in bytes
- Last modification time
- MIME type of the file
- File extension
- MD5 Hash of both initial and final 8 kbytes of the file
- MD5 Hash of the file name/path
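As a rough illustration of the fields listed above, the following Python 3 sketch builds one record for a regular file. It is not taken from our client: the function name `file_record` and the dictionary keys are hypothetical, "8 kbytes" is read as 8192 bytes, the initial and final chunks are hashed separately, and the MIME type is guessed from the file name only.

```python
import hashlib
import mimetypes
import os

CHUNK = 8 * 1024  # assumption: "8 kbytes" means 8192 bytes


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def file_record(path: str) -> dict:
    """Collect the per-file fields for one regular file."""
    st = os.stat(path)
    with open(path, "rb") as fh:
        head = fh.read(CHUNK)
        # Jump to the last 8192 bytes; for smaller files this is offset 0,
        # so head and tail cover the same bytes.
        fh.seek(max(st.st_size - CHUNK, 0))
        tail = fh.read(CHUNK)
    mime, _encoding = mimetypes.guess_type(path)  # None if unknown
    return {
        "size": st.st_size,
        "mtime": int(st.st_mtime),
        "mime": mime,
        "ext": os.path.splitext(path)[1].lower(),
        "head_md5": md5_hex(head),
        "tail_md5": md5_hex(tail),
        # Only a hash of the name/path is kept, never the name itself.
        "path_md5": md5_hex(path.encode("utf-8")),
    }
```

Hashing only the first and last chunks keeps the scan cheap on large files, at the cost of not detecting changes in the middle of a file.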
The program also sends us:
- MD5 Hash of Dropbox configuration files (or MAC address if we cannot read the former)
- MD5 Hash of the path of your Dropbox home folder
- Your IP address and operating system version
- Error logs, in case something goes wrong during the data collection
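A minimal Python 3 sketch of the per-installation fields, including the fallback to the MAC address, could look as follows. The function name `host_record`, its parameters, and the dictionary keys are hypothetical, and which configuration file is hashed is left to the caller. The IP address is not gathered here, on the assumption that a collection server can take it from the incoming connection.

```python
import hashlib
import os
import platform
import uuid


def host_record(config_path: str, dropbox_home: str) -> dict:
    """Build the per-installation fields for one machine."""
    try:
        with open(config_path, "rb") as fh:
            host_id = hashlib.md5(fh.read()).hexdigest()
    except OSError:
        # Fallback: the MAC address as 12 hex digits. Note that
        # uuid.getnode() may return a random value if no MAC is found.
        host_id = "%012x" % uuid.getnode()
    return {
        "host_id": host_id,
        "home_md5": hashlib.md5(dropbox_home.encode("utf-8")).hexdigest(),
        "os": platform.platform(),
    }
```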
Collected information is sent via plain HTTP to a centralized collection server.
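An upload over plain HTTP can be sketched with the Python 3 standard library as below. The endpoint URL, the JSON encoding, and the names `build_upload` and `send` are assumptions made for this example; the wire format actually used by our client is not described on this page.

```python
import json
import urllib.request

# Hypothetical endpoint, used only for illustration.
COLLECTOR_URL = "http://collector.example.org/upload"


def build_upload(report: dict, url: str = COLLECTOR_URL) -> urllib.request.Request:
    """Prepare an HTTP POST carrying the report as a JSON body."""
    body = json.dumps(report).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(report: dict, url: str = COLLECTOR_URL, timeout: float = 30.0) -> int:
    """Send the report and return the HTTP status code."""
    with urllib.request.urlopen(build_upload(report, url), timeout=timeout) as resp:
        return resp.status
```

Because the transfer is unencrypted, it matters that names and paths are reduced to hashes before they are sent.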
Client source code
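Download the source code of the native versions from http://www.simpleweb.org/dropbox/source_python.zip (building these versions requires Python 2.7 and PyInstaller, http://www.pyinstaller.org/), or the source code of the Java version from http://www.simpleweb.org/dropbox/source_java.zip.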
More information
The dataset on this page is used in the following publications:
@phdthesis{drago_understanding_2013,
  author = {Idilio Drago},
  title = {Understanding and Monitoring Cloud Services},
  school = {University of Twente},
  url = {\url{http://eprints.eemcs.utwente.nl/24136/}},
  year = {2013},
}

@inproceedings{drago_caracterizacao_2013,
  author = {Idilio Drago and Alex Borges Vieira and Ana Paula Couto da Silva},
  title = {Caracteriza{\c c}{\~a}o dos Arquivos Armazenados no Dropbox},
  booktitle = {Anais do Workshop de Redes {P2P}, Din{\^a}micas, Sociais e Orientadas a Conte{\'u}do},
  series = {{WP2P+}},
  pages = {109--114},
  year = {2013},
}
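More information about our previous work can be found in these papers:
- Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012. http://eprints.eemcs.utwente.nl/22286/01/imc140-drago.pdf
- Drago, I. and Bocchi, E. and Mellia, M. and Slatman, H. and Pras, A. (2013) Benchmarking personal cloud storage. In: Proceedings of the 13th ACM Internet Measurement Conference, IMC 2013, 23-25 Oct 2013, Barcelona, Spain. pp. 205-212. http://eprints.eemcs.utwente.nl/23674/01/cloud_storage.pdf
- The Dropbox Traces and Cloud benchmarks pages have more traces that we used in other papers.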
External Links
These institutes are involved in this research:
- DACS - University of Twente (http://www.utwente.nl/ewi/dacs/) - Contact: Idilio Drago - idilio.drago@polito.it
- Universidade Federal de Juiz de Fora (http://www.ufjf.br/portal/) - Contact: Alex Vieira - alex.borges@ufjf.edu.br
- Telecommunication Networks Group - Politecnico di Torino (http://www.tlc-networks.polito.it/) - Contact: Marco Mellia - mellia@tlc.polito.it