Difference between revisions of "Dropbox Crawler"


Revision as of 21:50, 6 January 2013

Introduction

We are crawling Dropbox file metadata to support our research. It is important for us to understand how users store files on Dropbox: for example, how large the files are and which kinds of files users keep there.


Help our research

To run our crawler, you can try loading it directly from our page by clicking here. Alternatively, download the Jar package and run it (double-click it on most operating systems, or run java -jar HelpOurResearch.jar).

Launch Java Application


What will be captured?

What we do:

We read your entire Dropbox folder; we collect basic statistics (the log format is described in the Format section below); we send these statistics to our web server.


What we DO NOT do:

We do not copy any file content; we do not copy file or folder names; we do not copy any personal information; we do not install or store anything on your computer.
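
A minimal sketch of this kind of metadata-only collection is shown below. It is not the project's crawler: the class name FolderStatsSketch and all method names are hypothetical, UTF-8 is an assumed encoding for hashing the name, and the output line covers only a subset of the logged attributes (length, modification time, extension, and MD5 of the file name). File contents are never read, and the file name itself never appears in the output.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not the project's crawler.
class FolderStatsSketch {

    // Hex-encoded MD5 of a string (UTF-8 encoding is an assumption).
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Substring after the last "." of the name, or "" if there is no ".".
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    // Builds one "#"-separated line from metadata only:
    // size in bytes, modification time in Unix seconds, extension, MD5 of name.
    static String describe(Path file) throws IOException {
        String name = file.getFileName().toString();
        long seconds = Files.getLastModifiedTime(file).toMillis() / 1000L;
        return Files.size(file) + "#" + seconds + "#"
                + extensionOf(name) + "#" + md5Hex(name);
    }

    // Walks the folder recursively and describes every regular file.
    static List<String> collect(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            lines.add(describe(file));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // The folder to scan is passed as the first argument (default: current folder).
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (String line : collect(root)) {
            System.out.println(line);
        }
    }
}
```

Running the sketch with a folder path as its argument prints one line per file. Sending the lines to a server is left out, since the upload mechanism is not described here.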


Client source code

Download the Java source code used to capture file information. The project can be opened directly in NetBeans, version 7.2.1.


Policy

We ensure that:

All data we collect are anonymized. We do not copy any file content. We do not collect any personal information or file/directory names.


We will also make our data publicly available in the near future, so that anyone will be able to use this data source.

Format

All log files use a simple format. Each line holds the attributes of one file, separated by #.

The following columns are found in these traces:

############################################################################
#     #     # Short description      # Unit  # Long description            #
############################################################################
#  1  #     # Length                 # -     # File size in bytes
#  2  #     # Modified               # -     # Last modification on file (Unix date/time format)
#  3  #     # MIME                   # -     # File Mime Type using Magic Java Unit
#  4  #     # EXTENSION              # -     # File extension (substring after the last "." on the string)
#  5  #     # MD5                    # -     # MD5 hash code of the initial/final 8 bytes of the file.
#  6  #     # MD5 of the name        # -     # MD5 hash code of file name string.
############################################################################
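
As an illustration of how such a line might be read back, here is a hedged Java sketch. It assumes that each trace line consists of exactly the six fields above, in column order, joined by "#" with no leading or trailing separator; the actual files may differ. The class name TraceLineSketch and the sample line are hypothetical and are not taken from the traces.

```java
// Hypothetical reader for one trace line; field order follows the column table.
class TraceLineSketch {
    final long length;        // column 1: file size in bytes
    final long modified;      // column 2: last modification, Unix time
    final String mime;        // column 3: MIME type
    final String extension;   // column 4: file extension
    final String contentMd5;  // column 5: MD5 over bytes of the file
    final String nameMd5;     // column 6: MD5 of the file name

    TraceLineSketch(long length, long modified, String mime,
                    String extension, String contentMd5, String nameMd5) {
        this.length = length;
        this.modified = modified;
        this.mime = mime;
        this.extension = extension;
        this.contentMd5 = contentMd5;
        this.nameMd5 = nameMd5;
    }

    // Splits on "#" and keeps empty fields (for example, a missing extension).
    static TraceLineSketch parse(String line) {
        String[] fields = line.split("#", -1);
        if (fields.length != 6) {
            throw new IllegalArgumentException(
                    "expected 6 fields, found " + fields.length);
        }
        return new TraceLineSketch(
                Long.parseLong(fields[0].trim()),
                Long.parseLong(fields[1].trim()),
                fields[2].trim(),
                fields[3].trim(),
                fields[4].trim(),
                fields[5].trim());
    }

    public static void main(String[] args) {
        // Made-up sample line, for illustration only.
        String sample = "1024#1357250400#application/pdf#pdf#"
                + "900150983cd24fb0d6963f7d28e17f72#d41d8cd98f00b204e9800998ecf8427e";
        TraceLineSketch entry = parse(sample);
        System.out.println(entry.length + " bytes, extension " + entry.extension);
    }
}
```

Rejecting lines that do not have exactly six fields is a deliberate choice in this sketch: a "#" inside a field would shift the columns, and failing loudly is safer than misreading the data.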


More information

You can find more information about Dropbox in our previous work:

Drago, I. and Mellia, M. and Munafò, M. M. and Sperotto, A. and Sadre, R. and Pras, A. (2012) Inside Dropbox: Understanding Personal Cloud Storage Services. Proceedings of the 12th ACM Internet Measurement Conference - IMC'12, Boston, Nov. 2012

As described in the paper, the data was captured at 4 vantage points in 2 European countries. The first 4 files were collected from March 24, 2012 to May 5, 2012. A second dataset was collected in Campus 1 in June and July 2012 to complement the analysis.

The data was captured using Tstat, an open-source monitoring tool developed at Politecnico di Torino. Tstat exports flow data containing more than 100 metrics. The source code of Tstat can be obtained from here. More information about the DN-Hunter version of Tstat, needed for some experiments, can be found here. Note that all IP addresses in the datasets are anonymized.


External Links