SSH datasets

Latest revision as of 20:40, 23 October 2014

This page provides access to the materials accompanying the following publication:

SSH Compromise Detection using NetFlow/IPFIX
Rick Hofstede, Luuk Hendriks, Anna Sperotto, Aiko Pras. In: ACM SIGCOMM Computer Communication Review, Vol. 44, No. 5, 2014, ISSN 0146-4833, pp. 20-26.

@article{hofstede2014_ccr,
  author        = {Hofstede, Rick and Hendriks, Luuk and Sperotto, Anna and Pras, Aiko},
  title         = {SSH Compromise Detection using NetFlow/IPFIX},
  journal       = {ACM SIGCOMM Computer Communication Review},
  volume        = {44},
  number        = {5},
  pages         = {20--26},
  year          = {2014},
}

More information regarding this publication can be found at http://dx.doi.org/10.1145/2677046.2677050. Any use of the materials provided on this page (the datasets, the scripts, the paper itself, or any content derived from them) should reference this publication.

Datasets

Two datasets were used in the paper referenced above, each consisting of one month of flow data and log files. They were collected on the campus network of the University of Twente (http://www.utwente.nl), The Netherlands.

Dataset  Type       File name               File size  MD5
1        Flow data  dataset1_flow_data.tgz  1.1 GB     7a52c41bb8d742a01ca9e8374a49bb3b
1        Log files  dataset1_log_files.tgz  12 MB      ab5f0a593aa69c45efb025fc34ec00fc
2        Flow data  dataset2_flow_data.tgz  1.3 GB     a125ba2253839ec3f12428e235730ed3
2        Log files  dataset2_log_files.tgz  101 MB     548e1c9ff4ccb423a0f5bb199898ed44

Each file can be downloaded from http://traces.simpleweb.org/ssh_datasets/<file name>.
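
After downloading, the archives can be checked against the MD5 digests listed above. The following is a minimal sketch of such a check; the function names and the chunk size are illustrative choices, not part of the published materials.

```python
import hashlib
from pathlib import Path

# MD5 digests as listed for the four archives on this page.
EXPECTED_MD5 = {
    "dataset1_flow_data.tgz": "7a52c41bb8d742a01ca9e8374a49bb3b",
    "dataset1_log_files.tgz": "ab5f0a593aa69c45efb025fc34ec00fc",
    "dataset2_flow_data.tgz": "a125ba2253839ec3f12428e235730ed3",
    "dataset2_log_files.tgz": "548e1c9ff4ccb423a0f5bb199898ed44",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the digest listed for its name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"no MD5 digest listed for {path.name}")
    return md5_of_file(path) == expected
```

Calling verify_download("dataset1_flow_data.tgz") on an intact download should return True; a False result suggests a corrupted or incomplete transfer.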

Some results derived from these datasets, as well as an extensive description of the types of hosts whose data make up these datasets, can be found at http://dl.acm.org/citation.cfm?id=J101.

Flow data

The flow data has been exported by two types of flow exporters:

  1. router-b, router-c: Cisco Catalyst 6500 with SUP720 supervisor module (PFC3, MSFC3, EARL7)
  2. router-s, router-t: Cisco Catalyst 6500 with SUP2T supervisor module (PFC4, MSFC5, EARL8)

The data has been collected using nfcapd (http://nfdump.sf.net). Neither packet sampling nor flow sampling has been applied. The following post-processing operations have, however, been performed:

  1. Filtering: Only SSH data has been selected, i.e., the following nfdump filter has been used: port 22 and proto tcp.
  2. Anonymization: nfanon has been used for anonymizing the flow data in a prefix-preserving manner. More precisely, nfanon relies on the CryptoPAn (Cryptography-based Prefix-preserving Anonymization) module.
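
Prefix-preserving anonymization means that two addresses sharing a prefix of k bits before anonymization share a prefix of exactly k bits afterwards, so subnet structure stays visible while the real addresses do not. The sketch below illustrates only this property, using a toy bit-by-bit construction keyed with HMAC-SHA256. It is not CryptoPAn and not the code used by nfanon; the function names and the key are hypothetical.

```python
import hashlib
import hmac
import ipaddress


def toy_prefix_preserving(addr, key):
    """Toy prefix-preserving mapping for IPv4 (illustration only, NOT CryptoPAn).

    Output bit i is input bit i XOR a keyed pseudo-random bit that depends
    only on the i most significant input bits.
    """
    value = int(ipaddress.IPv4Address(addr))
    result = 0
    for i in range(32):
        prefix = value >> (32 - i)  # the i most significant bits (0 when i == 0)
        message = f"{i}:{prefix}".encode()
        flip = hmac.new(key, message, hashlib.sha256).digest()[0] & 1
        bit = (value >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


def common_prefix_len(a, b):
    """Number of leading bits shared by two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


if __name__ == "__main__":
    key = b"example-key"  # hypothetical key, for demonstration only
    a, b = "192.0.2.1", "192.0.2.200"
    anon_a = toy_prefix_preserving(a, key)
    anon_b = toy_prefix_preserving(b, key)
    # Both values printed on one line are equal by construction.
    print(common_prefix_len(a, b), common_prefix_len(anon_a, anon_b))
```

Because the pseudo-random bit at position i depends only on the preceding i bits, addresses that agree on a prefix are transformed identically over that prefix and diverge at the same bit as the originals.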

To learn more about flow monitoring in general, or about any aspect of performing sound flow measurements, we refer to the tutorial at http://dx.doi.org/10.1109/COMST.2014.2321898.

Log files

The log files have been gathered from machines running various Linux operating systems. The following post-processing operations have been performed:

  1. Merging: On some machines, the authentication logs were distributed over <hostname>.messages and <hostname>.warn. We have merged those log files, sorted them again (if necessary), and removed any introduced duplicates.
  2. Renaming: The file names have been changed from <hostname>.<extension> into <anonymized_IP_address>.<extension>. As such, the log files can easily be correlated with the flow data.
  3. Anonymization: We have replaced usernames by "XXXXX" (except for hosts running Kippo in Dataset 1, as described under Scripts) and hostnames by the anonymized IP address of the host in question.
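
Since each log file is named <anonymized_IP_address>.<extension>, the address needed to look up the matching flow records can be read directly from the file name. The following is a minimal sketch, assuming IPv4 addresses in dotted notation; the function name and the example file names are hypothetical.

```python
import ipaddress
from pathlib import Path


def anonymized_ip_from_log_name(file_name):
    """Extract the IPv4 address from '<anonymized_IP_address>.<extension>'."""
    parts = Path(file_name).name.split(".")
    if len(parts) < 5:
        raise ValueError(f"unexpected log file name: {file_name}")
    # The first four dot-separated fields form the address; the rest is the extension.
    return ipaddress.IPv4Address(".".join(parts[:4]))


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(anonymized_ip_from_log_name("logs/10.20.30.40.messages"))
```

The returned address can then be used as the source or destination address when selecting flow records for the same host.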

Caveats

  • Note that SSH activity from several (anonymized) IP address ranges can, in principle, be found only in the log files and not in the flow data. These are internal IP address ranges whose traffic is not considered for flow export:
    • 161.166.0.0/16
    • 178.135.128.0/18
    • 195.212.40.0/24
    • 195.212.41.0/24
  • Several hosts whose authentication logs are included in the datasets record hostnames rather than IP addresses in their log files. During anonymization, we resolved these hostnames and anonymized the resulting IP addresses. The IP address of such a host may therefore have changed between the moment the data was generated and the moment it was anonymized.
  • The log files have been exported by several daemons: mostly Kippo and OpenSSH (Debian) in Dataset 1, and a different mixture of daemons in Dataset 2. Whether a log file was exported by Kippo or by another daemon can be determined from its file name.
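
When correlating log entries with flow records, it helps to first set aside addresses from the internal ranges listed above, since no flow data is expected for them. The following is a minimal sketch of such a check; the function name is hypothetical.

```python
import ipaddress

# Anonymized internal ranges listed in the caveats; their traffic is not exported as flows.
LOG_ONLY_RANGES = [
    ipaddress.ip_network("161.166.0.0/16"),
    ipaddress.ip_network("178.135.128.0/18"),
    ipaddress.ip_network("195.212.40.0/24"),
    ipaddress.ip_network("195.212.41.0/24"),
]


def is_log_only(addr):
    """Return True if the (anonymized) address lies in a range without flow export."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in LOG_ONLY_RANGES)
```

An address for which is_log_only returns True is not expected to have counterparts in the flow data, so a missing match for it does not indicate a measurement problem.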

Scripts

The following scripts have been used for processing the datasets:

  • anonymizer.pl (http://traces.simpleweb.org/ssh_datasets/anonymizer.pl): Anonymizes log files and flow data (in nfdump binary format) using CryptoPAn. It uses the same key for both the log files and the flow data, so that the log files can be matched with the flow data. More precisely, it performs the following actions:
    • The full directory structure of log files and flow data is copied. Directory names consisting of hostnames are resolved and replaced by the anonymized IP address corresponding to the resolved hostname.
    • All hostnames in the log files are resolved and replaced by the anonymized IP address corresponding to the resolved hostname.
    • Usernames in the logs of hosts running Kippo in Dataset 1 are retained, for example to allow for future analysis of the use of login credentials on honeypots. Usernames in logs exported by other daemons in Dataset 1 and all usernames in Dataset 2 are replaced by 'XXXXX'.
    • All IP addresses in the log files are replaced by their anonymized counterparts.
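
The username and IP address replacement steps can be sketched as follows. This is not the code of anonymizer.pl: it is a simplified Python illustration that assumes OpenSSH-style lines of the form "... for <user> from <address> ...", takes the address mapping as a caller-supplied function, and omits hostname resolution. All names in it are hypothetical.

```python
import re

# Assumed line shape: "... for [invalid user ]<user> from <IPv4 address> ..."
USER_RE = re.compile(r"(for (?:invalid user )?)(\S+)( from )")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def anonymize_line(line, anonymize_ip, mask_usernames=True):
    """Mask the username (optionally) and map every IPv4 address in a log line.

    anonymize_ip is any callable mapping an address string to its anonymized form.
    """
    if mask_usernames:
        line = USER_RE.sub(r"\g<1>XXXXX\g<3>", line)
    return IPV4_RE.sub(lambda match: anonymize_ip(match.group(0)), line)


if __name__ == "__main__":
    def fixed_mapping(address):
        # Stand-in for a real prefix-preserving mapping, for illustration only.
        return "10.0.0.1"

    sample = "Failed password for invalid user admin from 192.0.2.10 port 4242 ssh2"
    print(anonymize_line(sample, fixed_mapping))
    # prints "Failed password for invalid user XXXXX from 10.0.0.1 port 4242 ssh2"
```

Passing mask_usernames=False mirrors the case in which usernames are retained, as described for the Kippo hosts in Dataset 1.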

Contact

In case of any questions or comments, please contact the authors of the paper mentioned above at r.j.hofstede [at] utwente.nl.