SSH datasets

From SimpleWiki
Revision as of 07:59, 5 August 2014 by Hofstede (talk | contribs)

This page provides access to the materials accompanying the following publication:

SSH Compromise Detection using NetFlow/IPFIX
Rick Hofstede, Luuk Hendriks, Anna Sperotto, Aiko Pras. In: ACM Computer Communication Review, 2014 (to appear).

More information regarding this publication can be found here. Any use of the materials provided on this page, be it the datasets, the scripts, the paper itself, or any content derived from them, should reference this publication.

Datasets

Name         File Size    md5
Flow data    x GB         xxx
Log files    x GB         xxx

Some results derived from these data can be found here.
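
Since the table lists an md5 checksum per file, a download can be checked before use. The following is a minimal sketch in Python, assuming the archive has been saved locally; the function name md5_of_file and the chunk size are illustrative choices, not part of the published materials.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that multi-gigabyte archives
    do not have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns b"" (end of file)
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can then be compared, case-insensitively, with the md5 value listed for the corresponding file.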

Flow data

The flow data has been exported by a Cisco Catalyst 6500 with SUP2T supervisor module (PFC4, MSFC 5), and collected using nfcapd. Neither packet sampling nor flow sampling has been applied. The following post-processing operations have, however, been performed:

  1. Filtering: Only SSH data has been selected, using the following nfdump filter: port 22 and proto tcp.
  2. Anonymization: nfanon has been used to anonymize the flow data in a prefix-preserving manner. More precisely, nfanon relies on the CryptoPAn (Cryptography-based Prefix-preserving Anonymization) module.
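
To illustrate what the two steps above mean for the records in the dataset, here is a minimal sketch in Python. It is not the nfdump or nfanon implementation: the Flow record type and all function names are hypothetical, and the anonymizer uses HMAC-SHA256 as a stand-in pseudorandom function, so its output will not match addresses produced by nfanon. It only demonstrates the prefix-preserving property: two addresses that share their first k bits are mapped to addresses that also share exactly their first k bits.

```python
import hashlib
import hmac
import ipaddress
from typing import NamedTuple


class Flow(NamedTuple):
    """Hypothetical minimal flow record, for illustration only."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int  # IP protocol number, 6 = TCP


def is_ssh_flow(flow):
    """Mirror the filter `port 22 and proto tcp` (port on either side)."""
    return flow.proto == 6 and 22 in (flow.src_port, flow.dst_port)


def anonymize_ipv4(address, key):
    """Prefix-preserving anonymization sketch (assumption: HMAC-SHA256 as PRF).

    Output bit i is input bit i XOR a pseudorandom bit that depends
    only on the first i input bits, so equal prefixes stay equal.
    """
    addr = int(ipaddress.IPv4Address(address))
    result = 0
    for i in range(32):
        prefix = addr >> (32 - i)  # the first i bits (0 when i == 0)
        message = bytes([i]) + prefix.to_bytes(4, "big")
        flip = hmac.new(key, message, hashlib.sha256).digest()[0] & 1
        bit = (addr >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


def anonymize_flow(flow, key):
    """Anonymize both endpoint addresses of a flow with the same key."""
    return flow._replace(
        src_ip=anonymize_ipv4(flow.src_ip, key),
        dst_ip=anonymize_ipv4(flow.dst_ip, key),
    )


def common_prefix_len(a, b):
    """Number of leading bits two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()
```

Because the mapping is deterministic for a given key, hosts in the same subnet remain in the same (anonymized) subnet, which is what keeps subnet-level analysis possible on the anonymized flow data.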

Log files

The log files have been gathered from hosts running various Linux operating systems. The following post-processing operations have, however, been performed:

  1. Merging: On some machines, the authentication logs were distributed over <hostname>.messages and <hostname>.warn. We have merged those log files, sorted them again (if necessary), and removed any duplicates introduced by the merge.
  2. Renaming: The file names have been changed from <hostname>.<extension> to <anonymized_IP_address>.<extension>. As such, the log files can easily be correlated with the flow data.
  3. Anonymization: We have replaced all usernames with "XXXXX" and all hostnames with the anonymized IP address of the host in question.
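
The merging and anonymization steps above can be sketched roughly as follows in Python. This is not the script used to produce the dataset. It assumes classic syslog lines that start with a 15-character timestamp without a year (for example "Aug  5 07:59:01"), and it assumes OpenSSH-style messages of the form "... for [invalid user] <name> from ..."; the function names, the year parameter, and the regular expression are all hypothetical.

```python
import re
from datetime import datetime

# Assumption: usernames appear as "for <name> from" or
# "for invalid user <name> from", as in typical sshd messages.
USER_PATTERN = re.compile(r"( for (?:invalid user )?)\S+( from )")


def syslog_time(line, year):
    """Parse the leading syslog timestamp; the year must be supplied."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def merge_logs(messages_lines, warn_lines, year):
    """Merge two lists of log lines, sort by time, drop exact duplicates."""
    merged = sorted(
        messages_lines + warn_lines,
        key=lambda line: syslog_time(line, year),
    )
    seen = set()
    unique = []
    for line in merged:
        if line not in seen:  # keep only the first copy of each line
            seen.add(line)
            unique.append(line)
    return unique


def anonymize_line(line, hostname, anonymized_ip):
    """Mask the username and replace the hostname by the anonymized IP."""
    line = USER_PATTERN.sub(r"\1XXXXX\2", line)
    return line.replace(hostname, anonymized_ip)
```

Writing the result to a file named after the anonymized IP address then gives the naming scheme described in step 2, so that a log file can be matched against flow records with that address.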

Scripts