Labeled Dataset for Intrusion Detection

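This page describes the database structure of the labeled flow data set presented in the paper "A Labeled Data Set For Flow-based Intrusion Detection" by A. Sperotto, R. Sadre, D. F. van Vliet, and A. Pras (IPOM 2009). The full BibTeX entry is given in the Acceptable Use Policy.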

This dataset contains the flow information of traffic collected at a honeypot hosted in the network of the University of Twente in September 2008. The honeypot was directly connected to the Internet and ran several typical network services, such as ftp and ssh. By design, the honeypot captured only "unexpected" traffic; hence, most of the recorded traffic can be assumed to be at least suspicious.

The dataset is provided as a gzipped SQL script, generated from a MySQL database. To recreate the database at your site, create a database schema in your database server and execute the uncompressed script in it.

Acceptable Use Policy

  • The user must not attempt to reverse engineer the anonymization procedure used to protect the data.
  • Users who notice vulnerabilities in the anonymization procedure are kindly asked to inform the repository administrators.
  • If you use the dataset for your research, please cite the above-mentioned paper in your publications. The BibTeX entry is:
@inproceedings{sperottoIPOM2009,
        volume = {5843},
         month = {October},
        author = {A. Sperotto and R. Sadre and D. F. van Vliet and A. Pras},
        series = {Lecture Notes in Computer Science},
     booktitle = {Proceedings of the 9th IEEE International Workshop on IP Operations and Management, IPOM 2009, Venice, Italy},
         title = {A Labeled Data Set For Flow-based Intrusion Detection},
     publisher = {Springer Verlag},
         pages = {39--50},
          year = {2009}
}

More Information

Feel free to send us your error reports, comments, questions, etc.:

  • Anna Sperotto: a.sperotto@utwente.nl
  • Ramin Sadre: r.sadre@utwente.nl
  • Aiko Pras: a.pras@utwente.nl

Remarks

  • IP addresses have been anonymized. The anonymization does not preserve the ordering of the addresses. Hence, the flows of the scans may not appear sequentially in the database.
  • Depending on your database queries, you may need to add other indexes to speed them up.
  • The (encrypted) honeypot IP address is 146.217.254.148. Its numerical representation in the database is 2463760020.
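The mapping between the dotted notation and the numerical representation stored in the database can be checked outside MySQL as well. The following is a minimal Python sketch using the standard `ipaddress` module; the helper names `ip_to_int` and `int_to_ip` are illustrative and not part of the dataset.

```python
import ipaddress

def ip_to_int(dotted):
    """Convert a dotted IPv4 string to its 32-bit integer form."""
    return int(ipaddress.IPv4Address(dotted))

def int_to_ip(value):
    """Convert a 32-bit integer back to a dotted IPv4 string."""
    return str(ipaddress.IPv4Address(value))

# The anonymized honeypot address given in the remarks above.
HONEYPOT_IP = "146.217.254.148"

honeypot_int = ip_to_int(HONEYPOT_IP)
print(honeypot_int)             # 2463760020
print(int_to_ip(honeypot_int))  # 146.217.254.148
```

This matches the value 2463760020 quoted above, since 146·2^24 + 217·2^16 + 254·2^8 + 148 = 2463760020.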

Database Description

  • Table "flows":
 * Column "id": the ID of the flow (primary key, referenced in table "flow_alert")
 * Column "src_ip": anonymized source IP address (encoded as a 32-bit number; decode with inet_ntoa()).
 * Column "dst_ip": anonymized destination IP address (encoded as a 32-bit number; decode with inet_ntoa()).
 * Column "packets": number of packets in flow
 * Column "octets": number of bytes in flow
 * Column "start_time": UNIX start time (number of seconds)
 * Column "start_msec": start time (milliseconds part)
 * Column "end_time": UNIX end time (number of seconds)
 * Column "end_msec": end time (milliseconds part)
 * Column "src_port": source port number
 * Column "dst_port": destination port number
 * Column "tcp_flags": TCP flags obtained by ORing the TCP flags field of all packets of the flow
 * Column "prot": IP protocol number

An entry in the table "flows" represents a flow as described in section 3.1 of the paper. The definition of the columns follows the NetFlow v5 specification.
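As an illustration of how the time and flag columns can be interpreted, the sketch below computes a flow duration from the seconds and milliseconds parts and splits a cumulative "tcp_flags" value into flag names. It assumes the usual TCP header bit assignment (FIN=0x01, SYN=0x02, RST=0x04, PSH=0x08, ACK=0x10, URG=0x20); the function names and the sample values are hypothetical and not taken from the dataset.

```python
# Standard TCP header flag bits (assumed to be how "tcp_flags" is encoded).
TCP_FLAG_BITS = [
    (0x01, "FIN"),
    (0x02, "SYN"),
    (0x04, "RST"),
    (0x08, "PSH"),
    (0x10, "ACK"),
    (0x20, "URG"),
]

def decode_tcp_flags(tcp_flags):
    """Return the names of all flags set in the ORed flags value."""
    return [name for bit, name in TCP_FLAG_BITS if tcp_flags & bit]

def flow_duration_ms(start_time, start_msec, end_time, end_msec):
    """Duration of a flow in milliseconds, combining both time parts."""
    return (end_time - start_time) * 1000 + (end_msec - start_msec)

# Hypothetical row values, for illustration only.
print(decode_tcp_flags(0x1B))                               # ['FIN', 'SYN', 'PSH', 'ACK']
print(flow_duration_ms(1222173600, 950, 1222173602, 120))   # 1170
```

Because the flags are ORed over all packets of the flow, the decoded list shows which flags occurred at least once, not their order or count.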

  • Table "alert_type":
 * Column "id": the ID of the alert type (referenced in table "alerts")
 * Column "description": textual description of the alert type

An entry in the table "alert_type" combines the information of the "Serv" field and of the "Type" field in the alert tuple "A" (see section 3.3 of the paper). For example, "Serv=ICMP" with "Type=SIDE_EFFECT" becomes alert type number 9 with description "icmp_sideeffect".

  • Table "alerts":
 * Column "id": the ID of the alert (primary key, referenced in tables "flow_alert", "alert_cluster", "alert_causality")
 * Column "automated": 1 if automated, 0 if not automated, null if unknown.
 * Column "succeded": 1 if succeeded, 0 if not succeeded, null if unknown.
 * Column "description": textual description of the alert, as found in the log files.
 * Column "timestamp": timestamp of the alert. It can be null in case of cluster alerts.
 * Column "type": alert type (reference to table "alert_type"). Cluster alerts are of type 1, 3, or 5.

An entry in the table "alerts" represents an alert as defined in section 3.3 of the paper or a cluster alert as defined in section 3.4 of the paper.

  • Table "flow_alert":
 * Column "flowid": ID of the flow, referencing table "flows"
 * Column "alertid": ID of the alert, referencing table "alerts"

An entry in the table "flow_alert" assigns one or more flows to one alert. Note that some flows in table "flows" are not associated with any alerts because we regarded them as non-suspicious.

  • Table "alert_cluster":
 * Column "parent": ID of the cluster alert, referencing table "alerts"
 * Column "child": ID of the "basic" alert, referencing table "alerts"

"Basic" alerts in the table "alerts" (type 2, 4, or 6) can be grouped into cluster alerts (type 1, 3, or 5), as described in section 3.4 of the paper. The table "alerts" stores both the basic alerts and the cluster alerts. The logical connection between the basic alerts and "their" cluster alert is represented by the entries in the table "alert_cluster".

  • Table "alert_causality":
 * Column "parent": ID of the causing alert, referencing table "alerts"
 * Column "child": ID of the caused alert, referencing table "alerts"

Network activity labeled as suspicious can cause other suspicious activities. For example, if an attacker logs into our honeypot via ssh and performs an ssh-scan of an outside network from the honeypot, we create the following database entries:

  - an alert for the attacker login (in table "alerts")
  - alerts for the scan connections (in table "alerts")
  - a cluster alert for the whole scan activity (in table "alerts")
  - an entry in table "alert_causality", logically connecting the alert of the login with the cluster alert of the scan.

In this example, the login alert is the "parent" and the cluster alert is the "child".
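The login-and-scan example can be reproduced on a toy scale. The sketch below builds a reduced version of the three alert tables in an in-memory SQLite database (not the MySQL dump itself) and follows the causality link from the login alert to the basic alerts of the caused scan. All IDs, descriptions, and type numbers are made up for illustration, only the columns needed here are included, and the helper `caused_basic_alerts` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE alerts (id INTEGER PRIMARY KEY, description TEXT, type INTEGER);
CREATE TABLE alert_cluster (parent INTEGER, child INTEGER);
CREATE TABLE alert_causality (parent INTEGER, child INTEGER);

-- Hypothetical rows: one login alert, one cluster alert, two scan alerts.
INSERT INTO alerts VALUES (1, 'ssh login', 2);
INSERT INTO alerts VALUES (10, 'ssh scan (cluster alert)', 1);
INSERT INTO alerts VALUES (11, 'scan connection', 2);
INSERT INTO alerts VALUES (12, 'scan connection', 2);

-- The cluster alert 10 groups the basic scan alerts 11 and 12.
INSERT INTO alert_cluster VALUES (10, 11);
INSERT INTO alert_cluster VALUES (10, 12);

-- The login (parent) caused the scan cluster (child).
INSERT INTO alert_causality VALUES (1, 10);
""")

def caused_basic_alerts(connection, cause_id):
    """IDs of basic alerts in all cluster alerts caused by alert cause_id."""
    rows = connection.execute(
        """
        SELECT basic.id
        FROM alert_causality ca
        JOIN alert_cluster ac ON ac.parent = ca.child
        JOIN alerts basic ON basic.id = ac.child
        WHERE ca.parent = ?
        ORDER BY basic.id
        """,
        (cause_id,),
    ).fetchall()
    return [row[0] for row in rows]

print(caused_basic_alerts(conn, 1))  # [11, 12]
```

The same join structure should carry over to the MySQL database, with the table "flow_alert" added when the flows of the caused activity are needed.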

Query Examples

Get all HTTP flows to the honeypot:
  SELECT * FROM flows
    WHERE prot=6 AND dst_port=80 AND dst_ip=inet_aton('146.217.254.148');

Get all flows with alert type 9:
  SELECT f.* FROM flows f,alerts a,flow_alert fa
    WHERE f.id=fa.flowid AND a.id=fa.alertid AND a.type=9;

Get all alerts that belong to a specific cluster alert 65616:
  SELECT a.* FROM alerts a,alert_cluster c
    WHERE c.parent=65616 AND c.child=a.id;

Get all flows that belong to a specific cluster alert 65616:
  SELECT f.* FROM flows f,alerts a,alert_cluster c,flow_alert fa
    WHERE c.parent=65616 AND c.child=a.id AND fa.alertid=a.id AND fa.flowid=f.id;