Data

Data Sources

Getting the data

The data is not checked into the git repository. To download it, run the following commands in a terminal from the project folder:

# Download data, 10 percent only
wget -N http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz
# Download headers
wget -N http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names
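# Move data to folder (create the folder first if it does not exist)
mkdir -p data
mv kddcup* data
# Unzip; "echo n" declines the overwrite prompt if the unzipped file already exists
echo n | gunzip data/kddcup.data_10_percent.gz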
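
Once the files are in the data folder, they can be read with a few lines of Python. The following is a minimal sketch and not part of the original project. It assumes that kddcup.names holds the comma-separated label list on its first line and one "name: type." pair on each following line, and that every data row ends with its label followed by a period (for example "normal."). The helper names parse_names and read_connections are hypothetical.

```python
import csv


def parse_names(text):
    """Return (labels, feature_names) parsed from the contents of kddcup.names.

    Assumed layout: the first non-empty line is a comma-separated label list
    ending in a period; each later line looks like "duration: continuous.".
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    labels = lines[0].rstrip(".").split(",")
    features = [line.split(":", 1)[0].strip() for line in lines[1:]]
    return labels, features


def read_connections(lines, features):
    """Yield one dict per connection; the label is stored under "label".

    `lines` is any iterable of CSV text lines, e.g. an open file handle.
    """
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        *values, label = row
        record = dict(zip(features, values))
        record["label"] = label.rstrip(".")  # drop the assumed trailing period
        yield record


# Usage, assuming the download commands above have been run:
#
#   with open("data/kddcup.names") as f:
#       labels, features = parse_names(f.read())
#   with open("data/kddcup.data_10_percent", newline="") as f:
#       first = next(read_connections(f, features))
```

All values are kept as strings here; converting the continuous features to numbers is left to the caller.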

Dataset Description

"This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99 The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment."

source: KDD Cup 1999 Data

Each row in our dataset describes one connection. Each connection has a label: normal for rows that represent normal connections, or the type of cyberattack for rows that represent attacks.

The dataset contains 23 labels in total, namely 22 attack types plus normal:

  • back
  • buffer_overflow
  • ftp_write
  • guess_passwd
  • imap
  • ipsweep
  • land
  • loadmodule
  • multihop
  • neptune
  • nmap
  • normal
  • perl
  • phf
  • pod
  • portsweep
  • rootkit
  • satan
  • smurf
  • spy
  • teardrop
  • warezclient
  • warezmaster

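The attacks neptune and smurf are denial-of-service (DoS) attacks, as are back, land, pod and teardrop. The remaining attacks fall into three further categories: probing (ipsweep, nmap, portsweep, satan), remote to local, or R2L (ftp_write, guess_passwd, imap, multihop, phf, spy, warezclient, warezmaster), and user to root, or U2R (buffer_overflow, loadmodule, perl, rootkit).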
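
For analysis it is often convenient to collapse the 23 labels into a handful of categories. The sketch below assumes the four-category grouping used in the KDD Cup 1999 task description (DoS, probe, R2L, U2R) plus normal. The names ATTACK_CATEGORIES, categorize and category_counts are hypothetical and not part of the original project.

```python
from collections import Counter

# Grouping assumed from the KDD Cup 1999 task description.
ATTACK_CATEGORIES = {
    "dos": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "probe": ["ipsweep", "nmap", "portsweep", "satan"],
    "r2l": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
    "u2r": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
}

# Invert the grouping into a label -> category lookup and add "normal".
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}
LABEL_TO_CATEGORY["normal"] = "normal"


def categorize(label):
    """Map a raw label such as "smurf." to its category.

    A trailing period is stripped first (assumed raw-file format).
    Unknown labels raise KeyError.
    """
    return LABEL_TO_CATEGORY[label.rstrip(".")]


def category_counts(labels):
    """Count how many connections fall into each category."""
    return Counter(categorize(label) for label in labels)
```

Calling category_counts on the label column of the dataset gives a quick view of how the connections are distributed across categories.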

Features Description

feature name  description  type
duration  length (number of seconds) of the connection  continuous
protocol_type  type of the protocol, e.g. tcp, udp, etc.  discrete
service  network service on the destination, e.g. http, telnet, etc.  discrete
src_bytes  number of data bytes from source to destination  continuous
dst_bytes  number of data bytes from destination to source  continuous
flag  normal or error status of the connection  discrete
land  1 if connection is from/to the same host/port; 0 otherwise  discrete
wrong_fragment  number of "wrong" fragments  continuous
urgent  number of urgent packets  continuous

Table 1: Basic features of individual TCP connections.

feature name  description  type
hot  number of "hot" indicators  continuous
num_failed_logins  number of failed login attempts  continuous
logged_in  1 if successfully logged in; 0 otherwise  discrete
num_compromised  number of "compromised" conditions  continuous
root_shell  1 if root shell is obtained; 0 otherwise  discrete
su_attempted  1 if "su root" command attempted; 0 otherwise  discrete
num_root  number of "root" accesses  continuous
num_file_creations  number of file creation operations  continuous
num_shells  number of shell prompts  continuous
num_access_files  number of operations on access control files  continuous
num_outbound_cmds  number of outbound commands in an ftp session  continuous
is_hot_login  1 if the login belongs to the "hot" list; 0 otherwise  discrete
is_guest_login  1 if the login is a "guest" login; 0 otherwise  discrete

Table 2: Content features within a connection suggested by domain knowledge.

feature name  description  type
count  number of connections to the same host as the current connection in the past two seconds  continuous
Note: The following features refer to these same-host connections.
serror_rate  % of connections that have "SYN" errors  continuous
rerror_rate  % of connections that have "REJ" errors  continuous
same_srv_rate  % of connections to the same service  continuous
diff_srv_rate  % of connections to different services  continuous
srv_count  number of connections to the same service as the current connection in the past two seconds  continuous
Note: The following features refer to these same-service connections.
srv_serror_rate  % of connections that have "SYN" errors  continuous
srv_rerror_rate  % of connections that have "REJ" errors  continuous
srv_diff_host_rate  % of connections to different hosts  continuous

Table 3: Traffic features computed using a two-second time window.

source: KDD Cup 1999 Data-Task description
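
To make the two-second window in Table 3 concrete, the sketch below computes count, srv_count and same_srv_rate for a small list of connections. It is an illustration only: the dataset already ships these features precomputed and does not include the timestamps or host addresses needed to derive them, so the input tuples here are hypothetical. Two further assumptions are made: the current connection is counted as part of its own window, and same_srv_rate is returned as a fraction between 0 and 1. The function name window_features is also hypothetical.

```python
from collections import deque

WINDOW_SECONDS = 2.0


def window_features(connections):
    """Compute (count, srv_count, same_srv_rate) for each connection.

    `connections` is an iterable of (timestamp, dst_host, service) tuples
    sorted by timestamp. The current connection is included in its own
    window (an assumption of this sketch).
    """
    window = deque()
    results = []
    for ts, host, service in connections:
        # Drop connections that started more than two seconds ago.
        while window and ts - window[0][0] > WINDOW_SECONDS:
            window.popleft()
        window.append((ts, host, service))

        same_host = [c for c in window if c[1] == host]
        count = len(same_host)
        srv_count = sum(1 for c in window if c[2] == service)
        # Fraction of the same-host connections that use the same service.
        same_srv_rate = sum(1 for c in same_host if c[2] == service) / count
        results.append((count, srv_count, same_srv_rate))
    return results
```

The deque keeps only the connections inside the window, so each connection is appended and removed at most once. The per-connection sums still scan the whole window, which is fine for a small example.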