Exploratory Data Analysis

In this notebook we inspect the available data and perform a univariate analysis of the feature distributions.

In [4]:

%load_ext autoreload

In [5]:

%autoreload
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from intrusion_detection.load_input_data import load_df
from intrusion_detection.preprocessing.preprocessing import remove_dot_from_attack_type_classes

In [6]:

from functools import partial
from IPython.display import HTML, display, Markdown

def header(text, level):
    display(HTML(f"<h{level}>{text}</h{level}>"))

for level in range(1, 6):
    globals()[f"h{level}"] = partial(header, level=level)

In [7]:

df = load_df(
    file_path="../../../data/kddcup.data_10_percent",
    header_file="../../../data/kddcup.names"
)
df = remove_dot_from_attack_type_classes(df)

In [8]:

df.head()

Out[8]:

	protocol_type	service	flag	src_bytes	dst_bytes	...	dst_host_srv_count	dst_host_same_srv_rate	dst_host_same_src_port_rate	attack_type
0	tcp	http	SF	181	5450	...	9	1.0	0.11	normal
1	tcp	http	SF	239	486	...	19	1.0	0.05	normal
2	tcp	http	SF	235	1337	...	29	1.0	0.03	normal
3	tcp	http	SF	219	1337	...	39	1.0	0.03	normal
4	tcp	http	SF	217	2032	...	49	1.0	0.02	normal

5 rows × 42 columns

In [5]:

target_columns = ["attack_type"]
feature_columns = [col for col in df.columns if col not in target_columns]
feature_columns_dtypes = df[feature_columns].dtypes

NaNs

In [8]:

null_values = df.isnull().sum()
null_values[null_values>0]

Out[8]:

Series([], dtype: int64)

The dataset does not contain NaN values.

Categorical Features

List of categorical features:

In [9]:

categorical_features = feature_columns_dtypes[
    feature_columns_dtypes == object
].index.to_list()

nl = "\n- "
display(Markdown(f"The categorical Features in our dataset are: {nl + nl.join(categorical_features)}"))

The categorical Features in our dataset are:

protocol_type
service
flag

Countplots of each categorical feature

In [10]:

for feat in categorical_features:
    h5(feat)
    ax = sns.countplot(data=df, y=feat)
    ax.bar_label(ax.containers[0])
    plt.show()

[Count plots: one each for protocol_type, service, and flag]

Considerations

From the plots above we notice that, for the first categorical feature protocol_type, each of the 3 categories is well represented in our dataset. This suggests that we can encode protocol_type using one-hot encoding. For the other two categorical features, service and flag, some categories appear very rarely, sometimes in less than 1% of the cases. For service and flag we can group rare categories into a single bucket and then use one-hot encoding. Moreover, to avoid losing information, we can also add frequency encodings for service and flag.
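
The encoding strategy described above could look roughly like the following. This is a minimal sketch on toy data, not the preprocessing used by the project: the helper names `bucket_rare_categories` and `frequency_encode`, the 1% threshold, and the `"other"` label are assumptions made for illustration.

```python
import pandas as pd


def bucket_rare_categories(series, min_share=0.01, other_label="other"):
    """Replace categories whose relative frequency is below min_share."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep frequent categories as they are, relabel the rare ones.
    return series.where(~series.isin(rare), other_label)


def frequency_encode(series):
    """Map each category to its relative frequency in the series."""
    return series.map(series.value_counts(normalize=True))


# Toy data standing in for the `service` column (counts are illustrative only).
toy = pd.DataFrame({"service": ["http"] * 150 + ["smtp"] * 49 + ["tftp_u"]})

# The frequency encoding is computed before bucketing, so rare categories
# keep their own frequency even after being merged into "other".
toy["service_freq"] = frequency_encode(toy["service"])
toy["service_bucketed"] = bucket_rare_categories(toy["service"])
one_hot = pd.get_dummies(toy["service_bucketed"], prefix="service")
```

In a modelling pipeline, the category shares would normally be estimated on the training split only and then reused on validation and test data.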

Numeric features

In [11]:

numerical_features = feature_columns_dtypes[
    feature_columns_dtypes != object
].index.to_list()

nl = "\n- "
display(Markdown(f"The numerical Features in our dataset are: {nl + nl.join(numerical_features)}"))

The numerical Features in our dataset are:

duration
src_bytes
dst_bytes
land
wrong_fragment
urgent
hot
num_failed_logins
logged_in
num_compromised
root_shell
su_attempted
num_root
num_file_creations
num_shells
num_access_files
num_outbound_cmds
is_host_login
is_guest_login
count
srv_count
serror_rate
srv_serror_rate
rerror_rate
srv_rerror_rate
same_srv_rate
diff_srv_rate
srv_diff_host_rate
dst_host_count
dst_host_srv_count
dst_host_same_srv_rate
dst_host_diff_srv_rate
dst_host_same_src_port_rate
dst_host_srv_diff_host_rate
dst_host_serror_rate
dst_host_srv_serror_rate
dst_host_rerror_rate
dst_host_srv_rerror_rate

In [12]:

df.describe(percentiles=[0.01, 0.05, .25, .5, .75, 0.95, 0.99]).T

Out[12]:

	count	mean	std	1%	5%	25%	50%	75%	95%	99%	max
duration	494021.0	47.979302	707.746472	0.00	0.00	0.00	0.0	0.00	0.00	88.000	58329.0
src_bytes	494021.0	3025.610296	988218.101050	0.00	0.00	45.00	520.0	1032.00	1032.00	2394.800	693375640.0
dst_bytes	494021.0	868.532425	33040.001252	0.00	0.00	0.00	0.0	0.00	2417.00	12260.000	5155468.0
land	494021.0	0.000045	0.006673	0.00	0.00	0.00	0.0	0.00	0.00	0.000	1.0
wrong_fragment	494021.0	0.006433	0.134805	0.00	0.00	0.00	0.0	0.00	0.00	0.000	3.0
urgent	494021.0	0.000014	0.005510	0.00	0.00	0.00	0.0	0.00	0.00	0.000	3.0
hot	494021.0	0.034519	0.782103	0.00	0.00	0.00	0.0	0.00	0.00	0.000	30.0
num_failed_logins	494021.0	0.000152	0.015520	0.00	0.00	0.00	0.0	0.00	0.00	0.000	5.0
logged_in	494021.0	0.148247	0.355345	0.00	0.00	0.00	0.0	0.00	1.00	1.000	1.0
num_compromised	494021.0	0.010212	1.798326	0.00	0.00	0.00	0.0	0.00	0.00	0.000	884.0
root_shell	494021.0	0.000111	0.010551	0.00	0.00	0.00	0.0	0.00	0.00	0.000	1.0
su_attempted	494021.0	0.000036	0.007793	0.00	0.00	0.00	0.0	0.00	0.00	0.000	2.0
num_root	494021.0	0.011352	2.012718	0.00	0.00	0.00	0.0	0.00	0.00	0.000	993.0
num_file_creations	494021.0	0.001083	0.096416	0.00	0.00	0.00	0.0	0.00	0.00	0.000	28.0
num_shells	494021.0	0.000109	0.011020	0.00	0.00	0.00	0.0	0.00	0.00	0.000	2.0
num_access_files	494021.0	0.001008	0.036482	0.00	0.00	0.00	0.0	0.00	0.00	0.000	8.0
num_outbound_cmds	494021.0	0.000000	0.000000	0.00	0.00	0.00	0.0	0.00	0.00	0.000	0.0
is_host_login	494021.0	0.000000	0.000000	0.00	0.00	0.00	0.0	0.00	0.00	0.000	0.0
is_guest_login	494021.0	0.001387	0.037211	0.00	0.00	0.00	0.0	0.00	0.00	0.000	1.0
count	494021.0	332.285690	213.147412	1.00	1.00	117.00	510.0	511.00	511.00	511.000	511.0
srv_count	494021.0	292.906557	246.322817	1.00	1.00	10.00	510.0	511.00	511.00	511.000	511.0
serror_rate	494021.0	0.176687	0.380717	0.00	0.00	0.00	0.0	0.00	1.00	1.000	1.0
srv_serror_rate	494021.0	0.176609	0.381017	0.00	0.00	0.00	0.0	0.00	1.00	1.000	1.0
rerror_rate	494021.0	0.057433	0.231623	0.00	0.00	0.00	0.0	0.00	1.00	1.000	1.0
srv_rerror_rate	494021.0	0.057719	0.232147	0.00	0.00	0.00	0.0	0.00	1.00	1.000	1.0
same_srv_rate	494021.0	0.791547	0.388189	0.01	0.03	1.00	1.0	1.00	1.00	1.000	1.0
diff_srv_rate	494021.0	0.020982	0.082205	0.00	0.00	0.00	0.0	0.00	0.07	0.208	1.0
srv_diff_host_rate	494021.0	0.028997	0.142397	0.00	0.00	0.00	0.0	0.00	0.14	1.000	1.0
dst_host_count	494021.0	232.470778	64.745380	3.00	33.00	255.00	255.0	255.00	255.00	255.000	255.0
dst_host_srv_count	494021.0	188.665670	106.040437	1.00	3.00	46.00	255.0	255.00	255.00	255.000	255.0
dst_host_same_srv_rate	494021.0	0.753780	0.410781	0.00	0.02	0.41	1.0	1.00	1.00	1.000	1.0
dst_host_diff_srv_rate	494021.0	0.030906	0.109259	0.00	0.00	0.00	0.0	0.04	0.08	0.820	1.0
dst_host_same_src_port_rate	494021.0	0.601935	0.481309	0.00	0.00	0.00	1.0	1.00	1.00	1.000	1.0
dst_host_srv_diff_host_rate	494021.0	0.006684	0.042133	0.00	0.00	0.00	0.0	0.00	0.03	0.150	1.0
dst_host_serror_rate	494021.0	0.176754	0.380593	0.00	0.00	0.00	0.0	0.00	1.00	1.000	1.0
dst_host_srv_serror_rate	494021.0	0.176443	0.380919	0.00	0.00	0.00	0.0	0.00	1.00	1.000	1.0
dst_host_rerror_rate	494021.0	0.058118	0.230590	0.00	0.00	0.00	0.0	0.00	1.00	1.000	1.0
dst_host_srv_rerror_rate	494021.0	0.057412	0.230140	0.00	0.00	0.00	0.0	0.00	1.00	1.000	1.0

By looking at the table above, we notice that the features:

src_bytes
dst_bytes

have maximum values far above their 95th and 99th percentiles (for example, src_bytes has a maximum of 693375640.0 against a 99th percentile of 2394.8), so it would be appropriate to cap those values.
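
One possible way to cap such values is to clip each column at an upper quantile. The sketch below uses toy data; the helper name `cap_at_quantile` and the choice of the 99th percentile as threshold are assumptions for illustration, not a decision taken in this notebook.

```python
import pandas as pd


def cap_at_quantile(series, upper_q=0.99):
    """Clip values above the upper_q quantile down to that quantile."""
    upper = series.quantile(upper_q)
    return series.clip(upper=upper)


# Toy data: 999 ordinary values plus one extreme value (illustrative only).
toy = pd.DataFrame({"src_bytes": [100] * 999 + [1_000_000]})
toy["src_bytes_capped"] = cap_at_quantile(toy["src_bytes"])
```

If the capped feature feeds a model, the threshold would typically be computed on the training data and reused unchanged at inference time.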

Histograms of each numerical feature

In [10]:

# Sampling to speed up computation
sample_df = df.sample(frac=0.01)
for feat in numerical_features:
    h5(feat)
    sns.histplot(data=sample_df, x=feat)
    plt.show()

[Histograms: one per numerical feature, in the order listed above, from duration to dst_host_srv_rerror_rate]

Considerations

From the histograms above we notice that the numerical features exhibit highly skewed distributions, with data concentrated mainly on a few values. This suggests that models that rely on binning might have an edge over models that rely on distances and linear relationships between features and target. Moreover, the features have different scales, suggesting that we might need to rescale them if we decide to use certain models.
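
The skewness and concentration observed in the plots can also be quantified per feature. The following is a minimal sketch on toy data; the helper name `concentration_summary` and the two chosen statistics (sample skewness and the share of the most frequent value) are assumptions for illustration.

```python
import pandas as pd


def concentration_summary(frame):
    """Return skewness and the share of the most frequent value per column."""
    return pd.DataFrame({
        "skewness": frame.skew(),
        "top_value_share": frame.apply(
            lambda col: col.value_counts(normalize=True).iloc[0]
        ),
    })


# Toy data (illustrative only): one heavily skewed column, one symmetric column.
toy = pd.DataFrame({
    "skewed": [0] * 95 + [10, 20, 50, 100, 1000],
    "symmetric": list(range(100)),
})
summary = concentration_summary(toy)
```

On the real data, the same function could be called as `concentration_summary(df[numerical_features])` to rank features by how concentrated they are.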

Some features, such as dst_host_diff_srv_rate and especially src_bytes, contain extreme outliers. For src_bytes, I think the reason is that some cyber attacks rely heavily on overloading the target system by sending heavy requests. This column might be an interesting indicator for this type of attack.
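
A quick way to probe this hypothesis would be to compare src_bytes across the classes of attack_type. The sketch below runs on toy data with made-up values; the class label `flood_like` is hypothetical and does not come from the dataset.

```python
import pandas as pd

# Toy data (illustrative only); in the notebook this would be `df`.
toy = pd.DataFrame({
    "attack_type": ["normal", "normal", "normal",
                    "flood_like", "flood_like", "flood_like"],
    "src_bytes": [181, 239, 235, 1032, 1032, 1032],
})

# Median and maximum of src_bytes per class, largest median first.
by_class = (
    toy.groupby("attack_type")["src_bytes"]
    .agg(["median", "max"])
    .sort_values("median", ascending=False)
)
```

Classes whose median or maximum src_bytes is far from that of normal traffic would be the first candidates to inspect.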