Target Definition

In this notebook we define the target columns used to evaluate our models.

In [3]:

%load_ext autoreload

In [4]:

%autoreload
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

from intrusion_detection.load_input_data import load_df
from intrusion_detection.preprocessing.preprocessing import remove_dot_from_attack_type_classes
from intrusion_detection.target_definition import define_target

In [7]:

df = load_df(
    file_path="../../../data/kddcup.data_10_percent",
    header_file="../../../data/kddcup.names"
)
df = remove_dot_from_attack_type_classes(df)

In [8]:

df.head()

Out[8]:

	protocol_type	service	flag	src_bytes	dst_bytes	...	dst_host_srv_count	dst_host_same_srv_rate	dst_host_same_src_port_rate	attack_type
0	tcp	http	SF	181	5450	...	9	1.0	0.11	normal
1	tcp	http	SF	239	486	...	19	1.0	0.05	normal
2	tcp	http	SF	235	1337	...	29	1.0	0.03	normal
3	tcp	http	SF	219	1337	...	39	1.0	0.03	normal
4	tcp	http	SF	217	2032	...	49	1.0	0.02	normal

5 rows × 42 columns

Choose the anomalous classes

In [9]:

# Calculate the frequency for each class
sns.countplot(
    data=df,
    y="attack_type",
)
plt.title("Number of connections per attack type")

Out[9]:

Text(0.5, 1.0, 'Number of connections per attack type')

The plot above shows that the three most frequent attack type classes are smurf, neptune and normal. All other attack types are very rare. The class normal represents ordinary connections that are not cyber attacks. The classes smurf and neptune are DDoS attacks, a type of cyber attack in which attackers try to break a system by overloading it with requests, so it is unsurprising that such connections outnumber even normal ones.

To model this behavior and distinguish the normal class from the attack classes, I plan to use two models in cascade. The first model will specialize in identifying rare anomalies (all attacks except the DDoS ones) and the second model will specialize in detecting connections that belong to DDoS attacks.
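The control flow of such a cascade can be sketched as follows. This is only an illustration of how two detectors might be chained, not the models used in this project: the function `cascade_predict`, the order in which the results are combined, and the two toy threshold rules are all assumptions made for the example. In practice each detector would be a fitted unsupervised model.

```python
from typing import Callable, Mapping

Connection = Mapping[str, float]
Detector = Callable[[Connection], bool]


def cascade_predict(
    connection: Connection,
    is_rare_anomaly: Detector,
    is_ddos: Detector,
) -> str:
    """Run two detectors in sequence and return a coarse label."""
    # Stage 1: the detector specialized in rare anomalies.
    if is_rare_anomaly(connection):
        return "anomaly"
    # Stage 2: the detector specialized in DDoS traffic.
    if is_ddos(connection):
        return "ddos"
    return "normal"


# Hypothetical stand-ins for the two models (thresholds are arbitrary).
def toy_rare_anomaly_detector(connection: Connection) -> bool:
    return connection["num_failed_logins"] > 0


def toy_ddos_detector(connection: Connection) -> bool:
    return connection["count"] > 100


flood = {"num_failed_logins": 0, "count": 511}
label = cascade_predict(flood, toy_rare_anomaly_detector, toy_ddos_detector)
```

Passing the detectors in as callables keeps the cascade logic independent of whichever models are eventually chosen for each stage.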

To better simulate a realistic scenario, where labels are not available, I will tackle the problem with unsupervised learning approaches and use the attack type labels only for the final model evaluation.

Define Target

Given the considerations in the previous section, we define two target columns, one for each problem. For the first column, target_anomaly, the positive class consists of the attack types that are not DDoS attacks, and the negative class consists of all other connections (including DDoS attacks). For the second column, target_ddos, the positive class consists of connections related to DDoS attacks, and the negative class consists of all other connections (including the rare attacks).
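The definition above can be sketched in pandas as follows. This is not the code of `define_target` from the `intrusion_detection` package, which is not shown in this notebook. The function name `define_target_sketch`, the constants, and the 0/1 integer encoding are assumptions made for illustration. The only things taken from the text are the column names and the rule that smurf and neptune are the DDoS classes.

```python
import pandas as pd

# Assumption: smurf and neptune are the only classes treated as DDoS.
DDOS_CLASSES = ("smurf", "neptune")
NORMAL_CLASS = "normal"


def define_target_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the two binary target columns added."""
    out = df.copy()
    is_ddos = out["attack_type"].isin(DDOS_CLASSES)
    is_normal = out["attack_type"] == NORMAL_CLASS
    # Positive class: connections that belong to a DDoS attack.
    out["target_ddos"] = is_ddos.astype(int)
    # Positive class: attacks that are neither DDoS nor normal traffic.
    out["target_anomaly"] = (~is_ddos & ~is_normal).astype(int)
    return out


toy = pd.DataFrame(
    {"attack_type": ["normal", "smurf", "neptune", "portsweep", "normal"]}
)
labelled = define_target_sketch(toy)
```

In this toy frame only the portsweep row is positive for target_anomaly, while the smurf and neptune rows are positive for target_ddos.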

In [10]:

df = define_target(df)

In [11]:

sns.countplot(
    data=df,
    x='target_anomaly',
)

Out[11]:

<AxesSubplot: xlabel='target_anomaly', ylabel='count'>

In [17]:

value_counts = df.target_anomaly.value_counts()
# Fraction of connections that belong to the positive class (rare attacks)
imbalance_ratio = value_counts[1] / (value_counts[1] + value_counts[0])

display(Markdown(f"For the first problem, the classes are imbalanced: the positive class accounts for a fraction of {imbalance_ratio:.2f} of all connections."))

For the first problem, the classes are imbalanced: the positive class accounts for a fraction of 0.02 of all connections.

In [13]:

sns.countplot(
    data=df,
    x='target_ddos',
)

Out[13]:

<AxesSubplot: xlabel='target_ddos', ylabel='count'>

In [16]:

value_counts = df.target_ddos.value_counts()
# Fraction of connections that belong to the negative class (not DDoS)
imbalance_ratio = value_counts[0] / (value_counts[1] + value_counts[0])

display(Markdown(f"For the second problem, the classes are imbalanced: the negative class accounts for a fraction of {imbalance_ratio:.2f} of all connections."))

For the second problem, the classes are imbalanced: the negative class accounts for a fraction of 0.21 of all connections.
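As a possible alternative to dividing the counts by hand, the class shares can be read directly from `value_counts(normalize=True)`. The sketch below uses a small made-up Series rather than the notebook's data, and the helper name `class_fractions` is hypothetical.

```python
import pandas as pd


def class_fractions(target: pd.Series) -> pd.Series:
    """Return the fraction of rows in each class, indexed by class label."""
    return target.value_counts(normalize=True).sort_index()


# Toy target: three negative rows and one positive row.
toy_target = pd.Series([0, 0, 0, 1])
fractions = class_fractions(toy_target)

# The "%" presentation type multiplies by 100 before formatting.
print(f"Positive class share: {fractions.loc[1]:.2%}")
# prints "Positive class share: 25.00%"
```

Formatting with `:.2%` avoids having to choose between reporting a raw fraction and a percentage, because the conversion happens inside the format specifier.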