Target Definition
In this notebook we define the target columns used to evaluate our models.
%load_ext autoreload
%autoreload
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from intrusion_detection.load_input_data import load_df
from intrusion_detection.preprocessing.preprocessing import remove_dot_from_attack_type_classes
from intrusion_detection.target_definition import define_target
df = load_df(
    file_path="../../../data/kddcup.data_10_percent",
    header_file="../../../data/kddcup.names",
)
df = remove_dot_from_attack_type_classes(df)
df.head()
| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | attack_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 1.0 | 0.0 | 0.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 1 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 1.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 2 | 0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 3 | 0 | tcp | http | SF | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 39 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 4 | 0 | tcp | http | SF | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 49 | 1.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
5 rows × 42 columns
Choose the anomalous classes
# Plot the number of connections for each attack type
sns.countplot(
    data=df,
    y="attack_type",
)
plt.title("Number of connections per attack type")
Text(0.5, 1.0, 'Number of connections per attack type')
The plot above shows that the three most frequent attack-type classes are smurf, neptune and normal; all other attack types are very rare. The class normal represents legitimate connections that are not cyber attacks. The classes smurf and neptune are DDOS attacks, a type of cyber attack in which attackers try to bring a system down by overloading it with requests, so it is natural that such connections outnumber even normal ones.
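The class frequencies read off the plot can also be tabulated directly. The snippet below is a minimal sketch on a toy series; the counts are made up for illustration and are not the dataset's real numbers. On the real data, the same calls would be applied to `df["attack_type"]`.

```python
import pandas as pd

# Toy stand-in for df["attack_type"]; these counts are invented for illustration.
attack_type = pd.Series(
    ["smurf"] * 6 + ["neptune"] * 3 + ["normal"] * 2 + ["back"] * 1,
    name="attack_type",
)

# Absolute counts, sorted from most to least frequent
counts = attack_type.value_counts()

# Relative frequencies; the shares sum to 1
shares = attack_type.value_counts(normalize=True)

summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)
```

A table like this makes it easy to see which classes dominate and which are rare enough to be treated as anomalies.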
To model this behavior and separate the normal class from the attack classes, I plan to use two models in cascade. The first model will specialize in identifying rare anomalies (all attacks except the DDOS ones), and the second will specialize in detecting connections that belong to DDOS attacks.
To better simulate a realistic scenario, where labels are not available, I will tackle the problem with unsupervised learning approaches and use the attack-type labels only for the final model evaluation.
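One possible reading of this plan is sketched below. The two flag arrays are hypothetical outputs of the two unsupervised models (no such models are defined in this notebook), and the combination rule, where a connection is reported if either stage flags it, is an assumption. The labels enter only in the last step, for evaluation.

```python
import numpy as np

# Hypothetical outputs of the two unsupervised models for 8 connections.
# True means "flagged"; no labels were used to produce these flags.
flag_rare_anomaly = np.array([False, True, False, False, False, False, True, False])
flag_ddos = np.array([False, False, True, True, True, False, False, False])

# Cascade (assumed rule): the second stage only matters for connections
# the first stage did not flag, so the final result is a logical OR.
pred_attack = flag_rare_anomaly | (~flag_rare_anomaly & flag_ddos)

# Ground-truth labels, used only here, for the final evaluation.
attack_type = np.array(
    ["normal", "back", "smurf", "smurf", "neptune", "normal", "normal", "satan"]
)
true_attack = attack_type != "normal"

tp = int(np.sum(pred_attack & true_attack))   # attacks correctly flagged
fp = int(np.sum(pred_attack & ~true_attack))  # normal connections flagged
fn = int(np.sum(~pred_attack & true_attack))  # attacks missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Keeping the labels out of everything before the evaluation step is what makes the setup comparable to a deployment where no labels exist.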
Define Target
Given the considerations in the previous section, we define two target columns, one for each problem. For the first column, target_anomaly, the positive class contains the attack types that are not DDOS attacks, and the negative class contains all other connections (including DDOS attacks). For the second column, target_ddos, the positive class contains the connections related to DDOS attacks, and the negative class contains all other connections (including the rare, non-DDOS attacks).
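The rule described above can be sketched as follows. This is a hypothetical stand-in, not the actual code of `define_target` from the `intrusion_detection` package: the function name `define_target_sketch`, the constant names, and the assumption that the DDOS classes are exactly smurf and neptune are all illustrative.

```python
import pandas as pd

# Assumption: the DDOS classes are the two named in this notebook.
DDOS_CLASSES = {"smurf", "neptune"}
NORMAL_CLASS = "normal"


def define_target_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for define_target; not the package's actual code."""
    out = df.copy()
    is_ddos = out["attack_type"].isin(DDOS_CLASSES)
    is_normal = out["attack_type"] == NORMAL_CLASS
    # Positive: attacks that are not DDOS; negative: normal and DDOS connections
    out["target_anomaly"] = (~is_ddos & ~is_normal).astype(int)
    # Positive: DDOS connections; negative: everything else
    out["target_ddos"] = is_ddos.astype(int)
    return out


toy = pd.DataFrame({"attack_type": ["normal", "smurf", "neptune", "back", "satan"]})
result = define_target_sketch(toy)
print(result)
```

With this encoding, no row is positive in both columns, and normal connections are negative in both.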
df = define_target(df)
sns.countplot(
    data=df,
    x="target_anomaly",
)
<AxesSubplot: xlabel='target_anomaly', ylabel='count'>
value_counts = df.target_anomaly.value_counts()
imbalance_ratio = value_counts[1] / (value_counts[1] + value_counts[0])
display(Markdown(f"For the first problem, we have an imbalance problem with an imbalance ratio of {imbalance_ratio:.2f}"))
For the first problem, we have an imbalance problem with an imbalance ratio of 0.02
sns.countplot(
    data=df,
    x="target_ddos",
)
<AxesSubplot: xlabel='target_ddos', ylabel='count'>
value_counts = df.target_ddos.value_counts()
imbalance_ratio = value_counts[0] / (value_counts[1] + value_counts[0])
display(Markdown(f"For the second problem, we have an imbalance problem with an imbalance ratio of {imbalance_ratio:.2f}"))
For the second problem, we have an imbalance problem with an imbalance ratio of 0.21
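The imbalance ratio used in this section is the share of one class among all connections, so it is a fraction between 0 and 1. The sketch below, on a toy target with made-up values, shows the same quantity computed with a small helper (`positive_share` is a hypothetical name, not part of the package) and how the format specifier decides whether it prints as a fraction or as a percentage.

```python
import pandas as pd


def positive_share(target: pd.Series) -> float:
    """Fraction of rows whose binary target equals 1."""
    return float((target == 1).mean())


# Toy binary target: 98 negatives and 2 positives (invented for illustration)
toy_target = pd.Series([0] * 98 + [1] * 2)
share = positive_share(toy_target)

as_fraction = f"{share:.2f}"  # plain fraction
as_percent = f"{share:.2%}"   # multiplied by 100, with a percent sign

print(as_fraction)
print(as_percent)
```

The `.2%` specifier multiplies by 100 and appends the percent sign itself, so it should not be combined with a literal `%` in the string.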