Modelling Anomalies
To simulate a realistic scenario in which labels are unavailable, the target labels will not be used during modelling; they will only be used for the final evaluation.
%load_ext autoreload
%autoreload
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from intrusion_detection.load_input_data import load_df
from intrusion_detection.preprocessing.preprocessing import remove_dot_from_attack_type_classes
from intrusion_detection.target_definition import define_target
from intrusion_detection.preprocessing.pipeline import get_preprocessing_pipeline
df = load_df(
    file_path="../../../data/kddcup.data_10_percent",
    header_file="../../../data/kddcup.names"
)
df = remove_dot_from_attack_type_classes(df)
Training Models
Given the considerations from the Exploratory Data Analysis, it makes sense to use unsupervised anomaly detection models that rely on binning or partitioning. I have chosen Isolation Forest and the Histogram-Based Outlier Score (HBOS). The former isolates anomalies by recursively splitting randomly chosen features at random thresholds and counting the number of splits needed to isolate each instance: the fewer splits, the higher the anomaly score. The latter builds a histogram for each feature and scores each instance by how rare its feature values are, i.e. how often they fall into low-frequency bins: the lower the bin densities, the higher the probability of being an outlier.
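As a rough illustration of the histogram-based idea, here is a minimal single-feature sketch of HBOS-style scoring on hypothetical toy data; pyod's HBOS generalizes this to all features and aggregates the log inverse densities per instance.
import numpy as np

rng = np.random.default_rng(0)
# toy data: one dense cluster plus a small injected cluster of outliers
x = np.concatenate([rng.normal(0, 1, 1000), rng.normal(8, 0.5, 10)])

# equal-width histogram; a point's score is the log inverse density of its bin
counts, edges = np.histogram(x, bins=10, density=True)
bin_idx = np.digitize(x, edges[1:-1])
scores = np.log(1.0 / (counts[bin_idx] + 1e-12))

top = np.argsort(scores)[-10:]
print(np.sort(x[top]))  # dominated by the injected cluster and extreme tails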
from sklearn.pipeline import Pipeline
from pyod.models.iforest import IForest
from pyod.models.hbos import HBOS
df_target = define_target(df)
preprocessing_pipeline = get_preprocessing_pipeline()
target_columns = [col for col in df_target.columns if "target" in col]
X = df_target.drop(target_columns, axis=1)
y = df_target["target_anomaly"]
Isolation Forest Pipeline
iforest_pipeline = Pipeline([*preprocessing_pipeline.steps, ("model", IForest(n_jobs=-1))])
iforest_pipeline
Pipeline(steps=[('drop_target', DropFeatures(features_to_drop=['attack_type'])),
                ('outlier_removal',
                 Winsorizer(add_indicators=True, capping_method='quantiles',
                            fold=0.01, variables=['src_bytes', 'dst_bytes'])),
                ('frequency_encoder',
                 KeepInputFeaturesWrapper(rename_suffix='_freq',
                                          wrapped_transformer=CountFrequencyEncoder(encoding_method='frequency',
                                                                                    unseen='encode'))),
                ('replace_rare_categories',
                 RareLabelEncoder(n_categories=2, tol=0.01)),
                ('one_hot_encoder', OneHotEncoder()),
                ('min_max_scaler',
                 SklearnTransformerWrapper(transformer=MinMaxScaler())),
                ('model',
                 IForest(behaviour='old', bootstrap=False, contamination=0.1,
                         max_features=1.0, max_samples='auto', n_estimators=100,
                         n_jobs=-1, random_state=None, verbose=0))])
from sklearn.model_selection import cross_val_predict
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=UserWarning)
    iforest_preds = cross_val_predict(iforest_pipeline, X, cv=10, method="predict_proba")
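Note that pyod models expose a scikit-learn-style predict_proba returning two columns, column 0 being the inlier probability and column 1 the outlier probability; the [:, 1] slices used later rely on this convention.
# two columns per sample: P(inlier), P(outlier)
print(iforest_preds.shape)  # (n_samples, 2)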
HBOS Pipeline
hbos_pipeline = Pipeline([*preprocessing_pipeline.steps, ("model", HBOS())])
hbos_pipeline
Pipeline(steps=[('drop_target', DropFeatures(features_to_drop=['attack_type'])),
                ('outlier_removal',
                 Winsorizer(add_indicators=True, capping_method='quantiles',
                            fold=0.01, variables=['src_bytes', 'dst_bytes'])),
                ('frequency_encoder',
                 KeepInputFeaturesWrapper(rename_suffix='_freq',
                                          wrapped_transformer=CountFrequencyEncoder(encoding_method='frequency',
                                                                                    unseen='encode'))),
                ('replace_rare_categories',
                 RareLabelEncoder(n_categories=2, tol=0.01)),
                ('one_hot_encoder', OneHotEncoder()),
                ('min_max_scaler',
                 SklearnTransformerWrapper(transformer=MinMaxScaler())),
                ('model',
                 HBOS(alpha=0.1, contamination=0.1, n_bins=10, tol=0.5))])
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=UserWarning)
    hbos_preds = cross_val_predict(hbos_pipeline, X, cv=10, method="predict_proba")
Prediction Comparisons
sns.histplot(iforest_preds[:, 1], label="IForest", color="tab:blue")
sns.histplot(hbos_preds[:, 1], label="HBOS", color="tab:orange")
plt.legend()
From the graph above we notice that the prediction distributions of the two models agree to some extent.
Both models consider most instances non-anomalous; they agree on about 5k instances being outliers and disagree on about 8k instances. Ensembling the two models with a voting scheme therefore seems sensible.
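The agreement counts can be reproduced with a cross-tabulation of the binarized scores (a minimal sketch, assuming a 0.5 score threshold for both models):
# cross-tabulate the two models' binary decisions at threshold 0.5
if_flag = iforest_preds[:, 1] > 0.5
hb_flag = hbos_preds[:, 1] > 0.5
print(pd.crosstab(if_flag, hb_flag, rownames=["iforest"], colnames=["hbos"]))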
Ensembled Model Evaluation
To make our models' predictions more robust, we ensemble them with a soft-voting ensemble, i.e. we take the average of their predicted probabilities.
ensembled_preds = (iforest_preds[:,1] + hbos_preds[:,1]) / 2
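The plain average weighs both models equally; if one model turned out to be more reliable, a weighted variant would be a natural extension (the weights below are purely hypothetical):
# hypothetical weighted soft vote; weights must sum to 1
w_iforest, w_hbos = 0.6, 0.4
weighted_preds = w_iforest * iforest_preds[:, 1] + w_hbos * hbos_preds[:, 1]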
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(y, ensembled_preds)
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(
    y, ensembled_preds > 0.5
)
Preliminary results at this stage look quite good. However, we need to keep in mind that the measured performance is heavily influenced by connections from DDoS attacks, which inflate the number of positives and are relatively easy to classify.
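A quick, hedged check of how much the two DDoS classes dominate the positives (assuming target_anomaly is a 0/1 flag):
# share of all connections belonging to the two dominant DDoS attack types
ddos_share = df.attack_type.isin(["smurf", "neptune"]).mean()
print(f"DDoS share of all connections: {ddos_share:.1%}")
print(f"Overall anomaly prevalence:    {y.mean():.1%}")  # y assumed 0/1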
Model Evaluation Without DDoS Classes
mask = ~df.attack_type.isin(["smurf", "neptune"])
RocCurveDisplay.from_predictions(y[mask], ensembled_preds[mask])
Indeed, when we exclude the DDoS attacks we observe a substantial drop in performance.
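To quantify the drop, we can compare the ROC AUC on all data with the AUC once the two DDoS classes are masked out (a minimal sketch using the mask defined above):
from sklearn.metrics import roc_auc_score

print(f"AUC, all classes:   {roc_auc_score(y, ensembled_preds):.3f}")
print(f"AUC, DDoS excluded: {roc_auc_score(y[mask], ensembled_preds[mask]):.3f}")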
Predictions by Attack Type
ensembled_thr = 0.5  # same threshold used for the confusion matrix above
prediction_by_attack_type = pd.DataFrame({
    "attack_type": df.attack_type.values,
    "classified_as_anomaly": (ensembled_preds > ensembled_thr).astype(int),
})
attack_anomaly_counts = pd.DataFrame(prediction_by_attack_type.value_counts()).rename(columns={0: "cross_counts"}).reset_index()
attack_counts = pd.DataFrame(prediction_by_attack_type.attack_type.value_counts()).rename(columns={"attack_type": "attack_counts"}).reset_index().rename(columns={"index": "attack_type"})
merged = attack_anomaly_counts.merge(attack_counts, on="attack_type")
merged["frac"] = merged.cross_counts / merged.attack_counts
merged
| | attack_type | classified_as_anomaly | cross_counts | attack_counts | frac |
|---|---|---|---|---|---|
| 0 | smurf | 0 | 280790 | 280790 | 1.000000 |
| 1 | neptune | 0 | 102610 | 107201 | 0.957174 |
| 2 | neptune | 1 | 4591 | 107201 | 0.042826 |
| 3 | normal | 0 | 94439 | 97278 | 0.970816 |
| 4 | normal | 1 | 2839 | 97278 | 0.029184 |
| 5 | back | 0 | 2098 | 2203 | 0.952338 |
| 6 | back | 1 | 105 | 2203 | 0.047662 |
| 7 | satan | 1 | 1383 | 1589 | 0.870359 |
| 8 | satan | 0 | 206 | 1589 | 0.129641 |
| 9 | ipsweep | 0 | 1233 | 1247 | 0.988773 |
| 10 | ipsweep | 1 | 14 | 1247 | 0.011227 |
| 11 | warezclient | 0 | 974 | 1020 | 0.954902 |
| 12 | warezclient | 1 | 46 | 1020 | 0.045098 |
| 13 | teardrop | 0 | 905 | 979 | 0.924413 |
| 14 | teardrop | 1 | 74 | 979 | 0.075587 |
| 15 | portsweep | 1 | 719 | 1040 | 0.691346 |
| 16 | portsweep | 0 | 321 | 1040 | 0.308654 |
| 17 | pod | 0 | 252 | 264 | 0.954545 |
| 18 | pod | 1 | 12 | 264 | 0.045455 |
| 19 | nmap | 0 | 230 | 231 | 0.995671 |
| 20 | nmap | 1 | 1 | 231 | 0.004329 |
| 21 | guess_passwd | 1 | 50 | 53 | 0.943396 |
| 22 | guess_passwd | 0 | 3 | 53 | 0.056604 |
| 23 | warezmaster | 0 | 20 | 20 | 1.000000 |
| 24 | land | 1 | 19 | 21 | 0.904762 |
| 25 | land | 0 | 2 | 21 | 0.095238 |
| 26 | buffer_overflow | 0 | 19 | 30 | 0.633333 |
| 27 | buffer_overflow | 1 | 11 | 30 | 0.366667 |
| 28 | rootkit | 0 | 10 | 10 | 1.000000 |
| 29 | imap | 1 | 9 | 12 | 0.750000 |
| 30 | imap | 0 | 3 | 12 | 0.250000 |
| 31 | loadmodule | 0 | 9 | 9 | 1.000000 |
| 32 | ftp_write | 0 | 7 | 8 | 0.875000 |
| 33 | ftp_write | 1 | 1 | 8 | 0.125000 |
| 34 | multihop | 0 | 6 | 7 | 0.857143 |
| 35 | multihop | 1 | 1 | 7 | 0.142857 |
| 36 | phf | 0 | 4 | 4 | 1.000000 |
| 37 | perl | 0 | 3 | 3 | 1.000000 |
| 38 | spy | 1 | 1 | 2 | 0.500000 |
| 39 | spy | 0 | 1 | 2 | 0.500000 |
sns.barplot(data=merged, x="frac", y="attack_type", hue="classified_as_anomaly")
In the graph above we see how correct and wrong predictions are spread across the different attack types. Note that the results are normalized by the number of instances of each attack type, so the blue and orange bars for each attack sum to 1.
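A quick sanity check of that normalization against the merged frame built above:
import numpy as np

# per attack type, the fractions over the two prediction classes sum to 1
assert np.allclose(merged.groupby("attack_type")["frac"].sum(), 1.0)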
Moreover, note that at this stage of modelling we are not interested in detecting attacks of type neptune and smurf, as we are going to use a separate model for those.
From the graph we also notice that the model produces relatively few false positives (around 3% of normal connections are classified as anomalies). It detects certain attack types much better than others: for example, it does very well on satan and guess_passwd but almost completely fails on attacks like ipsweep. In a future development we might complement it with a model that is more specialized on those attack types.