Modelling Anomalies
To simulate a realistic scenario in which labels are unavailable, the target labels will not be used during modelling; they will only be used for the final evaluation.
%load_ext autoreload
%autoreload
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from intrusion_detection.load_input_data import load_df
from intrusion_detection.preprocessing.preprocessing import remove_dot_from_attack_type_classes
from intrusion_detection.target_definition import define_target
from intrusion_detection.preprocessing.pipeline import get_preprocessing_pipeline
df = load_df(
    file_path="../../../data/kddcup.data_10_percent",
    header_file="../../../data/kddcup.names"
)
df = remove_dot_from_attack_type_classes(df)
Training Models
Given the considerations from the Exploratory Data Analysis, it makes sense to use unsupervised anomaly detection models that rely on binning or partitioning. I have chosen Isolation Forest and the Histogram-Based Outlier Score (HBOS). The former isolates anomalies by recursively splitting randomly chosen features at random thresholds and counting the number of splits needed to isolate each instance: the fewer splits, the higher the anomaly score. The latter builds a histogram for each feature and scores each instance by how rare its feature values are, i.e. how often they fall into low-frequency bins: the lower the bin densities, the higher the probability of being an outlier.
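As a rough illustration of the histogram-based idea, here is a minimal single-feature sketch of HBOS-style scoring on hypothetical toy data; pyod's HBOS generalizes this to all features and aggregates the log inverse densities per instance.
import numpy as np

rng = np.random.default_rng(0)
# toy data: one dense cluster plus a small injected cluster of outliers
x = np.concatenate([rng.normal(0, 1, 1000), rng.normal(8, 0.5, 10)])

# equal-width histogram; a point's score is the log inverse density of its bin
counts, edges = np.histogram(x, bins=10, density=True)
bin_idx = np.digitize(x, edges[1:-1])
scores = np.log(1.0 / (counts[bin_idx] + 1e-12))

top = np.argsort(scores)[-10:]
print(np.sort(x[top]))  # dominated by the injected cluster and extreme tails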
from sklearn.pipeline import Pipeline
from pyod.models.iforest import IForest
from pyod.models.hbos import HBOS
df_target = define_target(df)
preprocessing_pipeline = get_preprocessing_pipeline()
target_columns = [col for col in df_target.columns if "target" in col]
X = df_target.drop(target_columns, axis=1)
y = df_target["target_anomaly"]
Isolation Forest Pipeline
iforest_pipeline = Pipeline([*preprocessing_pipeline.steps, ("model", IForest(n_jobs=-1))])
iforest_pipeline
Pipeline(steps=[('drop_target', DropFeatures(features_to_drop=['attack_type'])),
                ('outlier_removal',
                 Winsorizer(add_indicators=True, capping_method='quantiles',
                            fold=0.01, variables=['src_bytes', 'dst_bytes'])),
                ('frequency_encoder',
                 KeepInputFeaturesWrapper(rename_suffix='_freq',
                                          wrapped_transformer=CountFrequencyEncoder(encoding_method='frequency',
                                                                                    unseen='encode'))),
                ('replace_rare_categories',
                 RareLabelEncoder(n_categories=2, tol=0.01)),
                ('one_hot_encoder', OneHotEncoder()),
                ('min_max_scaler',
                 SklearnTransformerWrapper(transformer=MinMaxScaler())),
                ('model',
                 IForest(behaviour='old', bootstrap=False, contamination=0.1,
                         max_features=1.0, max_samples='auto', n_estimators=100,
                         n_jobs=-1, random_state=None, verbose=0))])
from sklearn.model_selection import cross_val_predict
import warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=UserWarning)
    iforest_preds = cross_val_predict(iforest_pipeline, X, cv=10, method="predict_proba")
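Note that pyod models expose a scikit-learn-style predict_proba returning two columns, column 0 being the inlier probability and column 1 the outlier probability; the [:, 1] slices used later rely on this convention.
# two columns per sample: P(inlier), P(outlier)
print(iforest_preds.shape)  # (n_samples, 2)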
HBOS Pipeline
hbos_pipeline = Pipeline([*preprocessing_pipeline.steps, ("model", HBOS())])
hbos_pipeline
Pipeline(steps=[('drop_target', DropFeatures(features_to_drop=['attack_type'])),
                ('outlier_removal',
                 Winsorizer(add_indicators=True, capping_method='quantiles',
                            fold=0.01, variables=['src_bytes', 'dst_bytes'])),
                ('frequency_encoder',
                 KeepInputFeaturesWrapper(rename_suffix='_freq',
                                          wrapped_transformer=CountFrequencyEncoder(encoding_method='frequency',
                                                                                    unseen='encode'))),
                ('replace_rare_categories',
                 RareLabelEncoder(n_categories=2, tol=0.01)),
                ('one_hot_encoder', OneHotEncoder()),
                ('min_max_scaler',
                 SklearnTransformerWrapper(transformer=MinMaxScaler())),
                ('model',
                 HBOS(alpha=0.1, contamination=0.1, n_bins=10, tol=0.5))])
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=UserWarning)
    hbos_preds = cross_val_predict(hbos_pipeline, X, cv=10, method="predict_proba")
Prediction Comparisons
sns.histplot(iforest_preds[:, 1], label="IForest", color="tab:blue")
sns.histplot(hbos_preds[:, 1], label="HBOS", color="tab:orange")
plt.legend()
From the graph above we notice that the prediction distributions of the two models agree to some extent.
Both models consider most instances non-anomalous; they agree on about 5k instances being outliers and disagree on about 8k instances. Ensembling the two models with a voting scheme therefore seems sensible.
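The agreement counts can be reproduced with a cross-tabulation of the binarized scores (a minimal sketch, assuming a 0.5 score threshold for both models):
# cross-tabulate the two models' binary decisions at threshold 0.5
if_flag = iforest_preds[:, 1] > 0.5
hb_flag = hbos_preds[:, 1] > 0.5
print(pd.crosstab(if_flag, hb_flag, rownames=["iforest"], colnames=["hbos"]))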
Ensembled Model Evaluation
To make our models' predictions more robust, we ensemble them with a soft-voting ensemble, i.e. we take the average of their predicted probabilities.
ensembled_preds = (iforest_preds[:,1] + hbos_preds[:,1]) / 2
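The plain average weighs both models equally; if one model turned out to be more reliable, a weighted variant would be a natural extension (the weights below are purely hypothetical):
# hypothetical weighted soft vote; weights must sum to 1
w_iforest, w_hbos = 0.6, 0.4
weighted_preds = w_iforest * iforest_preds[:, 1] + w_hbos * hbos_preds[:, 1]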
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(y, ensembled_preds)
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(
    y, ensembled_preds > 0.5
)
Preliminary results at this stage look quite good. However, we need to keep in mind that the measured performance is heavily influenced by connections from DDoS attacks, which inflate the number of positives and are relatively easy to classify.
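A quick, hedged check of how much the two DDoS classes dominate the positives (assuming target_anomaly is a 0/1 flag):
# share of all connections belonging to the two dominant DDoS attack types
ddos_share = df.attack_type.isin(["smurf", "neptune"]).mean()
print(f"DDoS share of all connections: {ddos_share:.1%}")
print(f"Overall anomaly prevalence:    {y.mean():.1%}")  # y assumed 0/1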
Model Evaluation Without DDoS Classes
mask = ~df.attack_type.isin(["smurf", "neptune"])
RocCurveDisplay.from_predictions(y[mask], ensembled_preds[mask])
Indeed, when we exclude the DDoS attacks we observe a substantial drop in performance.
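To quantify the drop, we can compare the ROC AUC on all data with the AUC once the two DDoS classes are masked out (a minimal sketch using the mask defined above):
from sklearn.metrics import roc_auc_score

print(f"AUC, all classes:   {roc_auc_score(y, ensembled_preds):.3f}")
print(f"AUC, DDoS excluded: {roc_auc_score(y[mask], ensembled_preds[mask]):.3f}")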
Predictions by Attack Type
ensembled_thr = 0.5  # same threshold used for the confusion matrix above
prediction_by_attack_type = pd.DataFrame({
    "attack_type": df.attack_type.values,
    "classified_as_anomaly": (ensembled_preds > ensembled_thr).astype(int),
})
attack_anomaly_counts = pd.DataFrame(prediction_by_attack_type.value_counts()).rename(columns={0: "cross_counts"}).reset_index()
attack_counts = pd.DataFrame(prediction_by_attack_type.attack_type.value_counts()).rename(columns={"attack_type": "attack_counts"}).reset_index().rename(columns={"index": "attack_type"})
merged = attack_anomaly_counts.merge(attack_counts, on="attack_type")
merged["frac"] = merged.cross_counts / merged.attack_counts
merged
| | attack_type | classified_as_anomaly | cross_counts | attack_counts | frac |
|---|---|---|---|---|---|
| 0 | smurf | 0 | 280790 | 280790 | 1.000000 |
| 1 | neptune | 0 | 102610 | 107201 | 0.957174 |
| 2 | neptune | 1 | 4591 | 107201 | 0.042826 |
| 3 | normal | 0 | 94439 | 97278 | 0.970816 |
| 4 | normal | 1 | 2839 | 97278 | 0.029184 |
| 5 | back | 0 | 2098 | 2203 | 0.952338 |
| 6 | back | 1 | 105 | 2203 | 0.047662 |
| 7 | satan | 1 | 1383 | 1589 | 0.870359 |
| 8 | satan | 0 | 206 | 1589 | 0.129641 |
| 9 | ipsweep | 0 | 1233 | 1247 | 0.988773 |
| 10 | ipsweep | 1 | 14 | 1247 | 0.011227 |
| 11 | warezclient | 0 | 974 | 1020 | 0.954902 |
| 12 | warezclient | 1 | 46 | 1020 | 0.045098 |
| 13 | teardrop | 0 | 905 | 979 | 0.924413 |
| 14 | teardrop | 1 | 74 | 979 | 0.075587 |
| 15 | portsweep | 1 | 719 | 1040 | 0.691346 |
| 16 | portsweep | 0 | 321 | 1040 | 0.308654 |
| 17 | pod | 0 | 252 | 264 | 0.954545 |
| 18 | pod | 1 | 12 | 264 | 0.045455 |
| 19 | nmap | 0 | 230 | 231 | 0.995671 |
| 20 | nmap | 1 | 1 | 231 | 0.004329 |
| 21 | guess_passwd | 1 | 50 | 53 | 0.943396 |
| 22 | guess_passwd | 0 | 3 | 53 | 0.056604 |
| 23 | warezmaster | 0 | 20 | 20 | 1.000000 |
| 24 | land | 1 | 19 | 21 | 0.904762 |
| 25 | land | 0 | 2 | 21 | 0.095238 |
| 26 | buffer_overflow | 0 | 19 | 30 | 0.633333 |
| 27 | buffer_overflow | 1 | 11 | 30 | 0.366667 |
| 28 | rootkit | 0 | 10 | 10 | 1.000000 |
| 29 | imap | 1 | 9 | 12 | 0.750000 |
| 30 | imap | 0 | 3 | 12 | 0.250000 |
| 31 | loadmodule | 0 | 9 | 9 | 1.000000 |
| 32 | ftp_write | 0 | 7 | 8 | 0.875000 |
| 33 | ftp_write | 1 | 1 | 8 | 0.125000 |
| 34 | multihop | 0 | 6 | 7 | 0.857143 |
| 35 | multihop | 1 | 1 | 7 | 0.142857 |
| 36 | phf | 0 | 4 | 4 | 1.000000 |
| 37 | perl | 0 | 3 | 3 | 1.000000 |
| 38 | spy | 1 | 1 | 2 | 0.500000 |
| 39 | spy | 0 | 1 | 2 | 0.500000 |
sns.barplot(data=merged, x="frac", y="attack_type", hue="classified_as_anomaly")
In the graph above we see how correct and wrong predictions are spread across the different attack types. Note that the results are normalized by the number of instances of each attack type, so the blue and orange bars for each attack sum to 1.
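A quick sanity check of that normalization against the merged frame built above:
import numpy as np

# per attack type, the fractions over the two prediction classes sum to 1
assert np.allclose(merged.groupby("attack_type")["frac"].sum(), 1.0)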
Moreover, note that at this stage of modelling we are not interested in detecting attacks of type neptune and smurf, as we are going to use a separate model for those.
From the graph we also notice that the model produces relatively few false positives (around 3% of normal connections are classified as anomalies). It detects certain attack types much better than others: for example, it does very well on satan and guess_passwd but almost completely fails on attacks like ipsweep. In a future development we might complement it with a model that is more specialized on those attack types.