Exploratory Data Analysis
In this notebook we inspect the available data and perform a univariate analysis of the feature distributions.
%load_ext autoreload
%autoreload
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from intrusion_detection.load_input_data import load_df
from intrusion_detection.preprocessing.preprocessing import remove_dot_from_attack_type_classes
from functools import partial
from IPython.display import HTML, display, Markdown
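def header(text, level):
    # Render `text` as an HTML heading of the given level (h1..h5)
    display(HTML(f"<h{level}>{text}</h{level}>"))

# Create the shortcuts h1(text), ..., h5(text)
for level in range(1, 6):
    globals()[f"h{level}"] = partial(header, level=level)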
df = load_df(
file_path="../../../data/kddcup.data_10_percent",
header_file="../../../data/kddcup.names"
)
df = remove_dot_from_attack_type_classes(df)
df.head()
| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_srv_count | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | attack_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | http | SF | 181 | 5450 | 0 | 0 | 0 | 0 | ... | 9 | 1.0 | 0.0 | 0.11 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 1 | 0 | tcp | http | SF | 239 | 486 | 0 | 0 | 0 | 0 | ... | 19 | 1.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 2 | 0 | tcp | http | SF | 235 | 1337 | 0 | 0 | 0 | 0 | ... | 29 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 3 | 0 | tcp | http | SF | 219 | 1337 | 0 | 0 | 0 | 0 | ... | 39 | 1.0 | 0.0 | 0.03 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
| 4 | 0 | tcp | http | SF | 217 | 2032 | 0 | 0 | 0 | 0 | ... | 49 | 1.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | normal |
5 rows × 42 columns
target_columns = ["attack_type"]
feature_columns = [col for col in df.columns if col not in target_columns]
feature_columns_dtypes = df[feature_columns].dtypes
NaNs
null_values = df.isnull().sum()
null_values[null_values>0]
Series([], dtype: int64)
The dataset does not contain NaN values.
Categorical Features
List of categorical features:
categorical_features = feature_columns_dtypes[
feature_columns_dtypes == object
].index.to_list()
nl = "\n- "
display(Markdown(f"The categorical features in our dataset are: {nl + nl.join(categorical_features)}"))
The categorical features in our dataset are:
- protocol_type
- service
- flag
Count plots of each categorical feature
for feat in categorical_features:
    h5(feat)
    ax = sns.countplot(data=df, y=feat)
    # Annotate each bar with its count
    ax.bar_label(ax.containers[0])
    plt.show()
[Figures: count plots for protocol_type, service, and flag]
Considerations
From the plots above we notice that, for the first categorical feature, protocol_type, each of the 3 categories is well represented in our dataset, which suggests that we can encode protocol_type with one-hot encoding. For the other two categorical features, service and flag, some categories appear very rarely, sometimes in less than 1% of the cases. For these two features we can group the rare categories into a single bucket and then apply one-hot encoding. To avoid losing the information discarded by the grouping, we can also add frequency encodings of service and flag.
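The encoding idea described above can be sketched as follows. This is only an illustration on toy data, not the preprocessing actually used in the project: the helper names `group_rare_categories` and `frequency_encode`, the `"other"` label, and the frequency threshold are all assumptions made for the example.

```python
import pandas as pd


def group_rare_categories(series, min_frequency=0.01, other_label="other"):
    """Replace categories whose relative frequency is below min_frequency."""
    frequencies = series.value_counts(normalize=True)
    rare = frequencies[frequencies < min_frequency].index
    # Keep the original value unless it belongs to a rare category
    return series.where(~series.isin(rare), other_label)


def frequency_encode(series):
    """Map each category to its relative frequency in the series."""
    return series.map(series.value_counts(normalize=True))


# Toy stand-in for the `service` column (the counts are illustrative only)
toy = pd.DataFrame(
    {"service": ["http"] * 60 + ["smtp"] * 38 + ["tftp_u", "pm_dump"]}
)

# Frequency encoding is computed on the ungrouped categories,
# so it is not affected by the bucketing of rare values
toy["service_freq"] = frequency_encode(toy["service"])

# Bucket rare categories, then one-hot encode the grouped column
toy["service_grouped"] = group_rare_categories(toy["service"], min_frequency=0.02)
encoded = pd.get_dummies(toy, columns=["service_grouped"], prefix="service")
```

In a modelling pipeline, the category frequencies would normally be estimated on the training split only and then reused on validation and test data.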
Numeric features
numerical_features = feature_columns_dtypes[
feature_columns_dtypes != object
].index.to_list()
nl = "\n- "
display(Markdown(f"The numerical features in our dataset are: {nl + nl.join(numerical_features)}"))
The numerical features in our dataset are:
- duration
- src_bytes
- dst_bytes
- land
- wrong_fragment
- urgent
- hot
- num_failed_logins
- logged_in
- num_compromised
- root_shell
- su_attempted
- num_root
- num_file_creations
- num_shells
- num_access_files
- num_outbound_cmds
- is_host_login
- is_guest_login
- count
- srv_count
- serror_rate
- srv_serror_rate
- rerror_rate
- srv_rerror_rate
- same_srv_rate
- diff_srv_rate
- srv_diff_host_rate
- dst_host_count
- dst_host_srv_count
- dst_host_same_srv_rate
- dst_host_diff_srv_rate
- dst_host_same_src_port_rate
- dst_host_srv_diff_host_rate
- dst_host_serror_rate
- dst_host_srv_serror_rate
- dst_host_rerror_rate
- dst_host_srv_rerror_rate
df.describe(percentiles=[0.01, 0.05, .25, .5, .75, 0.95, 0.99]).T
| | count | mean | std | min | 1% | 5% | 25% | 50% | 75% | 95% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| duration | 494021.0 | 47.979302 | 707.746472 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 88.000 | 58329.0 |
| src_bytes | 494021.0 | 3025.610296 | 988218.101050 | 0.0 | 0.00 | 0.00 | 45.00 | 520.0 | 1032.00 | 1032.00 | 2394.800 | 693375640.0 |
| dst_bytes | 494021.0 | 868.532425 | 33040.001252 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 2417.00 | 12260.000 | 5155468.0 |
| land | 494021.0 | 0.000045 | 0.006673 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 1.0 |
| wrong_fragment | 494021.0 | 0.006433 | 0.134805 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 3.0 |
| urgent | 494021.0 | 0.000014 | 0.005510 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 3.0 |
| hot | 494021.0 | 0.034519 | 0.782103 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 30.0 |
| num_failed_logins | 494021.0 | 0.000152 | 0.015520 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 5.0 |
| logged_in | 494021.0 | 0.148247 | 0.355345 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | 1.000 | 1.0 |
| num_compromised | 494021.0 | 0.010212 | 1.798326 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 884.0 |
| root_shell | 494021.0 | 0.000111 | 0.010551 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 1.0 |
| su_attempted | 494021.0 | 0.000036 | 0.007793 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 2.0 |
| num_root | 494021.0 | 0.011352 | 2.012718 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 993.0 |
| num_file_creations | 494021.0 | 0.001083 | 0.096416 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 28.0 |
| num_shells | 494021.0 | 0.000109 | 0.011020 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 2.0 |
| num_access_files | 494021.0 | 0.001008 | 0.036482 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 8.0 |
| num_outbound_cmds | 494021.0 | 0.000000 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 0.0 |
| is_host_login | 494021.0 | 0.000000 | 0.000000 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 0.0 |
| is_guest_login | 494021.0 | 0.001387 | 0.037211 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.000 | 1.0 |
| count | 494021.0 | 332.285690 | 213.147412 | 0.0 | 1.00 | 1.00 | 117.00 | 510.0 | 511.00 | 511.00 | 511.000 | 511.0 |
| srv_count | 494021.0 | 292.906557 | 246.322817 | 0.0 | 1.00 | 1.00 | 10.00 | 510.0 | 511.00 | 511.00 | 511.000 | 511.0 |
| serror_rate | 494021.0 | 0.176687 | 0.380717 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | 1.000 | 1.0 |
| srv_serror_rate | 494021.0 | 0.176609 | 0.381017 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | 1.000 | 1.0 |
| rerror_rate | 494021.0 | 0.057433 | 0.231623 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | 1.000 | 1.0 |
| srv_rerror_rate | 494021.0 | 0.057719 | 0.232147 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | 1.000 | 1.0 |
| same_srv_rate | 494021.0 | 0.791547 | 0.388189 | 0.0 | 0.01 | 0.03 | 1.00 | 1.0 | 1.00 | 1.00 | 1.000 | 1.0 |
| diff_srv_rate | 494021.0 | 0.020982 | 0.082205 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.07 | 0.208 | 1.0 |
| srv_diff_host_rate | 494021.0 | 0.028997 | 0.142397 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.14 | 1.000 | 1.0 |
| dst_host_count | 494021.0 | 232.470778 | 64.745380 | 0.0 | 3.00 | 33.00 | 255.00 | 255.0 | 255.00 | 255.00 | 255.000 | 255.0 |
| dst_host_srv_count | 494021.0 | 188.665670 | 106.040437 | 0.0 | 1.00 | 3.00 | 46.00 | 255.0 | 255.00 | 255.00 | 255.000 | 255.0 |
| dst_host_same_srv_rate | 494021.0 | 0.753780 | 0.410781 | 0.0 | 0.00 | 0.02 | 0.41 | 1.0 | 1.00 | 1.00 | 1.000 | 1.0 |
| dst_host_diff_srv_rate | 494021.0 | 0.030906 | 0.109259 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.04 | 0.08 | 0.820 | 1.0 |
| dst_host_same_src_port_rate | 494021.0 | 0.601935 | 0.481309 | 0.0 | 0.00 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | 1.000 | 1.0 |
| dst_host_srv_diff_host_rate | 494021.0 | 0.006684 | 0.042133 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.03 | 0.150 | 1.0 |
| dst_host_serror_rate | 494021.0 | 0.176754 | 0.380593 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | 1.000 | 1.0 |
| dst_host_srv_serror_rate | 494021.0 | 0.176443 | 0.380919 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | 1.000 | 1.0 |
| dst_host_rerror_rate | 494021.0 | 0.058118 | 0.230590 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | 1.000 | 1.0 |
| dst_host_srv_rerror_rate | 494021.0 | 0.057412 | 0.230140 | 0.0 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | 1.000 | 1.0 |
By looking at the table above, we notice that the features:
- src_bytes
- dst_bytes
have maximum values far beyond their 95th and 99th percentiles: for src_bytes the maximum is 693,375,640 against a 99th percentile of 2,394.8, and for dst_bytes it is 5,155,468 against 12,260. It would be appropriate to cap these values.
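One possible way to cap such values is to clip each column at a high quantile. The sketch below runs on a small toy frame; the helper name `cap_at_quantile` and the chosen quantile are assumptions made for illustration, not the capping rule adopted in this project.

```python
import pandas as pd


def cap_at_quantile(df, columns, upper=0.99):
    """Return a copy of df with each listed column clipped at its `upper` quantile."""
    capped = df.copy()
    for col in columns:
        threshold = capped[col].quantile(upper)
        capped[col] = capped[col].clip(upper=threshold)
    return capped


# Toy values loosely inspired by the describe() table (illustrative only)
toy = pd.DataFrame(
    {
        "src_bytes": [0, 45, 520, 1032, 1032, 693375640],
        "dst_bytes": [0, 0, 0, 2417, 12260, 5155468],
    }
)

# A lower quantile is used here only because the toy frame has six rows
capped = cap_at_quantile(toy, ["src_bytes", "dst_bytes"], upper=0.8)
```

The original frame is left untouched. When the data is split for modelling, the thresholds would typically be computed on the training set and then applied unchanged to the other splits.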
Histograms of each numerical feature
# Sampling to speed up computation
sample_df = df.sample(frac=0.01)
for feat in numerical_features:
    h5(feat)
    sns.histplot(data=sample_df, x=feat)
    plt.show()
[Figures: one histogram per numerical feature listed above, from duration to dst_host_srv_rerror_rate, computed on the 1% sample]
Considerations
From the histograms above we notice that the numerical features exhibit highly skewed distributions, with the data concentrated on a few values. This suggests that models that rely on binning, such as tree-based models, might have an edge over models that rely on distances and on linear relationships between features and target. Moreover, the features have different scales, so we might need to rescale them if we decide to use certain models.
Some features, such as dst_host_diff_srv_rate and especially src_bytes, contain extreme outliers. For src_bytes, a plausible explanation is that some cyber attacks rely heavily on overloading the target system with very large requests. This column might therefore be an interesting indicator for these types of attacks.
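If a distance-based or linear model is tried, one common option for heavy-tailed, non-negative counts is a log transform followed by rescaling. The sketch below uses toy values; the choice of `np.log1p` and min-max scaling is an assumption made for illustration and is not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a heavy-tailed column such as src_bytes (illustrative values)
toy = pd.DataFrame({"src_bytes": [0, 45, 520, 1032, 1032, 693375640]})

# log1p compresses the long right tail and is defined at zero
log_bytes = np.log1p(toy["src_bytes"])

# Min-max rescaling brings the transformed feature into [0, 1]
scaled = (log_bytes - log_bytes.min()) / (log_bytes.max() - log_bytes.min())

# Sample skewness before and after the transform
skew_before = toy["src_bytes"].skew()
skew_after = log_bytes.skew()
```

On this toy column the skewness decreases after the transform, while the ordering of the values is preserved, so threshold-based models are unaffected by it.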