Feature Engineering
%load_ext autoreload
The input dataset already contains a lot of engineered features. Moreover, I lack the domain expertise to suggest which additional features would make sense to engineer. For this reason, I decided not to engineer new features, but to focus on preprocessing the existing ones. The preprocessing pipeline proposed in this project is the following:
%autoreload
from sklearn import set_config
from intrusion_detection.preprocessing.pipeline import get_preprocessing_pipeline
from intrusion_detection.preprocessing.preprocessing import remove_dot_from_attack_type_classes
from intrusion_detection.load_input_data import load_df
set_config(display="diagram")
preprocessing_pipeline = get_preprocessing_pipeline()
preprocessing_pipeline
Pipeline(steps=[('drop_target', DropFeatures(features_to_drop=['attack_type'])),
                ('outlier_removal',
                 Winsorizer(add_indicators=True, capping_method='quantiles',
                            fold=0.05, variables=['src_bytes', 'dst_bytes'])),
                ('frequency_encoder',
                 KeepInputFeaturesWrapper(rename_suffix='_freq',
                                          wrapped_transformer=CountFrequencyEncoder(encoding_method='frequency'))),
                ('replace_rare_categories',
                 RareLabelEncoder(n_categories=2, tol=0.01)),
                ('one_hot_encoder', OneHotEncoder()),
                ('min_max_scaler',
                 SklearnTransformerWrapper(transformer=MinMaxScaler()))])
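For reference, the sketch below shows how a similar pipeline could be assembled with scikit-learn and feature_engine. It is a simplified, hypothetical re-creation, not the project's get_preprocessing_pipeline: the frequency-encoding step is left out because it relies on the project-specific KeepInputFeaturesWrapper, which keeps the original categorical columns alongside their frequency-encoded copies (suffix _freq).
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from feature_engine.encoding import OneHotEncoder, RareLabelEncoder
from feature_engine.outliers import Winsorizer
from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper

# Simplified sketch: the frequency-encoding step (custom wrapper) is omitted.
sketch_pipeline = Pipeline(steps=[
    # remove the label column so only the features are transformed
    ("drop_target", DropFeatures(features_to_drop=["attack_type"])),
    # cap the right tail of the byte counters at the 95th percentile
    # and add *_right indicator columns flagging the capped rows
    ("outlier_removal", Winsorizer(capping_method="quantiles", tail="right",
                                   fold=0.05, add_indicators=True,
                                   variables=["src_bytes", "dst_bytes"])),
    # group categories rarer than 1% into a single 'Rare' label
    ("replace_rare_categories", RareLabelEncoder(tol=0.01, n_categories=2)),
    # one binary column per remaining category of each categorical feature
    ("one_hot_encoder", OneHotEncoder()),
    # scale every column to the [0, 1] range
    ("min_max_scaler", SklearnTransformerWrapper(transformer=MinMaxScaler())),
])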
Numerical Features Preprocessing
After dropping the target column, the first transformation of the pipeline deals with outlier values. As discussed in the exploratory data analysis section, the Winsorizer caps the features src_bytes and dst_bytes at the 95th percentile (quantile capping with fold=0.05) and adds an indicator column (src_bytes_right, dst_bytes_right) that flags the rows whose values were capped. In the last step of the pipeline, all features (numerical and not) are scaled to the [0, 1] range.
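To make the capping behaviour concrete, the hypothetical toy example below (random data, not the KDD columns) shows the Winsorizer learning the 95th-percentile cap and flagging the capped rows:
import numpy as np
import pandas as pd
from feature_engine.outliers import Winsorizer

# toy, heavily right-skewed column standing in for src_bytes
rng = np.random.default_rng(0)
toy = pd.DataFrame({"src_bytes": rng.exponential(scale=100, size=1_000)})

capper = Winsorizer(capping_method="quantiles", tail="right", fold=0.05,
                    add_indicators=True, variables=["src_bytes"])
capped = capper.fit_transform(toy)

print(capper.right_tail_caps_)           # learned cap: the 95th percentile
print(capped["src_bytes"].max())         # no value exceeds the cap anymore
print(capped["src_bytes_right"].mean())  # ~5% of the rows were capped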
Categorical Features Encoding
Categorical features are encoded in two ways: frequency encoding and one-hot encoding. The latter creates a binary column for each category of each categorical variable, set to 1 when the observation takes that category. To limit the number of columns in the dataset, rare categories (those with a frequency below 1%) are grouped into a single category called Rare by means of the RareLabelEncoder. To compensate for the information lost when grouping rare labels, the categorical features are also frequency encoded, i.e., each category is replaced by its relative frequency. Both encodings are illustrated in the toy example below.
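The following hypothetical toy column (not the real KDD service distribution) shows the frequency encoding and the rare-label grouping followed by one-hot encoding:
import pandas as pd
from feature_engine.encoding import CountFrequencyEncoder, OneHotEncoder, RareLabelEncoder

# toy 'service' column: two frequent categories, one moderately frequent,
# and two categories below the 1% threshold
toy = pd.DataFrame({"service": ["http"] * 700 + ["smtp"] * 250 + ["ftp"] * 40
                               + ["x11"] * 5 + ["imap"] * 5})

# frequency encoding: each category is replaced by its relative frequency
freq = CountFrequencyEncoder(encoding_method="frequency").fit_transform(toy.copy())
print(sorted(freq["service"].unique()))   # [0.005, 0.04, 0.25, 0.7]

# rare-label grouping (< 1%) followed by one-hot encoding
rare = RareLabelEncoder(tol=0.01, n_categories=2).fit_transform(toy.copy())
onehot = OneHotEncoder().fit_transform(rare)
print(onehot.columns.tolist())            # one column per kept category, plus 'service_Rare'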
Dataset after Preprocessing
df = load_df(
file_path="../../../data/kddcup.data_10_percent",
header_file="../../../data/kddcup.names"
)
df = remove_dot_from_attack_type_classes(df)
preprocessed_df = preprocessing_pipeline.fit_transform(df)
preprocessed_df.describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]).T
| | count | mean | std | min | 1% | 5% | 25% | 50% | 75% | 95% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| duration | 494021.0 | 0.000823 | 0.012134 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.001509 | 1.0 |
| protocol_type_freq | 494021.0 | 0.822097 | 0.241090 | 0.0 | 0.000000 | 0.644681 | 0.644681 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| service_freq | 494021.0 | 0.689304 | 0.369005 | 0.0 | 0.000714 | 0.025714 | 0.394074 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| flag_freq | 494021.0 | 0.810397 | 0.344788 | 0.0 | 0.070996 | 0.070996 | 1.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| src_bytes | 494021.0 | 0.578597 | 0.436075 | 0.0 | 0.000000 | 0.000000 | 0.043605 | 0.503876 | 1.00 | 1.00 | 1.000000 | 1.0 |
| dst_bytes | 494021.0 | 0.083433 | 0.245432 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| land | 494021.0 | 0.000045 | 0.006673 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| wrong_fragment | 494021.0 | 0.002144 | 0.044935 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| urgent | 494021.0 | 0.000005 | 0.001837 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| hot | 494021.0 | 0.001151 | 0.026070 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_failed_logins | 494021.0 | 0.000030 | 0.003104 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| logged_in | 494021.0 | 0.148247 | 0.355345 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| num_compromised | 494021.0 | 0.000012 | 0.002034 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| root_shell | 494021.0 | 0.000111 | 0.010551 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| su_attempted | 494021.0 | 0.000018 | 0.003896 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_root | 494021.0 | 0.000011 | 0.002027 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_file_creations | 494021.0 | 0.000039 | 0.003443 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_shells | 494021.0 | 0.000055 | 0.005510 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_access_files | 494021.0 | 0.000126 | 0.004560 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_outbound_cmds | 494021.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 0.0 |
| is_host_login | 494021.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 0.0 |
| is_guest_login | 494021.0 | 0.001387 | 0.037211 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| count | 494021.0 | 0.650266 | 0.417118 | 0.0 | 0.001957 | 0.001957 | 0.228963 | 0.998043 | 1.00 | 1.00 | 1.000000 | 1.0 |
| srv_count | 494021.0 | 0.573203 | 0.482041 | 0.0 | 0.001957 | 0.001957 | 0.019569 | 0.998043 | 1.00 | 1.00 | 1.000000 | 1.0 |
| serror_rate | 494021.0 | 0.176687 | 0.380717 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| srv_serror_rate | 494021.0 | 0.176609 | 0.381017 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| rerror_rate | 494021.0 | 0.057433 | 0.231623 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| srv_rerror_rate | 494021.0 | 0.057719 | 0.232147 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| same_srv_rate | 494021.0 | 0.791547 | 0.388189 | 0.0 | 0.010000 | 0.030000 | 1.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| diff_srv_rate | 494021.0 | 0.020982 | 0.082205 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.07 | 0.208000 | 1.0 |
| srv_diff_host_rate | 494021.0 | 0.028997 | 0.142397 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.14 | 1.000000 | 1.0 |
| dst_host_count | 494021.0 | 0.911650 | 0.253903 | 0.0 | 0.011765 | 0.129412 | 1.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_srv_count | 494021.0 | 0.739865 | 0.415845 | 0.0 | 0.003922 | 0.011765 | 0.180392 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_same_srv_rate | 494021.0 | 0.753780 | 0.410781 | 0.0 | 0.000000 | 0.020000 | 0.410000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_diff_srv_rate | 494021.0 | 0.030906 | 0.109259 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.04 | 0.08 | 0.820000 | 1.0 |
| dst_host_same_src_port_rate | 494021.0 | 0.601935 | 0.481309 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_srv_diff_host_rate | 494021.0 | 0.006684 | 0.042133 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.03 | 0.150000 | 1.0 |
| dst_host_serror_rate | 494021.0 | 0.176754 | 0.380593 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_srv_serror_rate | 494021.0 | 0.176443 | 0.380919 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_rerror_rate | 494021.0 | 0.058118 | 0.230590 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_srv_rerror_rate | 494021.0 | 0.057412 | 0.230140 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| src_bytes_right | 494021.0 | 0.022284 | 0.147607 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| dst_bytes_right | 494021.0 | 0.049988 | 0.217920 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| protocol_type_tcp | 494021.0 | 0.384731 | 0.486532 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| protocol_type_udp | 494021.0 | 0.041201 | 0.198754 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| protocol_type_icmp | 494021.0 | 0.574069 | 0.494484 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| service_http | 494021.0 | 0.130142 | 0.336460 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| service_smtp | 494021.0 | 0.019681 | 0.138903 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| service_Rare | 494021.0 | 0.029578 | 0.169419 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| service_domain_u | 494021.0 | 0.011868 | 0.108292 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| service_ecr_i | 494021.0 | 0.569611 | 0.495131 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| service_other | 494021.0 | 0.014649 | 0.120144 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| service_private | 494021.0 | 0.224470 | 0.417233 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| flag_SF | 494021.0 | 0.766040 | 0.423347 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| flag_Rare | 494021.0 | 0.003439 | 0.058543 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| flag_REJ | 494021.0 | 0.054401 | 0.226807 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| flag_S0 | 494021.0 | 0.176120 | 0.380923 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
After preprocessing we notice that many features are quasi-constant. For example, the feature land is 0 for more than 99% of the records; a quick check is sketched below. In a realistic setting I would seek additional domain knowledge from experts and possibly drop such features, and in a supervised problem with a more balanced dataset I would certainly remove them. However, since the purpose of this exercise is to detect anomalies, which are by definition rare, I think it is a good idea to keep the quasi-constant features for the anomaly detection part. This will probably cause values that deviate from the constant to be flagged as anomalies, which is arguably not wrong.
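One way to quantify this (a hypothetical check, not part of the project pipeline) is to look, for each column, at the share of rows equal to its most frequent value; feature_engine also offers DropConstantFeatures(tol=...) should we later decide to drop these columns.
# share of rows equal to each column's most frequent value
dominant_share = preprocessed_df.apply(
    lambda col: col.value_counts(normalize=True).iloc[0]
)
# columns where a single value covers more than 99% of the rows
quasi_constant = dominant_share[dominant_share > 0.99].sort_values(ascending=False)
print(quasi_constant)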