Feature Engineering
%load_ext autoreload
The input dataset already contains a lot of engineered features. Moreover, I lack the domain expertise to suggest which additional features would make sense to engineer. For this reason, I decided not to engineer new features, but to focus on preprocessing the existing ones. The preprocessing pipeline proposed in this project is the following:
%autoreload
from sklearn import set_config
from intrusion_detection.preprocessing.pipeline import get_preprocessing_pipeline
from intrusion_detection.preprocessing.preprocessing import remove_dot_from_attack_type_classes
from intrusion_detection.load_input_data import load_df
set_config(display="diagram")
preprocessing_pipeline = get_preprocessing_pipeline()
preprocessing_pipeline
Pipeline(steps=[('drop_target', DropFeatures(features_to_drop=['attack_type'])),
                ('outlier_removal',
                 Winsorizer(add_indicators=True, capping_method='quantiles',
                            fold=0.05, variables=['src_bytes', 'dst_bytes'])),
                ('frequency_encoder',
                 KeepInputFeaturesWrapper(rename_suffix='_freq',
                                          wrapped_transformer=CountFrequencyEncoder(encoding_method='frequency'))),
                ('replace_rare_categories',
                 RareLabelEncoder(n_categories=2, tol=0.01)),
                ('one_hot_encoder', OneHotEncoder()),
                ('min_max_scaler',
                 SklearnTransformerWrapper(transformer=MinMaxScaler()))])
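For reference, the sketch below shows how a similar pipeline could be assembled with scikit-learn and feature_engine. It is a simplified, hypothetical re-creation, not the project's get_preprocessing_pipeline: the frequency-encoding step is left out because it relies on the project-specific KeepInputFeaturesWrapper, which keeps the original categorical columns alongside their frequency-encoded copies (suffix _freq).
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from feature_engine.encoding import OneHotEncoder, RareLabelEncoder
from feature_engine.outliers import Winsorizer
from feature_engine.selection import DropFeatures
from feature_engine.wrappers import SklearnTransformerWrapper

# Simplified sketch: the frequency-encoding step (custom wrapper) is omitted.
sketch_pipeline = Pipeline(steps=[
    # remove the label column so only the features are transformed
    ("drop_target", DropFeatures(features_to_drop=["attack_type"])),
    # cap the right tail of the byte counters at the 95th percentile
    # and add *_right indicator columns flagging the capped rows
    ("outlier_removal", Winsorizer(capping_method="quantiles", tail="right",
                                   fold=0.05, add_indicators=True,
                                   variables=["src_bytes", "dst_bytes"])),
    # group categories rarer than 1% into a single 'Rare' label
    ("replace_rare_categories", RareLabelEncoder(tol=0.01, n_categories=2)),
    # one binary column per remaining category of each categorical feature
    ("one_hot_encoder", OneHotEncoder()),
    # scale every column to the [0, 1] range
    ("min_max_scaler", SklearnTransformerWrapper(transformer=MinMaxScaler())),
])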
Numerical Features Preprocessing
After dropping the target column, the first transformation of the pipeline deals with outlier values. As discussed in the exploratory data analysis section, the Winsorizer caps the features src_bytes and dst_bytes at the 95th percentile (quantile capping with fold=0.05) and adds an indicator column (src_bytes_right, dst_bytes_right) that flags the rows whose values were capped. In the last step of the pipeline, all features (numerical and not) are scaled to the [0, 1] range.
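To make the capping behaviour concrete, the hypothetical toy example below (random data, not the KDD columns) shows the Winsorizer learning the 95th-percentile cap and flagging the capped rows:
import numpy as np
import pandas as pd
from feature_engine.outliers import Winsorizer

# toy, heavily right-skewed column standing in for src_bytes
rng = np.random.default_rng(0)
toy = pd.DataFrame({"src_bytes": rng.exponential(scale=100, size=1_000)})

capper = Winsorizer(capping_method="quantiles", tail="right", fold=0.05,
                    add_indicators=True, variables=["src_bytes"])
capped = capper.fit_transform(toy)

print(capper.right_tail_caps_)           # learned cap: the 95th percentile
print(capped["src_bytes"].max())         # no value exceeds the cap anymore
print(capped["src_bytes_right"].mean())  # ~5% of the rows were capped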
Categorical Features Encoding
Categorical features are encoded in two ways: frequency encoding and one-hot encoding. The latter creates a binary column for each category of each categorical variable, set to 1 when the observation takes that category. To limit the number of columns in the dataset, rare categories (those with a frequency below 1%) are grouped into a single category called Rare by means of the RareLabelEncoder. To compensate for the information lost when grouping rare labels, the categorical features are also frequency encoded, i.e., each category is replaced by its relative frequency. Both encodings are illustrated in the toy example below.
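The following hypothetical toy column (not the real KDD service distribution) shows the frequency encoding and the rare-label grouping followed by one-hot encoding:
import pandas as pd
from feature_engine.encoding import CountFrequencyEncoder, OneHotEncoder, RareLabelEncoder

# toy 'service' column: two frequent categories, one moderately frequent,
# and two categories below the 1% threshold
toy = pd.DataFrame({"service": ["http"] * 700 + ["smtp"] * 250 + ["ftp"] * 40
                               + ["x11"] * 5 + ["imap"] * 5})

# frequency encoding: each category is replaced by its relative frequency
freq = CountFrequencyEncoder(encoding_method="frequency").fit_transform(toy.copy())
print(sorted(freq["service"].unique()))   # [0.005, 0.04, 0.25, 0.7]

# rare-label grouping (< 1%) followed by one-hot encoding
rare = RareLabelEncoder(tol=0.01, n_categories=2).fit_transform(toy.copy())
onehot = OneHotEncoder().fit_transform(rare)
print(onehot.columns.tolist())            # one column per kept category, plus 'service_Rare'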
Dataset after Preprocessing
df = load_df(
file_path="../../../data/kddcup.data_10_percent",
header_file="../../../data/kddcup.names"
)
df = remove_dot_from_attack_type_classes(df)
preprocessed_df = preprocessing_pipeline.fit_transform(df)
preprocessed_df.describe(percentiles=[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99]).T
| | count | mean | std | min | 1% | 5% | 25% | 50% | 75% | 95% | 99% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| duration | 494021.0 | 0.000823 | 0.012134 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.001509 | 1.0 |
| protocol_type_freq | 494021.0 | 0.822097 | 0.241090 | 0.0 | 0.000000 | 0.644681 | 0.644681 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| service_freq | 494021.0 | 0.689304 | 0.369005 | 0.0 | 0.000714 | 0.025714 | 0.394074 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| flag_freq | 494021.0 | 0.810397 | 0.344788 | 0.0 | 0.070996 | 0.070996 | 1.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| src_bytes | 494021.0 | 0.578597 | 0.436075 | 0.0 | 0.000000 | 0.000000 | 0.043605 | 0.503876 | 1.00 | 1.00 | 1.000000 | 1.0 |
| dst_bytes | 494021.0 | 0.083433 | 0.245432 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| land | 494021.0 | 0.000045 | 0.006673 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| wrong_fragment | 494021.0 | 0.002144 | 0.044935 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| urgent | 494021.0 | 0.000005 | 0.001837 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| hot | 494021.0 | 0.001151 | 0.026070 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_failed_logins | 494021.0 | 0.000030 | 0.003104 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| logged_in | 494021.0 | 0.148247 | 0.355345 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| num_compromised | 494021.0 | 0.000012 | 0.002034 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| root_shell | 494021.0 | 0.000111 | 0.010551 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| su_attempted | 494021.0 | 0.000018 | 0.003896 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_root | 494021.0 | 0.000011 | 0.002027 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_file_creations | 494021.0 | 0.000039 | 0.003443 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_shells | 494021.0 | 0.000055 | 0.005510 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_access_files | 494021.0 | 0.000126 | 0.004560 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| num_outbound_cmds | 494021.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 0.0 |
| is_host_login | 494021.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 0.0 |
| is_guest_login | 494021.0 | 0.001387 | 0.037211 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| count | 494021.0 | 0.650266 | 0.417118 | 0.0 | 0.001957 | 0.001957 | 0.228963 | 0.998043 | 1.00 | 1.00 | 1.000000 | 1.0 |
| srv_count | 494021.0 | 0.573203 | 0.482041 | 0.0 | 0.001957 | 0.001957 | 0.019569 | 0.998043 | 1.00 | 1.00 | 1.000000 | 1.0 |
| serror_rate | 494021.0 | 0.176687 | 0.380717 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| srv_serror_rate | 494021.0 | 0.176609 | 0.381017 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| rerror_rate | 494021.0 | 0.057433 | 0.231623 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| srv_rerror_rate | 494021.0 | 0.057719 | 0.232147 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| same_srv_rate | 494021.0 | 0.791547 | 0.388189 | 0.0 | 0.010000 | 0.030000 | 1.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| diff_srv_rate | 494021.0 | 0.020982 | 0.082205 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.07 | 0.208000 | 1.0 |
| srv_diff_host_rate | 494021.0 | 0.028997 | 0.142397 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.14 | 1.000000 | 1.0 |
| dst_host_count | 494021.0 | 0.911650 | 0.253903 | 0.0 | 0.011765 | 0.129412 | 1.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_srv_count | 494021.0 | 0.739865 | 0.415845 | 0.0 | 0.003922 | 0.011765 | 0.180392 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_same_srv_rate | 494021.0 | 0.753780 | 0.410781 | 0.0 | 0.000000 | 0.020000 | 0.410000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_diff_srv_rate | 494021.0 | 0.030906 | 0.109259 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.04 | 0.08 | 0.820000 | 1.0 |
| dst_host_same_src_port_rate | 494021.0 | 0.601935 | 0.481309 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_srv_diff_host_rate | 494021.0 | 0.006684 | 0.042133 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.03 | 0.150000 | 1.0 |
| dst_host_serror_rate | 494021.0 | 0.176754 | 0.380593 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_srv_serror_rate | 494021.0 | 0.176443 | 0.380919 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_rerror_rate | 494021.0 | 0.058118 | 0.230590 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| dst_host_srv_rerror_rate | 494021.0 | 0.057412 | 0.230140 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| src_bytes_right | 494021.0 | 0.022284 | 0.147607 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| dst_bytes_right | 494021.0 | 0.049988 | 0.217920 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| protocol_type_tcp | 494021.0 | 0.384731 | 0.486532 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| protocol_type_udp | 494021.0 | 0.041201 | 0.198754 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| protocol_type_icmp | 494021.0 | 0.574069 | 0.494484 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| service_http | 494021.0 | 0.130142 | 0.336460 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| service_smtp | 494021.0 | 0.019681 | 0.138903 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| service_Rare | 494021.0 | 0.029578 | 0.169419 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| service_domain_u | 494021.0 | 0.011868 | 0.108292 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| service_ecr_i | 494021.0 | 0.569611 | 0.495131 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| service_other | 494021.0 | 0.014649 | 0.120144 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 1.000000 | 1.0 |
| service_private | 494021.0 | 0.224470 | 0.417233 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| flag_SF | 494021.0 | 0.766040 | 0.423347 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.00 | 1.00 | 1.000000 | 1.0 |
| flag_Rare | 494021.0 | 0.003439 | 0.058543 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 0.00 | 0.000000 | 1.0 |
| flag_REJ | 494021.0 | 0.054401 | 0.226807 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
| flag_S0 | 494021.0 | 0.176120 | 0.380923 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00 | 1.00 | 1.000000 | 1.0 |
After preprocessing we notice that many features are quasi-constant. For example, the feature land is 0 for more than 99% of the records; a quick check is sketched below. In a realistic setting I would seek additional domain knowledge from experts and possibly drop such features, and in a supervised problem with a more balanced dataset I would certainly remove them. However, since the purpose of this exercise is to detect anomalies, which are by definition rare, I think it is a good idea to keep the quasi-constant features for the anomaly detection part. This will probably cause values that deviate from the constant to be flagged as anomalies, which is arguably not wrong.
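One way to quantify this (a hypothetical check, not part of the project pipeline) is to look, for each column, at the share of rows equal to its most frequent value; feature_engine also offers DropConstantFeatures(tol=...) should we later decide to drop these columns.
# share of rows equal to each column's most frequent value
dominant_share = preprocessed_df.apply(
    lambda col: col.value_counts(normalize=True).iloc[0]
)
# columns where a single value covers more than 99% of the rows
quasi_constant = dominant_share[dominant_share > 0.99].sort_values(ascending=False)
print(quasi_constant)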