Recursive Feature Elimination (RFE)¶
This tutorial shows an example of how recursive feature elimination can be used to improve model performance.
More specifically, it illustrates how to perform backward recursive feature elimination based on permutation importance using the class felimination.rfe.PermutationImportanceRFECV.
# Install felimination
! pip install felimination
Create a dummy Dataset¶
For this tutorial we will use a dummy classification dataset created with sklearn.datasets.make_classification.
The dataset has 6 informative features, 10 redundant features and 184 random features.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=200,
    n_informative=6,
    n_redundant=10,
    n_clusters_per_class=1,
    random_state=42,
    shuffle=False,
)
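As a quick sanity check (with shuffle=False the informative features come first, followed by the redundant ones and then the noise features), the generated matrix should have 1000 rows and 200 columns:

X.shape  # (1000, 200): 6 informative + 10 redundant + 184 noise features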
Evaluate performance without feature elimination¶
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Define a simple logistic regression model
model = LogisticRegression(random_state=42)

# Perform cross-validation
cv_results = cross_validate(
    model,
    X,
    y,
    cv=StratifiedKFold(random_state=42, shuffle=True),
    scoring="roc_auc",
    return_train_score=True,
)
cv_results["test_score"].mean()
0.8561362716271628
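Since return_train_score=True was set, we can also inspect the mean train score for comparison; with 200 mostly uninformative features the model tends to fit the training folds noticeably better than the validation folds (the exact value depends on the run):

# Mean train score across folds, for comparison with the validation score above
cv_results["train_score"].mean()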
Now perform feature elimination¶
from felimination.rfe import PermutationImportanceRFECV
from felimination.callbacks import plot_progress_callback

selector = PermutationImportanceRFECV(
    model,
    step=0.2,
    callbacks=[plot_progress_callback],
    scoring="roc_auc",
    cv=StratifiedKFold(random_state=42, shuffle=True),
)
selector.fit(X, y)
PermutationImportanceRFECV(callbacks=[<function plot_progress_callback at 0x103583d80>], cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True), estimator=LogisticRegression(random_state=42), scoring='roc_auc', step=0.2)
Notice how model performance increases as features are progressively eliminated.
This is because a model trained on many non-predictive features tends to find patterns even in random noise and ends up overfitting; see how the train score and the validation score get closer as features are eliminated.
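The plot produced by plot_progress_callback during fitting shows this progression. If you prefer to build the plot manually, a minimal sketch could look like the following, assuming selector.cv_results_ also exposes a mean_train_score key alongside mean_test_score (as sklearn-style cv_results_ dictionaries usually do):

import matplotlib.pyplot as plt

results = selector.cv_results_
plt.plot(results["n_features"], results["mean_train_score"], marker="o", label="train")
plt.plot(results["n_features"], results["mean_test_score"], marker="o", label="validation")
plt.xlabel("Number of features")
plt.ylabel("ROC AUC")
plt.legend()
plt.show()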
import pandas as pd

cv_results_df = pd.DataFrame(selector.cv_results_)
cv_results_df[["mean_test_score", "n_features"]].sort_values(
    "mean_test_score", ascending=False
).head(10)
|    | mean_test_score | n_features |
|----|-----------------|------------|
| 7  | 0.944138        | 44         |
| 6  | 0.943558        | 54         |
| 8  | 0.943018        | 36         |
| 9  | 0.942478        | 29         |
| 5  | 0.942438        | 67         |
| 4  | 0.942058        | 83         |
| 10 | 0.939718        | 24         |
| 11 | 0.937578        | 20         |
| 12 | 0.935838        | 16         |
| 13 | 0.935698        | 13         |
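The best-scoring iteration can also be retrieved programmatically, for example:

# Row of cv_results_ with the highest mean validation score
cv_results_df.loc[cv_results_df["mean_test_score"].idxmax(), ["n_features", "mean_test_score"]]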
The best AUC score obtained with feature elimination is now about 0.94, an improvement of almost 0.09 AUC points obtained with fewer features.
If I had to choose a number of features, I would probably go for 13, because at that point the validation score is very close to the train score.
We can do this using the method set_n_features_to_select. This will change the support of the selector as well as the behavior of the transform method.
selector.set_n_features_to_select(13)
selector.transform(X).shape
(1000, 13)
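As a rough sanity check (not part of the original notebook output), we could repeat the baseline cross-validation on the reduced feature matrix. Note that this reuses the same data the selector was fitted on, so the resulting estimate is optimistically biased:

X_selected = selector.transform(X)

cv_results_selected = cross_validate(
    model,
    X_selected,
    y,
    cv=StratifiedKFold(random_state=42, shuffle=True),
    scoring="roc_auc",
    return_train_score=True,
)
cv_results_selected["test_score"].mean()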
import numpy as np

# Show the indices of the selected features; indices <= 15 are informative or redundant
np.arange(0, X.shape[1])[selector.support_]
array([ 1, 2, 3, 7, 8, 9, 10, 69, 80, 82, 155, 186, 197])
We can see from the indices of the selected features that most of them are relevant (index <= 15), although some random features are still being selected. Also, some of the selected features are redundant rather than informative.
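To quantify this, we can count how many of the selected indices fall in the informative/redundant block (0-15) versus the noise block; with the support shown above this gives 7 relevant and 6 random features:

selected_idx = np.flatnonzero(selector.support_)
n_relevant = (selected_idx <= 15).sum()  # informative (0-5) or redundant (6-15)
n_random = (selected_idx > 15).sum()     # pure noise features
n_relevant, n_random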