Forward Feature Selection with MRMR¶
This tutorial shows how forward feature selection based on the Minimum Redundancy Maximum Relevance (MRMR) criterion can improve model performance.
More specifically, it illustrates how to perform forward feature selection using the class felimination.mrmr.MRMRCV.
# Install felimination
! pip install felimination
What is MRMR?¶
MRMR is a forward feature selection strategy: it starts from an empty set and greedily adds one feature at a time, choosing at each step the feature that maximises a score combining two quantities:
- Relevance: how much information a candidate feature shares with the target (measured by mutual information by default).
- Redundancy: how much information the candidate feature shares with the features already selected (also measured by mutual information by default).
The combination can be either a difference (relevance - mean_redundancy, MID-style) or a ratio (relevance / mean_redundancy, MIQ-style), controlled by the scheme parameter.
The key advantage over plain relevance-based ranking is that MRMR actively avoids selecting highly correlated features: if two features carry the same information about the target, only the first one brings a real gain — the second one adds high redundancy and will be penalised.
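To make the criterion concrete, here is a minimal greedy MRMR loop written from scratch with scikit-learn's mutual information estimators. This is a sketch of the idea, not felimination's implementation (which, among other differences, aggregates redundancy with max by default rather than the mean used here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

# Small toy dataset: 3 informative, 2 redundant, 3 noise features
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=2, random_state=0, shuffle=False)

# Relevance: MI between each feature and the target
relevance = mutual_info_classif(X, y, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
for _ in range(4):
    best_score, best_j = -np.inf, None
    for j in remaining:
        # Redundancy: mean MI between candidate j and already-selected features
        redundancy = (mutual_info_regression(X[:, selected], X[:, j],
                                             random_state=0).mean()
                      if selected else 0.0)
        score = relevance[j] - redundancy  # "difference" (MID) scheme
        if score > best_score:
            best_score, best_j = score, j
    selected.append(best_j)
    remaining.remove(best_j)

print(selected)  # indices of the 4 greedily selected features
```

The very first pick is simply the most relevant feature (redundancy is zero for an empty selection); every later pick trades relevance against overlap with what is already in the set.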
Create a dummy Dataset¶
For this tutorial we will use a dummy classification dataset created with sklearn.datasets.make_classification.
The dataset has 6 informative features, 10 redundant features (random linear combinations of the informative ones) and 184 random noise features. Since shuffle=False, the columns are laid out in that order: informative first (0-5), then redundant (6-15), then noise (16-199).
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=200,
    n_informative=6,
    n_redundant=10,
    n_clusters_per_class=1,
    random_state=42,
    shuffle=False,
)
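As a quick sanity check on that layout, the redundant columns (6-15) should be noticeably correlated with the informative block (0-5), since they are linear combinations of it:

```python
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=200, n_informative=6,
                           n_redundant=10, n_clusters_per_class=1,
                           random_state=42, shuffle=False)

corr = np.corrcoef(X, rowvar=False)
# Strongest absolute correlation of each redundant column (6-15)
# with any single informative column (0-5)
print(np.abs(corr[6:16, :6]).max(axis=1).round(2))
print(X.shape)
```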
Evaluate performances without feature selection¶
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.linear_model import LogisticRegression

# Define a simple logistic regression model
model = LogisticRegression(random_state=42)

# Perform cross-validation
cv_results = cross_validate(
    model,
    X,
    y,
    cv=StratifiedKFold(random_state=42, shuffle=True),
    scoring="roc_auc",
    return_train_score=True,
)
cv_results["test_score"].mean()
np.float64(0.8561362716271628)
Perform Forward Feature Selection with MRMR¶
MRMRCV wraps ForwardSelectorCV and wires it with the MRMRRanker importance getter. At each step of the forward selection loop it:
- Scores every candidate feature using the MRMR criterion.
- Adds the highest-scoring feature to the selected set.
- Evaluates the model (via cross-validation) at checkpoints controlled by the step parameter.
After fitting, best_iteration_selection_criteria is used to pick the number of features that achieved the best cross-validation score.
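That last step amounts to picking the evaluated checkpoint with the highest value of the chosen criterion. A minimal sketch with made-up checkpoint scores (the real cv_results_ attribute has more keys):

```python
import pandas as pd

# Hypothetical checkpoint results (illustrative values only)
checkpoints = pd.DataFrame({
    "n_features": [1, 2, 4, 8, 16],
    "mean_test_score": [0.70, 0.81, 0.88, 0.86, 0.85],
})

# Pick the row where the selection criterion peaks
best_row = checkpoints.loc[checkpoints["mean_test_score"].idxmax()]
print(int(best_row["n_features"]))  # → 4
```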
from felimination.mrmr import MRMRCV
from felimination.callbacks import plot_progress_callback
selector = MRMRCV(
    model,
    step=0.2,
    max_features_to_select=50,
    callbacks=[plot_progress_callback],
    scoring="roc_auc",
    cv=StratifiedKFold(random_state=42, shuffle=True),
    best_iteration_selection_criteria="mean_test_score",
    random_state=42,
    min_relevance=0.05,
)
selector.fit(X, y)
MRMRCV(callbacks=[<function plot_progress_callback at 0x11aa50720>],
       cv=StratifiedKFold(n_splits=5, random_state=42, shuffle=True),
       estimator=LogisticRegression(random_state=42), max_features_to_select=50,
       random_state=42, scoring='roc_auc', step=0.2)
Parameters

| parameter | value |
|---|---|
| estimator | LogisticRegression(random_state=42) |
| step | 0.2 |
| min_features_to_select | None |
| max_features_to_select | 50 |
| cv | StratifiedKFold(n_splits=5, random_state=42, shuffle=True) |
| scoring | 'roc_auc' |
| verbose | 0 |
| n_jobs | None |
| random_state | 42 |
| scheme | 'difference' |
| n_neighbors | 3 |
| discrete_features | 'auto' |
| relevance_func | None |
| redundance_func | None |
| redundancy_aggregation | 'max' |
| min_relevance | 0.05 |
| max_redundancy | None |
| callbacks | [<function plot_progress_callback at 0x11aa50720>] |
| best_iteration_selection_criteria | 'mean_test_score' |
selector.support_
array([False, False, False, False, True, True, False, True, True,
False, False, False, False, True, True, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, True, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False])
selector.ranking_
array([50, 51, 10, 14, 6, 3, 11, 7, 1, 13, 12, 51, 9, 4, 5, 8, 51,
51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51, 51, 51, 51, 51, 51, 2, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51,
51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51, 51])
Notice how model performance increases as more informative features are added, then plateaus or drops once only noisy features remain to be selected.
Because MRMR penalises redundant features, the selector tends to pick a diverse set: once a relevant feature is in the selection, correlated copies of it score poorly and are deprioritised in favour of genuinely new signal.
import pandas as pd

cv_results_df = pd.DataFrame(selector.cv_results_)
cv_results_df[["mean_test_score", "n_features"]].sort_values(
    "mean_test_score", ascending=False
).head(10)
| | mean_test_score | n_features |
|---|---|---|
| 6 | 0.927898 | 7 |
| 5 | 0.927798 | 6 |
| 7 | 0.927718 | 8 |
| 10 | 0.927558 | 12 |
| 12 | 0.927478 | 16 |
| 18 | 0.927478 | 44 |
| 17 | 0.927478 | 37 |
| 16 | 0.927478 | 31 |
| 15 | 0.927478 | 26 |
| 14 | 0.927478 | 22 |
The best AUC score obtained with MRMR forward selection (about 0.928) is a clear improvement over the 0.856 baseline.
We can also visualise the full CV curve with the built-in plot method, which highlights the best iteration:
selector.plot()
<Axes: title={'center': 'Forward Feature Selection Plot\nBest Number of Features: 7\nBest Test Score: 0.928\nBest Train Score: 0.931'}, xlabel='n_features', ylabel='score'>
Looking at the curve, we can decide to pick a slightly smaller subset if the scores are comparable — a smaller model is simpler and generalises better.
We can do this using the method set_n_features_to_select. This will change the support of the selector as well as the behaviour of the transform method. Note that only feature counts that were actually evaluated during the forward selection loop (the n_features column of cv_results_) are valid choices.
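For intuition, the effect on the support mask can be sketched directly from the ranking: a feature selected at step r gets rank r, so shrinking the selection to k features keeps exactly the ranks up to k. This is an illustration on a toy ranking array, an assumption about the internals rather than felimination code:

```python
import numpy as np

# Toy ranking: 1 = picked first, larger values = picked later or never
ranking = np.array([3, 1, 6, 2, 5, 4])
k = 3

support = ranking <= k  # mask of the first k selected features
print(support.nonzero()[0])  # → indices 0, 1 and 3
```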
import numpy as np
# Show the index of the selected features — indices 0-5 are informative, 6-15 are redundant, >15 is noise
np.arange(0, X.shape[1])[selector.support_]
array([ 4, 5, 7, 8, 13, 14, 75])
# Top 5 features by MRMR rank (indices 0-5 are informative, 6-15 redundant, >15 noise)
np.arange(0, X.shape[1])[np.argsort(selector.ranking_)[:5]]
array([ 8, 75, 5, 13, 14])
We can see that most of the selected features have a low index (informative or redundant region), confirming that MRMR successfully identified the signal-carrying features while ignoring the bulk of random noise.
Because MRMR explicitly penalises redundancy, it also avoids selecting many highly correlated copies from the redundant block — a behaviour that pure relevance-based ranking would not guarantee.
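That failure mode of relevance-only ranking is easy to reproduce from scratch (a sketch, not felimination code): two near-duplicate features both score high on relevance, so a pure relevance ranking puts both at the top, whereas MRMR would penalise the second copy for its redundancy with the first.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
y = (signal > 0).astype(int)
X = np.column_stack([
    signal + rng.normal(scale=0.1, size=1000),  # informative feature
    signal + rng.normal(scale=0.1, size=1000),  # near-duplicate of column 0
    rng.normal(size=1000),                      # pure noise
])

rel = mutual_info_classif(X, y, random_state=0)
# Relevance-only ranking: both correlated copies outrank the noise column,
# even though the second copy adds almost no new information about y
print(np.argsort(rel)[::-1])
```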