MRMR

Module with tools to perform forward feature selection using the Minimum Redundancy Maximum Relevance (MRMR) framework.

This module contains:

  • MRMRRanker: stateful importance-getter callable implementing the MRMR score (mutual information relevance and redundance by default, with optional absolute Pearson correlation via abs_pearson_correlation), suitable for use with ForwardSelectorCV.
  • MRMRCV: preset of ForwardSelectorCV wired with MRMRRanker.

MRMRCV(estimator, *, step=1, min_features_to_select=None, max_features_to_select=None, cv=None, scoring=None, verbose=0, n_jobs=None, random_state=None, scheme='difference', n_neighbors=3, discrete_features='auto', relevance_func=None, redundance_func=None, redundancy_aggregation='max', min_relevance_perc=0.01, max_redundancy=None, discrete_imputer=None, continuous_imputer=None, max_samples=None, callbacks=None, best_iteration_selection_criteria='mean_test_score')

Bases: ForwardSelectorCV

Forward feature selector using Minimum Redundancy Maximum Relevance (MRMR) scoring.

Performs forward feature selection driven by MRMR scores, using cross-validation to determine the optimal number of features.

The selector starts by ranking all features by their relevance to the target and picks the highest-scoring one. It then iteratively selects the feature that maximises relevance minus (or divided by) redundance with already-selected features. Cross-validation scores the model at each evaluated step.

By default both relevance (feature-vs-target) and redundance (feature-vs-already-selected-feature) are computed with mutual information, which handles continuous and categorical features transparently when discrete_features is supplied. Both functions can be swapped out via relevance_func / redundance_func.
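The greedy loop described above can be sketched in a few lines of NumPy. This is an illustration only: it uses absolute Pearson correlation for both relevance and redundance, the 'difference' scheme with 'max' aggregation, and omits the cross-validation scoring that the real selector performs at each step.

```python
import numpy as np

def greedy_mrmr(X, y, k):
    """Toy forward MRMR: pick k features by relevance minus max redundance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    def abs_corr(a, b):
        return abs(np.corrcoef(a, b)[0, 1])

    n_features = X.shape[1]
    relevance = np.array([abs_corr(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]  # start from the most relevant
    while len(selected) < k:
        scores = np.full(n_features, -np.inf)
        for j in range(n_features):
            if j in selected:
                continue
            # 'max' aggregation over already-selected features
            redundance = max(abs_corr(X[:, j], X[:, s]) for s in selected)
            scores[j] = relevance[j] - redundance  # 'difference' scheme
        selected.append(int(np.argmax(scores)))
    return selected

# Toy data: feature 1 duplicates feature 0; feature 2 carries extra signal.
rng = np.random.default_rng(0)
z, w = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack([z, z + 0.01 * rng.normal(size=200), w])
y = z + 0.5 * w
selected = greedy_mrmr(X, y, 2)
# The near-duplicate of the first pick is skipped in favour of feature 2.
```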

Parameters:

  • estimator (``Estimator`` instance) –

    A supervised learning estimator with a fit method, used to score candidate feature subsets via cross-validation. Also used to detect classification vs. regression for the MRMR ranker (via is_classifier).

  • step (int or float, default: 1 ) –

    Number of features added between two consecutive cross-validation evaluations. If greater than or equal to 1, this is the integer number of features added per evaluation. If within (0.0, 1.0), it is the fraction (rounded down, with a floor of 1) of the already-selected features added per evaluation, growing the selection geometrically. Selection within a step still happens one feature at a time.

  • min_features_to_select (int, default: None ) –

    Minimum number of features that must be selected before the first cross-validation evaluation. Features are still selected via MRMR scoring before this threshold, but no CV scoring takes place. If None, defaults to 1 (CV evaluation starts from the very first selected feature).

  • max_features_to_select (int, default: None ) –

    Maximum number of features to select. The forward process stops once this many features have been selected. If None, defaults to all features in X.

  • cv (int, cross-validation generator or an iterable, default: None ) –

    Determines the cross-validation splitting strategy. See sklearn.model_selection.check_cv for accepted inputs.

  • scoring ((str, callable or None), default: None ) –

    Scorer used to evaluate the estimator on each CV fold.

  • verbose (int, default: 0 ) –

    Controls verbosity of output.

  • n_jobs (int or None, default: None ) –

    Number of cores to run in parallel while fitting across folds. Also forwarded to the default mutual information estimators used for MRMR scoring.

  • random_state (int, RandomState instance or None, default: None ) –

    Seed used by the default mutual information estimators and by plot.

  • scheme ((ratio, difference), default: 'difference' ) –

    How to combine relevance and redundance:

    • 'ratio': relevance / redundance (MIQ-style).
    • 'difference': relevance - redundance (MID-style).
  • n_neighbors (int, default: 3 ) –

    Number of neighbors used by the default mutual information estimators. Ignored when both relevance_func and redundance_func are overridden.

  • discrete_features (('auto', bool or array-like), default: 'auto' ) –

    Indicates which input features are categorical. Accepted formats match sklearn.feature_selection.mutual_info_classif:

    • 'auto': infer from dtype when X is a pandas.DataFrame — columns with categorical, string, or object dtype are treated as discrete; all others as continuous. Falls back to all-continuous for plain arrays.
    • True: treat all features as discrete.
    • boolean mask of shape (n_features,).
    • integer array of indices of the discrete features.

    Used by the default relevance and redundance functions, both to tell the mutual information estimator which inputs are categorical and to decide whether to use the classifier or regressor estimator when a categorical feature is the target of a redundance computation. Ignored when both relevance_func and redundance_func are overridden.

  • relevance_func (callable, default: None ) –

    Optional override for the relevance computation. Signature: relevance_func(X, y) -> ndarray of shape (n_features,), scoring each feature against the target. When None (default), mutual information is used (handles categorical features via discrete_features). Use abs_pearson_correlation for a fast Pearson-based alternative on purely numeric data.

  • redundance_func (callable, default: None ) –

    Optional override for the redundance computation. Signature: redundance_func(X, y_feature) -> ndarray of shape (n_features,), scoring each feature against the already-selected y_feature. When None (default), mutual information is used (the classifier vs. regressor estimator is chosen based on whether the target column is marked as categorical in discrete_features).

  • redundancy_aggregation ((max, mean or callable), default: 'max' ) –

    How to aggregate per-selected-feature redundancy scores into a single redundancy value before combining with relevance:

    • 'max': take the element-wise maximum across all already-selected features. A candidate is penalised as soon as it is highly redundant with any selected feature, making the criterion more conservative.
    • 'mean': take the element-wise mean, matching the formulation in the original MRMR paper (Peng et al., 2005).
    • callable: a function with signature f(redundancy_matrix) -> ndarray of shape (n_features,), where redundancy_matrix has shape (n_selected, n_features). Rows correspond to already-selected features; columns to candidate features.

    Note: The default 'max' deviates from the original MRMR paper, which uses the mean. 'max' is chosen as the default because it more aggressively avoids adding features that duplicate information already captured, which tends to work better in practice for forward selection with CV scoring.

  • min_relevance_perc (float or None, default: 0.01 ) –

    If set, features are filtered based on cumulative relevance. After computing relevance scores, a minimum relevance threshold is derived as min_relevance_perc * sum(relevance scores). Features are then ordered by relevance ascending and their cumulative relevance is computed; any feature whose cumulative relevance (from the least relevant up to and including itself) is strictly below the threshold is assigned -inf and will never be selected. This removes the low-relevance tail that together contributes less than min_relevance_perc of the total relevance.

  • max_redundancy (float or None, default: None ) –

    If set, features whose aggregated redundancy with the already-selected features exceeds this threshold are assigned -inf and will not be selected in that round. The aggregation is controlled by redundancy_aggregation. Only applied when at least one feature has already been selected.

  • discrete_imputer (sklearn-compatible transformer or None, default: None ) –

    Forwarded to MRMRRanker. Imputer for discrete (categorical) columns. When None, defaults to SimpleImputer(strategy='constant', fill_value='MISSING').

  • continuous_imputer (sklearn-compatible transformer or None, default: None ) –

    Forwarded to MRMRRanker. Imputer for continuous (numeric) columns. When None, defaults to SimpleImputer(strategy='median').

  • max_samples ((int, float or None), default: None ) –

    Forwarded to MRMRRanker. Number of samples used when computing mutual information scores. None means all samples. See MRMRRanker for the full description.

  • callbacks (list of callable, default: None ) –

    List of callables called at the end of each evaluated step. Each callable receives (selector, scores) where scores is the last array of MRMR scores.

  • best_iteration_selection_criteria (str or callable, default: 'mean_test_score' ) –

    Either a key into cv_results_ (the iteration that maximises that key is picked) or a callable f(cv_results) -> n_features that must return one of the values in cv_results_["n_features"].
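As an illustration of the callable form, here is a hypothetical parsimony rule that prefers the smallest evaluated feature count whose score is within a small tolerance of the best. The function name and the 0.01 tolerance are invented for this sketch; the library only requires the f(cv_results) -> n_features signature described above.

```python
import numpy as np

def smallest_within_tolerance(cv_results):
    """Pick the smallest n_features whose mean test score is near the best."""
    scores = np.asarray(cv_results["mean_test_score"])
    n_features = np.asarray(cv_results["n_features"])
    good = scores >= scores.max() - 0.01  # within 0.01 of the best score
    return int(n_features[good].min())

# Made-up cv_results: 2 features already score within 0.01 of the best.
cv_results = {"mean_test_score": [0.80, 0.848, 0.85, 0.852],
              "n_features": [1, 2, 3, 4]}
best_n = smallest_within_tolerance(cv_results)
```

Such a callable would be passed as best_iteration_selection_criteria=smallest_within_tolerance; note that the returned value must be one of the entries in cv_results_["n_features"].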

Examples:

>>> from felimination.mrmr import MRMRCV
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = make_classification(n_samples=200, n_features=10, random_state=0)
>>> selector = MRMRCV(
...     LogisticRegression(),
...     min_features_to_select=2,
...     max_features_to_select=8,
...     step=1,
...     cv=3,
...     random_state=0,
... ).fit(X, y)
>>> selector.support_.sum() > 0
True
Source code in felimination/mrmr.py
def __init__(
    self,
    estimator,
    *,
    step=1,
    min_features_to_select=None,
    max_features_to_select=None,
    cv=None,
    scoring=None,
    verbose=0,
    n_jobs=None,
    random_state=None,
    scheme="difference",
    n_neighbors=3,
    discrete_features="auto",
    relevance_func=None,
    redundance_func=None,
    redundancy_aggregation="max",
    min_relevance_perc=0.01,
    max_redundancy=None,
    discrete_imputer=None,
    continuous_imputer=None,
    max_samples=None,
    callbacks=None,
    best_iteration_selection_criteria="mean_test_score",
) -> None:
    self.scheme = scheme
    self.n_neighbors = n_neighbors
    self.discrete_features = discrete_features
    self.relevance_func = relevance_func
    self.redundance_func = redundance_func
    self.redundancy_aggregation = redundancy_aggregation
    self.min_relevance_perc = min_relevance_perc
    self.max_redundancy = max_redundancy
    self.discrete_imputer = discrete_imputer
    self.continuous_imputer = continuous_imputer
    self.max_samples = max_samples
    super().__init__(
        estimator,
        step=step,
        min_features_to_select=min_features_to_select,
        max_features_to_select=max_features_to_select,
        cv=cv,
        scoring=scoring,
        verbose=verbose,
        n_jobs=n_jobs,
        random_state=random_state,
        importance_getter=MRMRRanker(
            regression=not is_classifier(estimator),
            scheme=scheme,
            n_neighbors=n_neighbors,
            discrete_features=discrete_features,
            random_state=random_state,
            n_jobs=n_jobs,
            relevance_func=relevance_func,
            redundance_func=redundance_func,
            redundancy_aggregation=redundancy_aggregation,
            min_relevance_perc=min_relevance_perc,
            max_redundancy=max_redundancy,
            discrete_imputer=discrete_imputer,
            continuous_imputer=continuous_imputer,
            max_samples=max_samples,
        ),
        callbacks=callbacks,
        best_iteration_selection_criteria=best_iteration_selection_criteria,
    )

plot(**kwargs)

Plot the cross-validation curve over number of features.

Parameters:

  • **kwargs (dict, default: {} ) –

    Forwarded to seaborn.lineplot.

Returns:

  • Axes
Source code in felimination/forward.py
def plot(self, **kwargs):
    """Plot the cross-validation curve over number of features.

    Parameters
    ----------
    **kwargs : dict
        Forwarded to `seaborn.lineplot`.

    Returns
    -------
    matplotlib.axes.Axes
    """
    check_is_fitted(self)
    best_n = self.select_best_iteration(self.cv_results_)
    best_index = self.cv_results_["n_features"].index(best_n)
    best_train_score = self.cv_results_["mean_train_score"][best_index]
    best_test_score = self.cv_results_["mean_test_score"][best_index]
    df = pd.DataFrame(self.cv_results_)
    split_score_cols = [c for c in df if "split" in c]
    df_long = df[split_score_cols + ["n_features"]].melt(
        id_vars=["n_features"],
        value_vars=split_score_cols,
        var_name="split",
        value_name="score",
    )
    df_long["set"] = np.where(
        df_long["split"].str.contains("train"), "train", "validation"
    )
    lineplot_kwargs = dict(
        x="n_features",
        y="score",
        hue="set",
        markers=True,
        style="set",
        hue_order=["validation", "train"],
        style_order=["validation", "train"],
        seed=self.random_state,
        zorder=0,
    )
    lineplot_kwargs.update(**kwargs)
    ax = sns.lineplot(data=df_long, **lineplot_kwargs)
    ax.set_xticks(df.n_features)
    ax.plot(
        best_n,
        best_test_score,
        color="red",
        label="Best Iteration",
        zorder=1,
        marker="*",
        markersize=10,
        markeredgewidth=2,
        markeredgecolor="red",
        fillstyle="none",
    )
    ax.legend()
    ax.set_title(
        "\n".join(
            (
                "Forward Feature Selection Plot",
                f"Best Number of Features: {best_n}",
                f"Best Test Score: {best_test_score:.3f}",
                f"Best Train Score: {best_train_score:.3f}",
            )
        )
    )
    return ax

select_best_iteration(cv_results)

Return the best n_features value given cv_results_.

Source code in felimination/forward.py
def select_best_iteration(self, cv_results):
    """Return the best `n_features` value given ``cv_results_``."""
    if callable(self.best_iteration_selection_criteria):
        return self.best_iteration_selection_criteria(cv_results)
    return cv_results["n_features"][
        int(np.argmax(cv_results[self.best_iteration_selection_criteria]))
    ]

set_n_features_to_select(n_features_to_select)

Change the number of selected features after fitting.

The underlying estimator is not retrained — predict / predict_proba keep using the model fit on the originally selected features. Only support_, transform and get_feature_names_out are affected.

Parameters:

  • n_features_to_select (int) –

    Must be one of the values in cv_results_["n_features"].

Source code in felimination/forward.py
def set_n_features_to_select(self, n_features_to_select):
    """Change the number of selected features after fitting.

    The underlying estimator is **not** retrained — `predict` /
    `predict_proba` keep using the model fit on the originally
    selected features. Only `support_`, `transform` and
    `get_feature_names_out` are affected.

    Parameters
    ----------
    n_features_to_select : int
        Must be one of the values in ``cv_results_["n_features"]``.
    """
    check_is_fitted(self)
    if n_features_to_select not in self.cv_results_["n_features"]:
        raise ValueError(
            f"This selector has not been evaluated with "
            f"{n_features_to_select} features. Pick one of "
            f"{sorted(set(self.cv_results_['n_features']))}."
        )
    support_ = np.zeros_like(self.support_, dtype=bool)
    support_[np.argsort(self.ranking_)[:n_features_to_select]] = True
    self.support_ = support_
    self.n_features_ = n_features_to_select
    return self
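The support_ update in the source above can be replayed on made-up arrays to see its semantics. The ranking_ values here are invented (rank 1 means selected first):

```python
import numpy as np

ranking_ = np.array([3, 1, 4, 2, 5])  # invented ranking over 5 features
n_features_to_select = 2

# Same logic as set_n_features_to_select: keep the top-ranked k features.
support_ = np.zeros_like(ranking_, dtype=bool)
support_[np.argsort(ranking_)[:n_features_to_select]] = True
# Features ranked 1 and 2 (indices 1 and 3) are kept.
```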

MRMRRanker(regression=False, scheme='difference', n_neighbors=3, discrete_features='auto', random_state=None, n_jobs=None, relevance_func=None, redundance_func=None, redundancy_aggregation='max', min_relevance_perc=0.01, max_redundancy=None, discrete_imputer=None, continuous_imputer=None, max_samples=None)

Importance getter implementing the Minimum Redundancy Maximum Relevance score.

By default both relevance (feature-vs-target) and redundance (feature-vs-already-selected-feature) are computed with mutual information, which handles continuous and categorical features transparently when discrete_features is supplied. Both functions can be swapped out via relevance_func / redundance_func.

The ranker is lazy and stateful. On every call it computes a lightweight fingerprint of (X, y) (shape, dtype, boundary values). If the fingerprint matches the previous call, all cached state — relevance and per-feature redundance vectors — is reused. If it differs (different dataset or different CV fold), the caches are reset automatically before re-initialising. This means the same instance can be reused across successive fit calls efficiently.

Redundance is stored per feature in _redundance_cache: the redundance vector for a given feature is computed at most once per dataset — if the same feature appears in selected_idx of a later call, the cached vector is reused directly.

The ranker auto-initialises on its first call.
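The caching behaviour can be pictured with a toy fingerprint. The tuple fields used here (shape, dtype, first/last values) are an assumption based on the description above, not the library's exact implementation:

```python
import numpy as np

def toy_fingerprint(X, y):
    """Hypothetical sketch: cheap to compute, changes when the fold changes."""
    X, y = np.asarray(X), np.asarray(y)
    return (X.shape, str(X.dtype), X.flat[0], X.flat[-1], y.flat[0], y.flat[-1])

X = np.arange(6.0).reshape(3, 2)
y = np.array([0.0, 1.0, 0.0])
fp1 = toy_fingerprint(X, y)

X2 = X.copy()
X2[-1, -1] = 99.0            # e.g. data from a different CV fold
fp2 = toy_fingerprint(X2, y)
# fp1 != fp2, so cached relevance/redundance vectors would be reset.
```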

Parameters:

  • regression (bool, default: False ) –

    Whether the target is continuous. Switches the default relevance between mutual_info_regression and mutual_info_classif. Ignored when relevance_func is set.

  • scheme ((ratio, difference), default: 'difference' ) –

    How to combine relevance and redundance:

    • 'ratio': relevance / redundance (MIQ-style).
    • 'difference': relevance - redundance (MID-style).
  • n_neighbors (int, default: 3 ) –

    Number of neighbors used by the default mutual information estimators. Ignored when both functions are overridden.

  • discrete_features ('auto', bool, or array-like, default: 'auto' ) –

    Indicates which input features are categorical. Accepted formats match sklearn.feature_selection.mutual_info_classif:

    • 'auto': infer from dtype when X is a pandas.DataFrame — columns with categorical, string, or object dtype are treated as discrete; all others as continuous. Falls back to all-continuous for plain arrays.
    • True: treat all features as discrete.
    • boolean mask of shape (n_features,).
    • integer array of indices of the discrete features.

    Used by the default relevance and redundance functions, both to tell the mutual information estimator which inputs are categorical and to decide whether to use the classifier or regressor estimator when a categorical feature is the target of a redundance computation. Ignored when both relevance_func and redundance_func are overridden.

  • random_state (int, RandomState instance or None, default: None ) –

    Seed used by the default mutual information estimators.

  • n_jobs (int or None, default: None ) –

    Forwarded to the default mutual information estimators.

  • relevance_func (callable, default: None ) –

    Optional override for the relevance computation. Signature: relevance_func(X, y) -> ndarray of shape (n_features,), scoring each feature against the target. When None (default), mutual information is used (handles categorical features via discrete_features). Use abs_pearson_correlation for a fast Pearson-based alternative on purely numeric data.

  • redundance_func (callable, default: None ) –

    Optional override for the redundance computation. Signature: redundance_func(X, y_feature) -> ndarray of shape (n_features,), scoring each feature against the already-selected y_feature. When None (default), mutual information is used (the classifier vs. regressor estimator is chosen based on whether the target column is marked as categorical in discrete_features).

  • redundancy_aggregation ((max, mean or callable), default: 'max' ) –

    How to aggregate per-selected-feature redundancy scores into a single redundancy value before combining with relevance:

    • 'max': take the element-wise maximum across all already-selected features. A candidate is penalised as soon as it is highly redundant with any selected feature, making the criterion more conservative.
    • 'mean': take the element-wise mean, matching the formulation in the original MRMR paper (Peng et al., 2005).
    • callable: a function with signature f(redundancy_matrix) -> ndarray of shape (n_features,), where redundancy_matrix has shape (n_selected, n_features). Rows correspond to already-selected features; columns to candidate features.

    Note: The default 'max' deviates from the original MRMR paper, which uses the mean. 'max' is chosen as the default because it more aggressively avoids adding features that duplicate information already captured, which tends to work better in practice for forward selection with CV scoring.

  • min_relevance_perc (float or None, default: 0.01 ) –

    If set, features are filtered based on cumulative relevance. After computing relevance scores, a minimum relevance threshold is derived as min_relevance_perc * sum(relevance scores). Features are then ordered by relevance ascending and their cumulative relevance is computed; any feature whose cumulative relevance (from the least relevant up to and including itself) is strictly below the threshold is assigned -inf and will never be selected. This removes the low-relevance tail that together contributes less than min_relevance_perc of the total relevance.

  • max_redundancy (float or None, default: None ) –

    If set, features whose aggregated redundancy with the already-selected features exceeds this threshold are assigned -inf and will not be selected in that round. The aggregation is controlled by redundancy_aggregation. Only applied when at least one feature has already been selected.

  • discrete_imputer (sklearn-compatible transformer or None, default: None ) –

    Imputer applied to discrete (categorical) feature columns before encoding. When None, defaults to SimpleImputer(strategy='constant', fill_value='MISSING'), replacing missing values with the string 'MISSING' (treated as an additional category). Pass any sklearn-compatible transformer with fit/transform. Ignored when there are no discrete columns.

  • continuous_imputer (sklearn-compatible transformer or None, default: None ) –

    Imputer applied to continuous (numeric) feature columns before the mutual information computation. When None, defaults to SimpleImputer(strategy='median'). Pass any sklearn-compatible transformer with fit/transform. Ignored when there are no continuous columns. For arrays with non-object dtype, applied to all columns regardless of discrete_features.

  • max_samples ((int, float or None), default: None ) –

    Number of samples used when computing mutual information scores. Imputers are still fitted on the full training set; only the MI scoring (relevance on the first call and redundance on subsequent calls) uses the subsample.

    • None: use all samples (no subsampling).
    • int: use exactly this many samples (capped at n_samples).
    • float in (0.0, 1.0]: use this fraction of the training set (at least 1 sample).

    The same row indices are drawn once per forward-selection run (controlled by random_state) and reused for every subsequent redundance computation, keeping relevance and redundance comparable.
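The cumulative-relevance filter described under min_relevance_perc can be traced on made-up scores (the relevance values below are invented for illustration):

```python
import numpy as np

relevance = np.array([5.0, 3.0, 0.02, 0.03, 2.0])  # invented scores
threshold = 0.01 * relevance.sum()                  # 0.1005

order = np.argsort(relevance)              # least relevant first: [2, 3, 4, 1, 0]
cumulative = np.cumsum(relevance[order])   # [0.02, 0.05, 2.05, 5.05, 10.05]
dropped = order[cumulative < threshold]    # the low-relevance tail

scores = relevance.copy()
scores[dropped] = -np.inf                  # these features are never selected
```

Here features 2 and 3 together contribute 0.05, which is below the 0.1005 threshold, so both are masked; feature 4 pushes the cumulative sum past the threshold and survives.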

Attributes:

  • relevance_ (ndarray of shape (n_features,)) –

    Per-feature relevance, populated on the first call.

Source code in felimination/mrmr.py
def __init__(
    self,
    regression=False,
    scheme="difference",
    n_neighbors=3,
    discrete_features="auto",
    random_state=None,
    n_jobs=None,
    relevance_func=None,
    redundance_func=None,
    redundancy_aggregation="max",
    min_relevance_perc=0.01,
    max_redundancy=None,
    discrete_imputer=None,
    continuous_imputer=None,
    max_samples=None,
):
    if scheme not in ("ratio", "difference"):
        raise ValueError(f"scheme must be 'ratio' or 'difference', got {scheme!r}")
    if not callable(redundancy_aggregation) and redundancy_aggregation not in (
        "max",
        "mean",
    ):
        raise ValueError(
            f"redundancy_aggregation must be 'max', 'mean', or a callable, "
            f"got {redundancy_aggregation!r}"
        )
    self.regression = regression
    self.scheme = scheme
    self.n_neighbors = n_neighbors
    self.discrete_features = discrete_features
    self.random_state = random_state
    self.n_jobs = n_jobs
    self.relevance_func = relevance_func
    self.redundance_func = redundance_func
    self.redundancy_aggregation = redundancy_aggregation
    self.min_relevance_perc = min_relevance_perc
    self.max_redundancy = max_redundancy
    self.discrete_imputer = discrete_imputer
    self.continuous_imputer = continuous_imputer
    self.max_samples = max_samples
    self._reset()

abs_pearson_correlation(X, y)

Absolute Pearson correlation between each column of X and y.

Convenience helper for use as relevance_func or redundance_func in MRMRRanker. Only suitable for numeric data; use mutual-information based scoring (the default) when categorical features are present.

Parameters:

  • X (array-like of shape (n_samples, n_features)) –
  • y (array-like of shape (n_samples,)) –

Returns:

  • ndarray of shape (n_features,)

    Absolute Pearson correlation per feature.

Source code in felimination/mrmr.py
def abs_pearson_correlation(X, y):
    """Absolute Pearson correlation between each column of ``X`` and ``y``.

    Convenience helper for use as ``relevance_func`` or
    ``redundance_func`` in ``MRMRRanker``. Only suitable for numeric data;
    use mutual-information based scoring (the default) when categorical
    features are present.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
    y : array-like of shape (n_samples,)

    Returns
    -------
    ndarray of shape (n_features,)
        Absolute Pearson correlation per feature.
    """
    X_arr = np.asarray(_as_dense_array(X), dtype=float)
    y_arr = np.asarray(y, dtype=float).ravel()
    n = X_arr.shape[0]
    y_centered = y_arr - y_arr.mean()
    X_centered = X_arr - X_arr.mean(axis=0)
    cov = X_centered.T @ y_centered / max(n - 1, 1)
    y_std = y_arr.std(ddof=1)
    X_std = X_arr.std(axis=0, ddof=1)
    denom = X_std * y_std
    denom = np.where(np.abs(denom) < 1e-12, 1e-12, denom)
    return np.abs(cov / denom)
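Its output matches the per-column absolute value of np.corrcoef. A quick standalone check on synthetic data (independent of the helper above, so it runs without felimination installed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Absolute Pearson correlation of each column of X with y, via np.corrcoef.
abs_corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
# Column 0 is near 1.0; the independent columns are near 0.
```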