Forward selection
Module with tools to perform forward feature selection with cross-validation.
This module contains:
ForwardSelectorCV: forward feature selector driven by a pluggable per-step importance getter, with cross-validation to choose the best number of features.
ForwardSelectorCV(estimator, *, step=1, min_features_to_select=None, max_features_to_select=None, cv=None, scoring=None, verbose=0, n_jobs=None, random_state=None, importance_getter='auto', callbacks=None, best_iteration_selection_criteria='mean_test_score')
Bases: MetaEstimatorMixin, SelectorMixin, BaseEstimator
Forward feature selection with cross-validation.
The selector starts by asking importance_getter for scores against
an empty selection and picks the top-scoring feature. It then
iteratively asks the importance getter again — passing the indices of
already-selected features — and adds the highest-scoring not-yet-
selected feature. Cross-validation is used to score the model trained
on the running selection.
The algorithm:

    scores = importance_getter(X, y, [])
    selected = [argmax(scores)]
    while len(selected) < max_features_to_select:
        scores = importance_getter(X, y, selected)
        scores[selected] = -inf
        selected.append(argmax(scores))
        if (len(selected) >= min_features_to_select
                and (len(selected) - min_features_to_select) % step == 0):
            evaluate(selected) via cross-validation
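In runnable form, the loop above can be sketched in plain NumPy. This is a simplified illustration, not the library's implementation: the `forward_select` name and the variance-based toy importance getter are this sketch's own, and `evaluate` stands in for the cross-validation step.

```python
import numpy as np

def forward_select(X, y, importance_getter, min_features=1,
                   max_features=None, step=1, evaluate=None):
    """Simplified forward-selection loop (illustrative sketch)."""
    if max_features is None:
        max_features = X.shape[1]
    scores = np.asarray(importance_getter(X, y, []), dtype=float)
    selected = [int(np.argmax(scores))]
    while len(selected) < max_features:
        scores = np.asarray(importance_getter(X, y, selected), dtype=float)
        scores[selected] = -np.inf  # mask already-selected features
        selected.append(int(np.argmax(scores)))
        if (len(selected) >= min_features
                and (len(selected) - min_features) % step == 0
                and evaluate is not None):
            evaluate(selected)  # e.g. cross_val_score on X[:, selected]
    return selected

# Toy importance getter: per-feature variance, ignoring the selection so far.
rng = np.random.default_rng(0)
X = rng.normal(scale=[1.0, 3.0, 2.0], size=(50, 3))
order = forward_select(X, None, lambda X, y, sel: X.var(axis=0))
```

With this stateless getter the features come out in descending variance order, mirroring how the `'auto'` mutual-information getter produces a fixed descending-score order.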
Parameters:

- `estimator` (``Estimator`` instance) – A supervised learning estimator with a `fit` method, used to score candidate feature subsets via cross-validation. The estimator is not used to drive selection.
- `step` (int or float, default: 1) – Number of features added between two consecutive cross-validation evaluations. If greater than or equal to 1, this is the integer number of features added per evaluation. If within (0.0, 1.0), it is the fraction (rounded down, with a floor of 1) of the already-selected features added per evaluation, growing the selection geometrically. Selection within a step still happens one feature at a time, calling `importance_getter` after every addition.
- `min_features_to_select` (int, default: None) – Minimum number of features that must be selected before the first cross-validation evaluation. Features are still selected via the importance getter before this threshold, but no CV scoring takes place. If None, defaults to 1 (CV evaluation starts from the very first selected feature).
- `max_features_to_select` (int, default: None) – Maximum number of features to select. The forward process stops once this many features have been selected. If None, defaults to all features in `X`.
- `cv` (int, cross-validation generator or an iterable, default: None) – Determines the cross-validation splitting strategy. See `~sklearn.model_selection.check_cv` for accepted inputs.
- `scoring` (str, callable or None, default: None) – Scorer used to evaluate the estimator on each CV fold.
- `verbose` (int, default: 0) – Controls verbosity of output.
- `n_jobs` (int or None, default: None) – Number of cores to run in parallel while fitting across folds.
- `random_state` (int, RandomState instance or None, default: None) – Seed used by the default mutual-information importance getter and by `plot`.
- `importance_getter` ('auto' or callable, default: 'auto') – Feature scoring strategy used to drive selection.
    - 'auto': use `mutual_info_classif` when `estimator` is a classifier, otherwise `mutual_info_regression`. Scores are computed once on the full `(X, y)` and reused for every step, so the order of selection is simply descending mutual information.
    - callable: a function with signature `importance_getter(X, y, selected_idx) -> scores`, where `selected_idx` is a list of indices of currently selected features and `scores` is an array of shape `(n_features,)`. The feature with the highest score among those not in `selected_idx` is added next; already-selected features are masked by the selector, so the callable may return any value for them. The selector always starts a fresh selection by calling the callable with an empty list, so stateful scorers may use that signal to invalidate caches.
- `callbacks` (list of callable, default: None) – List of callables called at the end of each evaluated step. Each callable receives `(selector, scores)`, where `scores` is the last array returned by `importance_getter`.
- `best_iteration_selection_criteria` (str or callable, default: 'mean_test_score') – Either a key into `cv_results_` (the iteration that maximises that key is picked) or a callable `f(cv_results) -> n_features` that must return one of the values in `cv_results_["n_features"]`.
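As an example of a custom importance getter, the selection-aware callable below scores each feature by the absolute correlation between that feature and the residual of a least-squares fit on the already-selected columns. The `residual_correlation_getter` name and the scoring rule are this sketch's own, not part of the library; only the `(X, y, selected_idx) -> scores` signature comes from the documentation above.

```python
import numpy as np

def residual_correlation_getter(X, y, selected_idx):
    """Score features by |correlation| with the residual of y after
    regressing out the currently selected features (illustrative)."""
    y = np.asarray(y, dtype=float)
    if selected_idx:
        Xs = X[:, selected_idx]
        # Least-squares fit on the selected columns; the residual is
        # the part of y they do not yet explain.
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        residual = y - Xs @ coef
    else:
        residual = y - y.mean()  # fresh selection: correlate with centred y
    Xc = X - X.mean(axis=0)
    rc = residual - residual.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(rc) + 1e-12
    return np.abs(Xc.T @ rc) / denom

# Demo on toy data where y depends mainly on feature 2:
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = 2 * X[:, 2] + 0.1 * rng.normal(size=100)
scores = residual_correlation_getter(X, y, [])
```

Such a callable would be passed as `importance_getter=residual_correlation_getter`; already-selected features need no special handling here because the selector masks them before taking the argmax.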
Attributes:

- `classes_` (ndarray of shape (n_classes,)) – The class labels. Only available when `estimator` is a classifier.
- `estimator_` (``Estimator`` instance) – The estimator refit on the selected features.
- `cv_results_` (dict of lists) – A dict with keys `n_features`, `mean_test_score`, `std_test_score`, `mean_train_score`, `std_train_score` and `split{k}_{train,test}_score` for each CV fold.
- `n_features_` (int) – The number of selected features (after picking the best CV iteration).
- `n_features_in_` (int) – Number of features seen during `fit`.
- `feature_names_in_` (ndarray of shape (`n_features_in_`,)) – Names of features seen during `fit`. Defined only when `X` has feature names that are all strings.
- `ranking_` (ndarray of shape (n_features_in_,)) – The order in which features were selected. `ranking_[i] == 1` means feature `i` was the first selected. Features that were never selected receive a rank greater than the highest assigned one.
- `support_` (ndarray of shape (n_features_in_,)) – The mask of currently selected features. Can be changed via `set_n_features_to_select`.
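The relationship between `ranking_` and `support_` can be sketched with plain NumPy. The selection order and the number of kept features below are made-up values for illustration, following the rank convention described above (rank 1 is the first feature selected, unselected features rank after all selected ones):

```python
import numpy as np

# Suppose 5 input features, selected in the order 3, 0, 4, with the
# best CV iteration keeping the first 2 (hypothetical values).
selection_order = [3, 0, 4]
n_features_in, n_kept = 5, 2

# Unselected features get a rank greater than the highest assigned one.
ranking = np.full(n_features_in, len(selection_order) + 1)
for rank, idx in enumerate(selection_order, start=1):
    ranking[idx] = rank

# The support mask keeps the n_kept best-ranked features.
support = ranking <= n_kept
```

Here `ranking` comes out as `[2, 4, 4, 1, 3]` and `support` as `[True, False, False, True, False]`: features 3 and 0 were selected first and second, so they are the two kept.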
Examples:
>>> from felimination.forward import ForwardSelectorCV
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = make_classification(n_samples=200, n_features=10, random_state=0)
>>> selector = ForwardSelectorCV(
... LogisticRegression(),
... min_features_to_select=2,
... max_features_to_select=8,
... step=1,
... cv=3,
... random_state=0,
... ).fit(X, y)
>>> selector.support_.sum() > 0
True
Source code in felimination/forward.py
plot(**kwargs)

Plot the cross-validation curve over the number of features.

Parameters:

- `**kwargs` (dict, default: {}) – Forwarded to `seaborn.lineplot`.

Returns:

- `Axes` – The axes the curve was drawn on.
select_best_iteration(cv_results)

Return the best `n_features` value given `cv_results_`.
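A callable passed as `best_iteration_selection_criteria` can implement a more conservative choice than the plain maximum. The sketch below picks the smallest `n_features` whose mean test score is within one standard deviation of the best; the `one_se_rule` name, the rule itself, and the example scores are this sketch's own, while the `f(cv_results) -> n_features` contract comes from the parameter documentation above.

```python
import numpy as np

def one_se_rule(cv_results):
    """Smallest n_features whose mean test score is within one
    standard deviation of the best mean (illustrative sketch)."""
    n_features = np.asarray(cv_results["n_features"])
    mean = np.asarray(cv_results["mean_test_score"])
    std = np.asarray(cv_results["std_test_score"])
    best = int(np.argmax(mean))
    threshold = mean[best] - std[best]
    # Among iterations scoring at least `threshold`, prefer the smallest model.
    return int(n_features[mean >= threshold].min())

# Example cv_results with made-up scores:
cv_results = {"n_features": [1, 2, 3, 4],
              "mean_test_score": [0.70, 0.80, 0.82, 0.81],
              "std_test_score": [0.05, 0.03, 0.03, 0.04]}
best_n = one_se_rule(cv_results)
```

Here the best mean (0.82 at 3 features) minus its standard deviation gives a threshold of 0.79, so the rule settles on 2 features, which is a valid return value because it appears in `cv_results["n_features"]`.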
set_n_features_to_select(n_features_to_select)
Change the number of selected features after fitting.
The underlying estimator is not retrained: `predict` and
`predict_proba` keep using the model fit on the originally
selected features. Only `support_`, `transform` and
`get_feature_names_out` are affected.
Parameters:

- `n_features_to_select` (int) – Must be one of the values in `cv_results_["n_features"]`.