Genetic algorithms
This module contains the implementation of the Hybrid Genetic Algorithm-Importance with
Cross-Validation. The algorithm is implemented in the HybridImportanceGACVFeatureSelector class.
HybridImportanceGACVFeatureSelector(estimator, *, cv=5, scoring=None, random_state=None, n_jobs=None, importance_getter='auto', min_features_to_select=1, init_avg_features_num=15, init_std_features_num=5, pool_size=20, is_parent_selection_chance_proportional_to_fitness=True, n_children_cross_over=5, n_parents_cross_over=2, n_mutations=5, range_change_n_features_mutation=(-2, 3), range_randomly_swapped_features_mutation=(1, 4), max_generations=100, patience=5, callbacks=None, fitness_function='mean_test_score', mutation_candidate_scorer=None, mutation_candidate_selection='sample')
Bases: SelectorMixin, MetaEstimatorMixin, BaseEstimator
Feature selection using Hybrid Genetic Algorithm-Importance with Cross-Validation.
This feature selector uses a genetic algorithm to select features. The genetic algorithm is hybridized with feature importance, which is calculated using a cross-validation scheme. The algorithm works as follows:
- Pool initialization: the pool is initialized with random features. The number of features in each solution is drawn from a normal distribution parameterized by the average number of features to select and its standard deviation, then clipped to lie between the minimum number of features to select and the number of features in the dataset.
- Cross over: children are created by combining the features of the parents. The features are sorted by importance and combined in a round-robin fashion, so the children inherit the most important features of the parents. The number of features of a child is the average of the number of features of its parents.
- Mutation: the number of features is randomly changed and the least important features are replaced with random features.
- Selection: the top pool_size solutions are selected based on the fitness function.
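The loop above can be sketched in a few lines. This is a toy illustration only: it drops the importance-based feature ordering, the CV evaluation and the patience-based early stopping of the real selector, and `run_ga` is a hypothetical function, not part of felimination.

```python
import random

def run_ga(n_features, evaluate, pool_size=20, n_children=5, n_mutations=5,
           max_generations=10, avg_init=15, std_init=5, seed=42):
    rng = random.Random(seed)
    # Pool initialization: random subsets whose size is drawn from a normal
    # distribution, clipped to [1, n_features]
    pool = []
    for _ in range(pool_size):
        k = max(1, min(n_features, round(rng.gauss(avg_init, std_init))))
        pool.append(set(rng.sample(range(n_features), k)))
    for _ in range(max_generations):
        # Cross over: a child combines its parents' features; its size is the
        # average of the parents' sizes (sorted() stands in for the
        # importance-based round-robin ordering of the real selector)
        for _ in range(n_children):
            a, b = rng.sample(pool, 2)
            child_size = (len(a) + len(b)) // 2
            pool.append(set(sorted(a | b)[:child_size]))
        # Mutation: swap a random feature for a random candidate
        for _ in range(n_mutations):
            mutant = set(rng.choice(pool))
            mutant.discard(rng.choice(sorted(mutant)))
            mutant.add(rng.randrange(n_features))
            pool.append(mutant)
        # Selection: keep the top pool_size solutions by fitness
        pool.sort(key=evaluate, reverse=True)
        pool = pool[:pool_size]
    return pool[0]

# Favour larger feature sets just to exercise the loop; under this toy
# fitness the full feature set ends up on top.
best = run_ga(n_features=10, evaluate=len)
print(best)
```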
Parameters:
- estimator (object) – An estimator that follows the scikit-learn API and has a fit method.
- cv (int, cross-validation generator or an iterable, default: 5) – Determines the cross-validation splitting strategy. Possible inputs for cv are:
  - None, to use the default 5-fold cross-validation,
  - int, to specify the number of folds in a (Stratified)KFold,
  - a CV splitter,
  - an iterable yielding (train, test) splits as arrays of indices.
- scoring (str, callable or None, default: None) – A string (see the model evaluation documentation) or a scorer callable object/function with signature scorer(estimator, X, y).
- random_state (int or None, default: None) – Controls the random seed given at the beginning of the algorithm.
- n_jobs (int or None, default: None) – The number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context; -1 means using all processors.
- importance_getter (str or callable, default: 'auto') – If 'auto', uses the feature importance through either the coef_ or feature_importances_ attribute of the estimator. Also accepts a string that specifies an attribute name/path for extracting feature importance, for example regressor_.coef_ in case of sklearn.compose.TransformedTargetRegressor or named_steps.clf.feature_importances_ in case of a sklearn.pipeline.Pipeline with its last step named clf. If callable, overrides the default feature importance getter. The callable is passed the fitted estimator and the validation set as (X_val, y_val, estimator) and should return an importance for each feature.
- min_features_to_select (int or float, default: 1) – The minimum number of features to select. If float, it represents the fraction of features to select.
- init_avg_features_num (float, default: 15) – The average number of features to select in the initial pool of solutions.
- init_std_features_num (float, default: 5) – The standard deviation of the number of features to select in the initial pool of solutions.
- pool_size (int, default: 20) – The number of solutions in the pool.
- n_children_cross_over (int, default: 5) – The number of children to create by cross-over.
- is_parent_selection_chance_proportional_to_fitness (bool, default: True) – If True, the probability of selecting a parent is proportional to its fitness, so the fittest parents are more likely to be selected during cross-over.
- n_parents_cross_over (int, default: 2) – The number of parents to select in each cross-over. More than 2 parents can be selected; in that case, the top features of each parent are combined in a round-robin fashion to create a child, whose number of features is the average of the number of features of its parents.
- n_mutations (int, default: 5) – The number of mutations to apply to the pool.
- range_change_n_features_mutation (tuple, default: (-2, 3)) – The range of the change in the number of features during mutation. The first element is the minimum and the second element the maximum; the right limit is exclusive.
- range_randomly_swapped_features_mutation (tuple, default: (1, 4)) – The range of the number of features to replace during mutation. The first element is the minimum and the second element the maximum; the right limit is exclusive.
- mutation_candidate_scorer (callable or None, default: None) – Optional scoring function used to rank candidate features (those not already in the mutated element) when selecting replacements during mutation. Signature: scorer(X, y, selected_features) -> array-like of shape (n_features,), where higher scores indicate more desirable candidates and selected_features is the list of features currently in the pool element being mutated (same type as the feature identifiers used throughout: integer column indices for arrays, column names for DataFrames). When None, replacement features are chosen uniformly at random (original behaviour). The scorer is called once per mutation with the full training data and the element's current feature set. felimination.mrmr.MRMRRanker can be passed directly; it auto-initialises on the first call and caches per-feature redundance vectors across subsequent calls within the same fit.
- mutation_candidate_selection ('best' or 'sample', default: 'sample') – How to pick replacement features from the scored candidates. Only used when mutation_candidate_scorer is not None. 'best': deterministically select the highest-scored candidates. 'sample': sample without replacement with probability proportional to the score, preserving diversity while favouring high-scoring features.
- max_generations (int, default: 100) – The maximum number of generations to run the genetic algorithm.
- patience (int, default: 5) – The number of generations without improvement to wait before stopping the algorithm.
- callbacks (list of callable, default: None) – A list of callables that are called after each generation. Each callable should accept the selector and the pool as arguments.
- fitness_function (str or callable, default: 'mean_test_score') – The fitness function to use. Possible string values are 'mean_test_score' and 'mean_train_score'. If a callable is passed, it should accept a list of dictionaries, where each dictionary has the keys 'features', 'mean_test_score' and 'mean_train_score', and return a list of floats with the fitness of each element in the pool. See also rank_mean_test_score_overfit_fitness.
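As an illustration of the mutation_candidate_scorer contract, a hypothetical scorer could rank candidate features by their absolute correlation with the target. `correlation_scorer` is not part of felimination; it only follows the documented signature.

```python
import numpy as np

def correlation_scorer(X, y, selected_features):
    """Score every feature in X by |corr(feature, y)|; higher is better.

    Called with the full training data and the features currently in the
    mutated element; returns one score per feature in X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    # The selector only considers features outside `selected_features`,
    # but zeroing them out makes the intent explicit.
    if len(selected_features):
        scores[list(selected_features)] = 0.0
    return scores

X = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1, 0, 1])
# Feature 0 tracks y exactly, feature 1 is uncorrelated with y.
print(correlation_scorer(X, y, []))
```

It could then be passed as `mutation_candidate_scorer=correlation_scorer`, for example together with `mutation_candidate_selection='best'`.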
Attributes:
- estimator_ (object) – The fitted estimator.
- support_ (array of shape (n_features,)) – The mask of selected features.
- best_solution_ (dict) – The best solution found by the genetic algorithm. It is a dictionary with the following keys:
  - features (list of int): the features selected for this element.
  - mean_test_score (float): the mean test score of the element.
  - mean_train_score (float): the mean train score of the element.
  - train_scores_per_fold (list of float): the train score of each fold.
  - test_scores_per_fold (list of float): the test score of each fold.
  - cv_importances (list of array): the importances of each fold.
  - mean_cv_importances (array): the mean of the importances across folds.
  - generation (int): the generation at which this solution was the best in the pool.
- best_solutions_ (list of dict) – The best solutions found by the genetic algorithm at each generation. Each element is structured as in best_solution_.
- evaluation_cache_ (dict) – Cache mapping frozenset(features) to evaluation results (scores and importances). Populated during fit; allows _evaluate_calculate_importances to skip CV for feature sets that have already been evaluated in the same fit call.
Examples:
>>> from felimination.ga import HybridImportanceGACVFeatureSelector
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = make_classification(
...     n_samples=100,
...     n_features=10,
...     n_informative=5,
...     n_redundant=0,
...     n_classes=2,
...     n_clusters_per_class=1,
...     shuffle=False,
...     random_state=42,
... )
>>> estimator = LogisticRegression(random_state=42)
>>> selector = HybridImportanceGACVFeatureSelector(
...     estimator,
...     random_state=42,
...     init_avg_features_num=2,
...     init_std_features_num=1,
... )
>>> selector = selector.fit(X, y)
>>> selector.support_
array([ True,  True,  True,  True,  True, False, False, False, False,
       False])
Source code in felimination/ga.py
decision_function(X)
Compute the decision function of X.
Parameters:
- X (array-like or sparse matrix) – The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
Returns:
- score (array of shape [n_samples, n_classes] or [n_samples]) – The decision function of the input samples. The order of the classes corresponds to that in the attribute classes_. Regression and binary classification produce an array of shape [n_samples].
fit(X, y, groups=None, **params)
Fit the selector and then the underlying estimator on the selected features.
Parameters:
- X (array-like or sparse matrix) – The training input samples.
- y (array-like of shape (n_samples,)) – The target values.
- **params (dict) – Additional parameters passed to the fit method of the underlying estimator.
Returns:
-
self(object) –Fitted estimator.
plot(**kwargs)
Plot the mean test score and mean train score of the best solution at each generation.
Parameters:
- **kwargs (dict) – Additional parameters passed to seaborn.lineplot. For a list of possible options, please refer to the seaborn.lineplot documentation.
Returns:
- Axes – The axes on which the plot has been drawn.
predict(X)
Reduce X to the selected features and predict using the estimator.
Parameters:
- X (array of shape [n_samples, n_features]) – The input samples.
Returns:
- y (array of shape [n_samples]) – The predicted target values.
predict_log_proba(X)
Predict class log-probabilities for X.
Parameters:
- X (array of shape [n_samples, n_features]) – The input samples.
Returns:
- p (array of shape (n_samples, n_classes)) – The class log-probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
predict_proba(X)
Predict class probabilities for X.
Parameters:
- X (array-like or sparse matrix) – The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
Returns:
- p (array of shape (n_samples, n_classes)) – The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
score(X, y, **fit_params)
Reduce X to the selected features and return the score of the estimator.
Parameters:
- X (array of shape [n_samples, n_features]) – The input samples.
- y (array of shape [n_samples]) – The target values.
- **fit_params (dict) – Parameters to pass to the score method of the underlying estimator. Added in version 1.0.
Returns:
- score (float) – Score of the underlying base estimator computed with the selected features returned by transform(X) and y.
rank_mean_test_score_fitness(pool)
Define the fitness function as the rank of the mean test score.
The rank of the mean test score is calculated by ranking the mean test score in ascending order.
Parameters:
- pool (list of dict) – Each element in the list is a dictionary with the following keys:
  - features (list of int): the features selected for this element.
  - mean_test_score (float): the mean test score of the element.
  - mean_train_score (float): the mean train score of the element.
Returns:
- fitness (list of float) – The fitness of each element in the pool.
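The ranking described above can be sketched as a minimal re-implementation for illustration; the actual code in felimination/ga.py may differ in detail.

```python
import numpy as np

def rank_fitness(pool):
    # Rank mean test scores in ascending order: the worst element gets
    # rank 0, the best gets rank len(pool) - 1.
    scores = np.array([element["mean_test_score"] for element in pool])
    ranks = scores.argsort().argsort()
    return ranks.astype(float).tolist()

pool = [
    {"features": [0, 1], "mean_test_score": 0.70, "mean_train_score": 0.80},
    {"features": [2],    "mean_test_score": 0.90, "mean_train_score": 0.95},
    {"features": [0, 3], "mean_test_score": 0.80, "mean_train_score": 0.85},
]
print(rank_fitness(pool))  # [0.0, 2.0, 1.0]
```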
rank_mean_test_score_overfit_fitness(pool)
Define the fitness function as the sum of the rank of the mean test score and the rank of the overfit.
The rank of the mean test score is calculated by ranking the mean test score in ascending order. The rank of the overfit is calculated by ranking the overfit in ascending order. The overfit is calculated as the difference between the mean train score and the mean test score. The fitness is the sum of the rank of the mean test score and the rank of the overfit.
Parameters:
- pool (list of dict) – Each element in the list is a dictionary with the following keys:
  - features (list of int): the features selected for this element.
  - mean_test_score (float): the mean test score of the element.
  - mean_train_score (float): the mean train score of the element.
Returns:
- fitness (list of float) – The fitness of each element in the pool.
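A sketch of this combined ranking, assuming that a smaller train/test gap (less overfit) should receive a higher rank; the exact sign convention in felimination/ga.py may differ.

```python
import numpy as np

def rank_overfit_fitness(pool):
    test = np.array([e["mean_test_score"] for e in pool])
    train = np.array([e["mean_train_score"] for e in pool])
    overfit = train - test
    test_rank = test.argsort().argsort()            # higher test score -> higher rank
    overfit_rank = (-overfit).argsort().argsort()   # smaller gap -> higher rank (assumption)
    return (test_rank + overfit_rank).astype(float).tolist()

pool = [
    {"features": [0, 1], "mean_test_score": 0.80, "mean_train_score": 0.95},
    {"features": [2],    "mean_test_score": 0.78, "mean_train_score": 0.80},
    {"features": [0, 3], "mean_test_score": 0.85, "mean_train_score": 0.99},
]
print(rank_overfit_fitness(pool))  # [1.0, 2.0, 3.0]
```

The third element wins: it has the best test score, and its heavy overfit rank is offset by the first element's similar gap.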