
TargetEncoder

sklearo.encoding.TargetEncoder(columns=(nw.Categorical, nw.String), unseen='raise', fill_value_unseen='mean', missing_values='encode', underrepresented_categories='raise', fill_values_underrepresented='mean', target_type='auto', smooth='auto', cv=5)

Target Encoder for categorical features.

This class provides functionality to encode categorical features using the Target Encoding technique. Target Encoding replaces each category with the mean of the target variable for that category. This method is particularly useful for handling categorical variables in machine learning models, especially when the number of categories is large.

The mean target per category is blended with the overall mean target using a smoothing parameter. The smoothing parameter is calculated as explained here.
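The blending described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the common form in which a fixed `smooth` value acts as a number of pseudo-observations placed at the overall mean; the helper name `smoothed_target_means` is hypothetical, and the `'auto'` mode is not reproduced here.

```python
from collections import defaultdict


def smoothed_target_means(categories, targets, smooth):
    """Blend each category's mean target with the overall mean target.

    `smooth` is an assumed fixed value: it behaves like a count of
    pseudo-observations located at the overall mean.
    """
    overall_mean = sum(targets) / len(targets)

    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    # Categories with few rows are pulled more strongly towards the overall mean.
    return {
        category: (sums[category] + smooth * overall_mean) / (counts[category] + smooth)
        for category in counts
    }


encoding = smoothed_target_means(["A", "A", "A", "B"], [1, 1, 0, 0], smooth=2.0)
```

In this sketch, category "B" has a single row, so its encoding moves further from its raw mean towards the overall mean than the encoding of "A" does.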

Notes

Cross-fitting 🏋️‍♂️

This implementation uses an internal cross-fitting strategy to calculate the mean target values in the fit_transform method. As a result, calling .fit(X, y).transform(X) does not return the same result as calling .fit_transform(X, y). When .fit_transform(X, y) is called, the dataset is first split into k folds (configurable via the cv parameter). Each fold is then encoded with mean target values calculated from the data in all the other folds. Finally, the transformer is fitted on the entire dataset. This prevents target information from leaking into the training data. The idea is taken from scikit-learn's implementation of TargetEncoder; the reader is encouraged to learn more about cross-fitting in the scikit-learn documentation.
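To make the difference between the two call patterns concrete, here is a rough pure-Python sketch of cross-fitting. It is not the library's code: the helpers `category_means` and `cross_fit_encode` are hypothetical, smoothing is left out, and the fold assignment (row i goes to fold i % cv, no shuffling) is an assumption chosen to keep the example deterministic.

```python
from collections import defaultdict


def category_means(categories, targets):
    """Plain (unsmoothed) mean target per category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in counts}


def cross_fit_encode(categories, targets, cv):
    """Encode each row using means learned from the other folds only (cv >= 2)."""
    n_rows = len(categories)
    encoded = [None] * n_rows
    for fold in range(cv):
        # Assumed split: row i belongs to fold i % cv (no shuffling).
        train = [i for i in range(n_rows) if i % cv != fold]
        held_out = [i for i in range(n_rows) if i % cv == fold]

        train_targets = [targets[i] for i in train]
        means = category_means([categories[i] for i in train], train_targets)
        fallback = sum(train_targets) / len(train_targets)

        for i in held_out:
            # A category absent from the training folds falls back to their mean.
            encoded[i] = means.get(categories[i], fallback)
    return encoded


categories = ["A", "A", "B", "B", "A", "B"]
targets = [1, 0, 1, 1, 0, 0]

# Analogue of fit_transform: out-of-fold encodings.
cross_fitted = cross_fit_encode(categories, targets, cv=2)

# Analogue of fit followed by transform: full-data means applied to every row.
full_means = category_means(categories, targets)
fit_then_transform = [full_means[c] for c in categories]
```

In the cross-fitted list, no row's encoding is computed from that row's own target value, which is the leakage the note above is concerned with.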

Parameters:

  • columns ((str, list[str], list[DTypes]), default: (Categorical, String) ) –

    List of columns to encode.

    • If a list of strings is passed, it is treated as a list of column names to encode.
    • If a single string is passed instead, it is treated as a regular expression pattern to match column names.
    • If a list of narwhals.typing.DTypes is passed, it will select all columns matching the specified dtype.
  • unseen (str, default: 'raise' ) –

    Strategy to handle categories that appear during the transform step but were never encountered in the fit step.

    • If 'raise', an error is raised when unseen categories are found.
    • If 'ignore', the unseen categories are encoded with the fill_value_unseen.
  • fill_value_unseen ((int, float, None | Literal['mean']), default: 'mean' ) –

    Fill value to use for unseen categories. Defaults to "mean", which will use the mean of the target variable.

  • missing_values (str, default: 'encode' ) –

    Strategy to handle missing values.

    • If 'encode', missing values are initially replaced with a specified fill value and the mean is computed as if it were a regular category.
    • If 'ignore', missing values are left as is.
    • If 'raise', an error is raised when missing values are found.
  • underrepresented_categories (str, default: 'raise' ) –

    Strategy to handle categories that are underrepresented in the training data.

    • If 'raise', an error is raised when underrepresented categories are found.
    • If 'fill', underrepresented categories are filled with a specified fill value.
  • fill_values_underrepresented ((float, None | Literal['mean']), default: 'mean' ) –

    Fill value to use for underrepresented categories. Defaults to "mean", which will use the mean of the target variable.

  • target_type (str, default: 'auto' ) –

    Type of the target variable.

    • If 'auto', the type is inferred from the target variable using infer_target_type.
    • If 'binary', the target variable is binary.
    • If 'multiclass', the target variable is multiclass.
    • If 'continuous', the target variable is continuous.
  • smooth ((float, Literal['auto']), default: 'auto' ) –

    Smoothing parameter to avoid overfitting. If 'auto', the smoothing parameter is calculated based on the variance of the target variable.

  • cv (int, default: 5 ) –

    Number of cross-validation folds to use for calculating the target encoding.
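The interaction between unseen and fill_value_unseen can be illustrated with a small stand-alone sketch. The function `apply_encoding` and the values below are hypothetical and only mimic the documented behaviour; they are not part of the library.

```python
def apply_encoding(values, encoding_map, unseen="raise", fill_value=None):
    """Map categories to learned encodings, handling categories not seen during fit."""
    unseen_categories = sorted({v for v in values if v not in encoding_map})
    if unseen_categories and unseen == "raise":
        raise ValueError(f"Unseen categories found: {unseen_categories}")
    # With unseen="ignore", unseen categories receive the fill value.
    return [encoding_map.get(v, fill_value) for v in values]


learned = {"A": 0.2, "B": 0.8}  # stand-in for a mapping learned during fit
target_mean = 0.5               # stand-in for fill_value_unseen="mean"

# "C" was never seen during fit, so it is encoded with the fill value.
filled = apply_encoding(["A", "C", "B"], learned, unseen="ignore", fill_value=target_mean)

# With the default strategy, the same input is rejected instead.
try:
    apply_encoding(["A", "C"], learned, unseen="raise")
    raised = False
except ValueError:
    raised = True
```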

Attributes:

  • columns_ (list[str]) –

    List of columns to be encoded, learned during fit.

  • encoding_map_ (dict[str, float]) –

    Mapping of categories to their mean target values, learned during fit.

Examples:

import pandas as pd
from sklearo.encoding import TargetEncoder
data = {
    "category": ["A", "A", "B", "B", "C", "C"],
    "target": [1, 0, 1, 0, 1, 0],
}
df = pd.DataFrame(data)
encoder = TargetEncoder()
encoder.fit(df[["category"]], df["target"])
encoded = encoder.transform(df[["category"]])
print(encoded)
   category
0       0.5
1       0.5
2       0.5
3       0.5
4       0.5
5       0.5

Class constructor for TargetEncoder.

check_target_type(y)

Check the type of the target variable.

fit(X, y)

Fit the encoder.

Parameters:

  • X (DataFrame) –

    The input data.

  • y (Series) –

    The target variable.

fit_transform(X, y)

Fit the encoder and transform the dataframe using cross-fitting.

Notes

Because of the cross-fitting nature of target encoding, the fit_transform method is NOT equivalent to calling fit followed by transform. Please refer to the note on cross-fitting.

Parameters:

  • X (DataFrame) –

    The input data.

  • y (Series) –

    The target variable.

Returns:

  • DataFrame –

    The transformed data, encoded using cross-fitting.

get_feature_names_out()

Get the output feature names.

transform(X)

Transform the data.

Parameters:

  • X (DataFrame) –

    The input data.