WOEEncoder
sklearo.encoding.WOEEncoder(columns=(nw.Categorical, nw.String), underrepresented_categories='raise', fill_values_underrepresented=(-999.0, 999.0), unseen='raise', fill_value_unseen=0.0, missing_values='encode', cv=5)
Weight of Evidence (WOE) Encoder with support for multiclass classification.
This class encodes categorical features using the Weight of Evidence (WOE) technique. WOE is commonly used in credit scoring and other binary classification problems to transform categorical variables into continuous variables. It extends naturally to other classification problems, including multiclass classification.
WOE is defined as the natural logarithm of the ratio of the distribution of events for a class to the distribution of non-events for that class, that is, ln((% of events) / (% of non-events)).
Some articles define it the other way round, as ln((% of non-events) / (% of events)), but defined that way the WOE is inversely correlated with the target variable. This implementation uses the first formula, so the WOE is directly correlated with the target variable. I personally think this makes the WOE easier to interpret, and it does not affect the performance of the model.
Suppose the event to predict is default on a loan (class 1) and the non-event is not defaulting on a loan (class 0). The WOE for a category is calculated as follows:
WOE = ln((% of defaults in the category) / (% of non-defaults in the category))
    = ln(
        (number of defaults from the category / total number of defaults) /
        (number of non-defaults from the category / total number of non-defaults)
      )
Defined this way, the WOE is positive if the category is more likely to default (positive class) and negative if it is more likely to repay (negative class).
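The formula above can be checked by hand with a few lines of plain Python. This is only a minimal sketch of the arithmetic, not the library's code; it assumes a binary target and reuses the toy data from the Examples section further down. The variable names are illustrative.

```python
import math
from collections import Counter

# Toy data: three categories, binary target (1 = event, 0 = non-event).
categories = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
targets = [1, 0, 0, 1, 1, 0, 1, 1, 0]

# Count events and non-events per category.
events = Counter(c for c, t in zip(categories, targets) if t == 1)
non_events = Counter(c for c, t in zip(categories, targets) if t == 0)
total_events = sum(events.values())          # 5
total_non_events = sum(non_events.values())  # 4

# WOE = ln((events in category / all events) / (non-events in category / all non-events))
woe = {
    cat: math.log(
        (events[cat] / total_events) / (non_events[cat] / total_non_events)
    )
    for cat in sorted(set(categories))
}

for cat, value in woe.items():
    print(cat, round(value, 6))
# A -0.916291
# B 0.470004
# C 0.470004
```

Category A holds 1 of the 5 events but 2 of the 4 non-events, so its WOE is ln(0.2 / 0.5) and comes out negative. B and C each hold 2 of 5 events and 1 of 4 non-events, so their WOE is ln(0.4 / 0.25) and comes out positive.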
The WOE encoding is useful for logistic regression and other linear models, as it transforms the categorical variables into continuous variables that can be used as input features.
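The description above says that WOE extends to multiclass problems. One common way to do this is a one-vs-rest scheme, where each class in turn is treated as the event and all other classes as non-events. Whether WOEEncoder does exactly this internally is an assumption here; the function name `one_vs_rest_woe` and the data are purely illustrative.

```python
import math
from collections import Counter


def one_vs_rest_woe(categories, targets):
    """Return {class_label: {category: WOE}}, treating each class as the event in turn.

    Assumes every category has at least one event and one non-event for
    every class; otherwise the logarithm is undefined.
    """
    result = {}
    for positive in sorted(set(targets)):
        events = Counter(c for c, t in zip(categories, targets) if t == positive)
        non_events = Counter(c for c, t in zip(categories, targets) if t != positive)
        n_events = sum(events.values())
        n_non_events = sum(non_events.values())
        result[positive] = {
            cat: math.log(
                (events[cat] / n_events) / (non_events[cat] / n_non_events)
            )
            for cat in sorted(set(categories))
        }
    return result


# Two categories, three target classes.
categories = ["A", "A", "A", "A", "B", "B", "B", "B"]
targets = [0, 1, 2, 2, 0, 0, 1, 2]

woe_by_class = one_vs_rest_woe(categories, targets)
for label, mapping in woe_by_class.items():
    print(label, {cat: round(v, 4) for cat, v in mapping.items()})
```

With this scheme a single categorical column yields one encoded value per class, each directly correlated with membership in that class.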
Notes
Cross-fitting
This implementation uses an internal cross-fitting strategy to calculate the WOE values in the fit_transform method. This means that calling .fit(X, y).transform(X) will not return the same result as calling .fit_transform(X, y).
When calling .fit_transform(X, y), the dataset is first split into k folds (configurable via the cv parameter); then, for each fold, the WOE values are calculated using the data from all the other folds. Finally, the transformer is fitted on the entire dataset. This prevents leakage of target information into the training data. The idea is taken from scikit-learn's implementation of TargetEncoder.
The reader is encouraged to learn more about cross-fitting in the scikit-learn documentation.
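The cross-fitting procedure described above can be sketched in plain Python. This is an illustration of the idea only, under stated assumptions: a binary target, folds assigned by row index modulo the number of folds (the library's actual fold assignment is not documented here), and hypothetical helpers `woe_map` and `cross_fit_encode` whose fill values mirror the defaults of fill_values_underrepresented and fill_value_unseen.

```python
import math
from collections import Counter


def woe_map(categories, targets, fill_under=(-999.0, 999.0)):
    """WOE per category for a binary target (1 = event, 0 = non-event)."""
    events = Counter(c for c, t in zip(categories, targets) if t == 1)
    non_events = Counter(c for c, t in zip(categories, targets) if t == 0)
    n_events = sum(events.values())
    n_non_events = sum(non_events.values())
    mapping = {}
    for cat in set(categories):
        if events[cat] == 0:
            mapping[cat] = fill_under[0]  # category has no events
        elif non_events[cat] == 0:
            mapping[cat] = fill_under[1]  # category has no non-events
        else:
            mapping[cat] = math.log(
                (events[cat] / n_events) / (non_events[cat] / n_non_events)
            )
    return mapping


def cross_fit_encode(categories, targets, n_folds=3, fill_unseen=0.0):
    """Encode each row with WOE values learned from the other folds only."""
    n = len(categories)
    encoded = [None] * n
    for fold in range(n_folds):
        held_out = [i for i in range(n) if i % n_folds == fold]
        training = [i for i in range(n) if i % n_folds != fold]
        mapping = woe_map(
            [categories[i] for i in training],
            [targets[i] for i in training],
        )
        for i in held_out:
            encoded[i] = mapping.get(categories[i], fill_unseen)
    return encoded


categories = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
targets = [1, 0, 0, 1, 1, 0, 1, 1, 0]

# Out-of-fold encoding (the fit_transform idea) versus in-sample encoding
# (the fit followed by transform idea).
out_of_fold = cross_fit_encode(categories, targets, n_folds=3)
full_map = woe_map(categories, targets)
in_sample = [full_map[c] for c in categories]

print(out_of_fold)
print(in_sample)
```

On a dataset this small, several folds leave a category without events or non-events, so the fill values show up in the out-of-fold result. The point of the sketch is only that each row's encoding never uses that row's own target, which is why the two lists differ.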
Parameters:
- columns ((str, list[str], list[DTypes]), default: (Categorical, String)) – List of columns to encode.
  - If a list of strings is passed, it is treated as a list of column names to encode.
  - If a single string is passed instead, it is treated as a regular expression pattern to match column names.
  - If a list of narwhals.typing.DTypes is passed, all columns matching the specified dtypes are selected.
  Defaults to [narwhals.Categorical, narwhals.String], meaning that all categorical and string columns are selected by default.
- underrepresented_categories (str, default: 'raise') – Strategy to handle underrepresented categories. In this context, underrepresented categories are categories that are never associated with one of the target classes. In this case the WOE is undefined (mathematically it would be either -inf or inf).
  - If 'raise', an error is raised when a category is underrepresented.
  - If 'fill', the underrepresented categories are encoded using the fill_values_underrepresented values.
- fill_values_underrepresented (list[int, float, None], default: (-999.0, 999.0)) – Fill values to use for underrepresented categories. The first value is used when the category has no events (e.g. defaults) and the second value is used when the category has no non-events (e.g. non-defaults). Only used when underrepresented_categories='fill'.
- unseen (str, default: 'raise') – Strategy to handle categories that appear during the transform step but were never encountered in the fit step.
  - If 'raise', an error is raised when unseen categories are found.
  - If 'ignore', the unseen categories are encoded with the fill_value_unseen.
- fill_value_unseen ((int, float, None), default: 0.0) – Fill value to use for unseen categories. Only used when unseen='ignore'.
- missing_values (str, default: 'encode') – Strategy to handle missing values.
  - If 'encode', missing values are initially replaced with 'MISSING' and the WOE is computed as if it were a regular category.
  - If 'ignore', missing values are left as is.
  - If 'raise', an error is raised when missing values are found.
- cv (int, default: 5) – Number of cross-validation folds to use when calculating the WOE.
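To make the unseen and missing_values options more concrete, here is a small plain-Python sketch of how such strategies could be applied at transform time. It is not the library's implementation: the helper `apply_mapping`, the example WOE values, and the choice to route an encoded 'MISSING' value through the unseen strategy when it was not learned during fit are all assumptions made for illustration.

```python
MISSING = "MISSING"  # placeholder used by the 'encode' strategy


def apply_mapping(values, mapping, unseen="raise", fill_value_unseen=0.0,
                  missing_values="encode"):
    """Encode a list of category values with a {category: WOE} mapping."""
    encoded = []
    for value in values:
        if value is None:
            if missing_values == "raise":
                raise ValueError("Missing values found.")
            if missing_values == "ignore":
                encoded.append(None)  # left as is
                continue
            value = MISSING  # 'encode': treat as a regular category
        if value in mapping:
            encoded.append(mapping[value])
        elif unseen == "ignore":
            encoded.append(fill_value_unseen)
        else:
            raise ValueError(f"Unseen category: {value!r}")
    return encoded


# Hypothetical WOE values, as if learned during fit.
mapping = {"A": -0.9, "B": 0.47, MISSING: 0.1}

# "Z" was never seen during fit; None is a missing value.
result = apply_mapping(["A", None, "Z"], mapping, unseen="ignore")
print(result)  # [-0.9, 0.1, 0.0]
```

With the default unseen='raise', the same call would fail on "Z" instead of falling back to the fill value.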
Attributes:
- columns_ (list[str]) – List of columns to be encoded, learned during fit.
- encoding_map_ (dict[str, dict[str, float]]) – Nested dictionary mapping columns to their WOE values for each class, learned during fit.
- feature_names_in_ (list[str]) – List of feature names seen during fit.
Examples:
import pandas as pd
from sklearo.encoding import WOEEncoder

# Toy dataset: one categorical feature and a binary target.
data = {
    "category": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "target": [1, 0, 0, 1, 1, 0, 1, 1, 0],
}
df = pd.DataFrame(data)

# fit followed by transform applies the WOE values learned on the full dataset
# (no cross-fitting, unlike fit_transform).
encoder = WOEEncoder()
encoder.fit(df[["category"]], df["target"])
encoded = encoder.transform(df[["category"]])
print(encoded)

Output:
   category
0 -0.916291
1 -0.916291
2 -0.916291
3  0.470004
4  0.470004
5  0.470004
6  0.470004
7  0.470004
8  0.470004
Initializes the WOEEncoder with the specified parameters.
check_target_type(y)
Check the type of the target variable.
fit(X, y)
Fit the encoder.
Parameters:
- X (DataFrame) – The input data.
- y (Series) – The target variable.
fit_transform(X, y)
Fit the encoder and transform the dataframe using cross-fitting.
Notes
Due to the cross-fitting nature of target encoding, the fit_transform method is NOT equivalent to calling fit followed by transform. Please refer to the note on cross-fitting.
Parameters:
- X (DataFrame) – The input data.
- y (Series) – The target variable.
Returns:
- BaseTargetEncoder (BaseTargetEncoder) – The fitted encoder.
get_feature_names_out()
Get the output feature names.
transform(X)
Transform the data.
Parameters:
- X (DataFrame) – The input data.