.. _common_pitfalls:
=========================================
Common pitfalls and recommended practices
=========================================
This section complements the documentation given
`here <https://scikit-learn.org/dev/common_pitfalls.html>`_ in scikit-learn.
Here, we highlight the issue of misusing resampling, which leads to
**data leakage**: due to this leakage, the reported performance of a model
will be over-optimistic.
Data leakage
============
As mentioned in the scikit-learn documentation, data leakage occurs when
information that would not be available at prediction time is used when
building the model.
In the resampling setting, a common pitfall is to resample the **entire**
dataset before splitting it into train and test partitions. Note that this
would be equivalent to resampling both the train and test partitions after
splitting.
Such processing leads to two issues:
* the model will not be tested on a dataset with a class distribution similar
  to the real use case. Indeed, by resampling the entire dataset, both the
  training and testing sets will potentially be balanced, whereas the model
  should be tested on the naturally imbalanced dataset to evaluate the
  potential bias of the model (this effect is illustrated in the short sketch
  after this list);
* the resampling procedure might use information about samples in the dataset
  to either generate or select some of the samples. Therefore, we might use
  information from samples which will later be used as testing samples, which
  is the typical data leakage issue.
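To make the first issue concrete, here is a minimal sketch on a small
synthetic dataset (the ``X_toy``/``y_toy`` names are purely illustrative):
undersampling the entire dataset before splitting leaves the test partition
balanced as well::
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from imblearn.under_sampling import RandomUnderSampler
>>> X_toy, y_toy = make_classification(
...     n_samples=1_000, weights=[0.95, 0.05], random_state=0
... )
>>> X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X_toy, y_toy)
>>> _, _, _, y_toy_test = train_test_split(X_res, y_res, random_state=0)
>>> # the test partition is now roughly balanced instead of following the
>>> # original ~95%/5% class distribution of the data
>>> test_counts = Counter(y_toy_test)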
We will demonstrate the wrong and right ways to resample and emphasize the
tools that one should use to avoid falling into this trap.
We will use the adult census dataset. For the sake of simplicity, we will only
use the numerical features. Also, we will make the dataset more imbalanced to
increase the effect of the wrongdoings::
>>> from sklearn.datasets import fetch_openml
>>> from imblearn.datasets import make_imbalance
>>> X, y = fetch_openml(
... data_id=1119, as_frame=True, return_X_y=True
... )
>>> X = X.select_dtypes(include="number")
>>> X, y = make_imbalance(
... X, y, sampling_strategy={">50K": 300}, random_state=1
... )
Let's first check the balancing ratio on this dataset::
>>> from collections import Counter
>>> {key: value / len(y) for key, value in Counter(y).items()}
{'<=50K': 0.988..., '>50K': 0.011...}
To later highlight some of these issues, we will keep aside a left-out set
that we will not use for the evaluation of the model::
>>> from sklearn.model_selection import train_test_split
>>> X, X_left_out, y, y_left_out = train_test_split(
... X, y, stratify=y, random_state=0
... )
We will use a :class:`sklearn.ensemble.HistGradientBoostingClassifier` as a
baseline classifier. First, we will train and check the performance of this
classifier, without any preprocessing to alleviate the bias toward the majority
class. We evaluate the generalization performance of the classifier via
cross-validation::
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.model_selection import cross_validate
>>> model = HistGradientBoostingClassifier(random_state=0)
>>> cv_results = cross_validate(
... model, X, y, scoring="balanced_accuracy",
... return_train_score=True, return_estimator=True,
... n_jobs=-1
... )
>>> print(
... f"Balanced accuracy mean +/- std. dev.: "
... f"{cv_results['test_score'].mean():.3f} +/- "
... f"{cv_results['test_score'].std():.3f}"
... )
Balanced accuracy mean +/- std. dev.: 0.609 +/- 0.024
We see that the classifier does not give good performance in terms of balanced
accuracy, mainly due to the class imbalance issue.
In the cross-validation, we stored the classifiers trained on each fold. We
will show that evaluating these classifiers on the left-out data gives close
statistical performance::
>>> import numpy as np
>>> from sklearn.metrics import balanced_accuracy_score
>>> scores = []
>>> for fold_id, cv_model in enumerate(cv_results["estimator"]):
... scores.append(
... balanced_accuracy_score(
... y_left_out, cv_model.predict(X_left_out)
... )
... )
>>> print(
... f"Balanced accuracy mean +/- std. dev.: "
... f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}"
... )
Balanced accuracy mean +/- std. dev.: 0.628 +/- 0.009
Let's now show the **wrong** pattern to apply when it comes to resampling to
alleviate the class imbalance issue. We will use a sampler to balance the
**entire** dataset and check the statistical performance of our classifier via
cross-validation::
>>> from imblearn.under_sampling import RandomUnderSampler
>>> sampler = RandomUnderSampler(random_state=0)
>>> X_resampled, y_resampled = sampler.fit_resample(X, y)
>>> model = HistGradientBoostingClassifier(random_state=0)
>>> cv_results = cross_validate(
... model, X_resampled, y_resampled, scoring="balanced_accuracy",
... return_train_score=True, return_estimator=True,
... n_jobs=-1
... )
>>> print(
... f"Balanced accuracy mean +/- std. dev.: "
... f"{cv_results['test_score'].mean():.3f} +/- "
... f"{cv_results['test_score'].std():.3f}"
... )
Balanced accuracy mean +/- std. dev.: 0.724 +/- 0.042
The cross-validation performance looks good, but evaluating the classifiers
on the left-out data shows a different picture::
>>> scores = []
>>> for fold_id, cv_model in enumerate(cv_results["estimator"]):
... scores.append(
... balanced_accuracy_score(
... y_left_out, cv_model.predict(X_left_out)
... )
... )
>>> print(
... f"Balanced accuracy mean +/- std. dev.: "
... f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}"
... )
Balanced accuracy mean +/- std. dev.: 0.698 +/- 0.014
We see that the performance is now worse than the cross-validated
performance. Indeed, the data leakage gave us over-optimistic results, for the
reasons stated earlier in this section.
We will now illustrate the correct pattern to use. Indeed, as in scikit-learn,
using a :class:`~imblearn.pipeline.Pipeline` avoids any data leakage because
the resampling is delegated to imbalanced-learn and does not require any
manual steps::
>>> from imblearn.pipeline import make_pipeline
>>> model = make_pipeline(
... RandomUnderSampler(random_state=0),
... HistGradientBoostingClassifier(random_state=0)
... )
>>> cv_results = cross_validate(
... model, X, y, scoring="balanced_accuracy",
... return_train_score=True, return_estimator=True,
... n_jobs=-1
... )
>>> print(
... f"Balanced accuracy mean +/- std. dev.: "
... f"{cv_results['test_score'].mean():.3f} +/- "
... f"{cv_results['test_score'].std():.3f}"
... )
Balanced accuracy mean +/- std. dev.: 0.732 +/- 0.019
We observe that we get good statistical performance as well. Now, we can check
the performance of the model from each cross-validation fold on the left-out
data to ensure that the performance is similar::
>>> scores = []
>>> for fold_id, cv_model in enumerate(cv_results["estimator"]):
... scores.append(
... balanced_accuracy_score(
... y_left_out, cv_model.predict(X_left_out)
... )
... )
>>> print(
... f"Balanced accuracy mean +/- std. dev.: "
... f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}"
... )
Balanced accuracy mean +/- std. dev.: 0.727 +/- 0.008
We see that the statistical performance is very close to that of the
cross-validation study, without any sign of over-optimistic results.
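Finally, note why the pipeline is safe: an imbalanced-learn
:class:`~imblearn.pipeline.Pipeline` applies its samplers only when fitting,
on the data passed to ``fit``; at prediction time the samplers are bypassed
and the incoming samples are left untouched. A minimal sketch, refitting the
pipeline defined above on the train partition and predicting on the left-out
set, illustrates this behaviour::
>>> _ = model.fit(X, y)  # the sampler only resamples the data given to `fit`
>>> y_pred = model.predict(X_left_out)  # no resampling at prediction time
>>> score = balanced_accuracy_score(y_left_out, y_pred)  # close to the left-out scores above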