1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253
|
"""
=====================================
Multi-class AdaBoosted Decision Trees
=====================================
This example shows how boosting can improve the prediction accuracy on a
multi-label classification problem. It reproduces a similar experiment as
depicted by Figure 1 in Zhu et al [1]_.
The core principle of AdaBoost (Adaptive Boosting) is to fit a sequence of weak
learners (e.g. Decision Trees) on repeatedly re-sampled versions of the data.
Each sample carries a weight that is adjusted after each training step, such
that misclassified samples will be assigned higher weights. The re-sampling
process with replacement takes into account the weights assigned to each sample.
Samples with higher weights have a greater chance of being selected multiple
times in the new data set, while samples with lower weights are less likely to
be selected. This ensures that subsequent iterations of the algorithm focus on
the difficult-to-classify samples.
.. topic:: References:
.. [1] :doi:`J. Zhu, H. Zou, S. Rosset, T. Hastie, "Multi-class adaboost."
Statistics and its Interface 2.3 (2009): 349-360.
<10.4310/SII.2009.v2.n3.a8>`
"""
# Noel Dawe <noel.dawe@gmail.com>
# License: BSD 3 clause
# %%
# Creating the dataset
# --------------------
# The classification dataset is constructed by taking a ten-dimensional standard
# normal distribution (:math:`x` in :math:`R^{10}`) and defining three classes
# separated by nested concentric ten-dimensional spheres such that roughly equal
# numbers of samples are in each class (quantiles of the :math:`\chi^2`
# distribution).
from sklearn.datasets import make_gaussian_quantiles
X, y = make_gaussian_quantiles(
n_samples=2_000, n_features=10, n_classes=3, random_state=1
)
# %%
# We split the dataset into 2 sets: 70 percent of the samples are used for
# training and the remaining 30 percent for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
X, y, train_size=0.7, random_state=42
)
# %%
# Training the `AdaBoostClassifier`
# ---------------------------------
# We train the :class:`~sklearn.ensemble.AdaBoostClassifier`. The estimator
# utilizes boosting to improve the classification accuracy. Boosting is a method
# designed to train weak learners (i.e. `estimator`) that learn from their
# predecessor's mistakes.
#
# Here, we define the weak learner as a
# :class:`~sklearn.tree.DecisionTreeClassifier` and set the maximum number of
# leaves to 8. In a real setting, this parameter should be tuned. We set it to a
# rather low value to limit the runtime of the example.
#
# The `SAMME` algorithm build into the
# :class:`~sklearn.ensemble.AdaBoostClassifier` then uses the correct or
# incorrect predictions made be the current weak learner to update the sample
# weights used for training the consecutive weak learners. Also, the weight of
# the weak learner itself is calculated based on its accuracy in classifying the
# training examples. The weight of the weak learner determines its influence on
# the final ensemble prediction.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
weak_learner = DecisionTreeClassifier(max_leaf_nodes=8)
n_estimators = 300
adaboost_clf = AdaBoostClassifier(
estimator=weak_learner,
n_estimators=n_estimators,
algorithm="SAMME",
random_state=42,
).fit(X_train, y_train)
# %%
# Analysis
# --------
# Convergence of the `AdaBoostClassifier`
# ***************************************
# To demonstrate the effectiveness of boosting in improving accuracy, we
# evaluate the misclassification error of the boosted trees in comparison to two
# baseline scores. The first baseline score is the `misclassification_error`
# obtained from a single weak-learner (i.e.
# :class:`~sklearn.tree.DecisionTreeClassifier`), which serves as a reference
# point. The second baseline score is obtained from the
# :class:`~sklearn.dummy.DummyClassifier`, which predicts the most prevalent
# class in a dataset.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
dummy_clf = DummyClassifier()
def misclassification_error(y_true, y_pred):
return 1 - accuracy_score(y_true, y_pred)
weak_learners_misclassification_error = misclassification_error(
y_test, weak_learner.fit(X_train, y_train).predict(X_test)
)
dummy_classifiers_misclassification_error = misclassification_error(
y_test, dummy_clf.fit(X_train, y_train).predict(X_test)
)
print(
"DecisionTreeClassifier's misclassification_error: "
f"{weak_learners_misclassification_error:.3f}"
)
print(
"DummyClassifier's misclassification_error: "
f"{dummy_classifiers_misclassification_error:.3f}"
)
# %%
# After training the :class:`~sklearn.tree.DecisionTreeClassifier` model, the
# achieved error surpasses the expected value that would have been obtained by
# guessing the most frequent class label, as the
# :class:`~sklearn.dummy.DummyClassifier` does.
#
# Now, we calculate the `misclassification_error`, i.e. `1 - accuracy`, of the
# additive model (:class:`~sklearn.tree.DecisionTreeClassifier`) at each
# boosting iteration on the test set to assess its performance.
#
# We use :meth:`~sklearn.ensemble.AdaBoostClassifier.staged_predict` that makes
# as many iterations as the number of fitted estimator (i.e. corresponding to
# `n_estimators`). At iteration `n`, the predictions of AdaBoost only use the
# `n` first weak learners. We compare these predictions with the true
# predictions `y_test` and we, therefore, conclude on the benefit (or not) of adding a
# new weak learner into the chain.
#
# We plot the misclassification error for the different stages:
import matplotlib.pyplot as plt
import pandas as pd
boosting_errors = pd.DataFrame(
{
"Number of trees": range(1, n_estimators + 1),
"AdaBoost": [
misclassification_error(y_test, y_pred)
for y_pred in adaboost_clf.staged_predict(X_test)
],
}
).set_index("Number of trees")
ax = boosting_errors.plot()
ax.set_ylabel("Misclassification error on test set")
ax.set_title("Convergence of AdaBoost algorithm")
plt.plot(
[boosting_errors.index.min(), boosting_errors.index.max()],
[weak_learners_misclassification_error, weak_learners_misclassification_error],
color="tab:orange",
linestyle="dashed",
)
plt.plot(
[boosting_errors.index.min(), boosting_errors.index.max()],
[
dummy_classifiers_misclassification_error,
dummy_classifiers_misclassification_error,
],
color="c",
linestyle="dotted",
)
plt.legend(["AdaBoost", "DecisionTreeClassifier", "DummyClassifier"], loc=1)
plt.show()
# %%
# The plot shows the missclassification error on the test set after each
# boosting iteration. We see that the error of the boosted trees converges to an
# error of around 0.3 after 50 iterations, indicating a significantly higher
# accuracy compared to a single tree, as illustrated by the dashed line in the
# plot.
#
# The misclassification error jitters because the `SAMME` algorithm uses the
# discrete outputs of the weak learners to train the boosted model.
#
# The convergence of :class:`~sklearn.ensemble.AdaBoostClassifier` is mainly
# influenced by the learning rate (i.e. `learning_rate`), the number of weak
# learners used (`n_estimators`), and the expressivity of the weak learners
# (e.g. `max_leaf_nodes`).
# %%
# Errors and weights of the Weak Learners
# ***************************************
# As previously mentioned, AdaBoost is a forward stagewise additive model. We
# now focus on understanding the relationship between the attributed weights of
# the weak learners and their statistical performance.
#
# We use the fitted :class:`~sklearn.ensemble.AdaBoostClassifier`'s attributes
# `estimator_errors_` and `estimator_weights_` to investigate this link.
weak_learners_info = pd.DataFrame(
{
"Number of trees": range(1, n_estimators + 1),
"Errors": adaboost_clf.estimator_errors_,
"Weights": adaboost_clf.estimator_weights_,
}
).set_index("Number of trees")
axs = weak_learners_info.plot(
subplots=True, layout=(1, 2), figsize=(10, 4), legend=False, color="tab:blue"
)
axs[0, 0].set_ylabel("Train error")
axs[0, 0].set_title("Weak learner's training error")
axs[0, 1].set_ylabel("Weight")
axs[0, 1].set_title("Weak learner's weight")
fig = axs[0, 0].get_figure()
fig.suptitle("Weak learner's errors and weights for the AdaBoostClassifier")
fig.tight_layout()
# %%
# On the left plot, we show the weighted error of each weak learner on the
# reweighted training set at each boosting iteration. On the right plot, we show
# the weights associated with each weak learner later used to make the
# predictions of the final additive model.
#
# We see that the error of the weak learner is the inverse of the weights. It
# means that our additive model will trust more a weak learner that makes
# smaller errors (on the training set) by increasing its impact on the final
# decision. Indeed, this exactly is the formulation of updating the base
# estimators' weights after each iteration in AdaBoost.
#
# |details-start| Mathematical details |details-split|
#
# The weight associated with a weak learner trained at the stage :math:`m` is
# inversely associated with its misclassification error such that:
#
# .. math:: \alpha^{(m)} = \log \frac{1 - err^{(m)}}{err^{(m)}} + \log (K - 1),
#
# where :math:`\alpha^{(m)}` and :math:`err^{(m)}` are the weight and the error
# of the :math:`m` th weak learner, respectively, and :math:`K` is the number of
# classes in our classification problem. |details-end|
#
# Another interesting observation boils down to the fact that the first weak
# learners of the model make fewer errors than later weak learners of the
# boosting chain.
#
# The intuition behind this observation is the following: due to the sample
# reweighting, later classifiers are forced to try to classify more difficult or
# noisy samples and to ignore already well classified samples. Therefore, the
# overall error on the training set will increase. That's why the weak learner's
# weights are built to counter-balance the worse performing weak learners.
|