File: plot_validation_curve.py

"""
==========================
Plotting Validation Curves
==========================

In this example, the impact of the `k_neighbors` parameter of
:class:`~imblearn.over_sampling.SMOTE` is examined. The plot shows the
validation scores of a SMOTE-CART classifier for different values of this
parameter.
"""

# Authors: Christos Aridas
#          Guillaume Lemaitre <g.lemaitre58@gmail.com>
# License: MIT

# %%
print(__doc__)

import seaborn as sns

sns.set_context("poster")


RANDOM_STATE = 42

# %% [markdown]
# Let's first generate a dataset with an imbalanced class distribution.

# %%
from sklearn.datasets import make_classification

X, y = make_classification(
    n_classes=2,
    class_sep=2,
    weights=[0.1, 0.9],
    n_informative=10,
    n_redundant=1,
    flip_y=0,
    n_features=20,
    n_clusters_per_class=4,
    n_samples=5000,
    random_state=RANDOM_STATE,
)
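
# %% [markdown]
# As a quick sanity check (a small addition to the original example, using
# only the standard library), we can confirm the class imbalance by counting
# the labels.

# %%
from collections import Counter

print(Counter(y))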

# %% [markdown]
# We will use an over-sampler :class:`~imblearn.over_sampling.SMOTE` followed
# by a :class:`~sklearn.tree.DecisionTreeClassifier`. The aim is to find
# which value of the `k_neighbors` parameter is the most adequate for the
# dataset that we generated.

# %%
from sklearn.tree import DecisionTreeClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

model = make_pipeline(
    SMOTE(random_state=RANDOM_STATE), DecisionTreeClassifier(random_state=RANDOM_STATE)
)
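
# %% [markdown]
# Note that :func:`~imblearn.pipeline.make_pipeline` names each step after
# its lowercased class name, so the SMOTE step is exposed as ``smote``. This
# is why the parameter is addressed as `smote__k_neighbors` below. As a small
# addition to the original example, the matching parameter names can be
# listed from the pipeline itself:

# %%
print([name for name in model.get_params() if "k_neighbors" in name])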

# %% [markdown]
# We can use :func:`~sklearn.model_selection.validation_curve` to inspect the
# impact of varying the parameter `k_neighbors`. Since the dataset is
# imbalanced, we need an adequate metric to evaluate the generalization
# performance during cross-validation: here we use Cohen's kappa, which
# corrects for chance agreement and is more informative than plain accuracy.

# %%
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import validation_curve

scorer = make_scorer(cohen_kappa_score)
param_range = range(1, 11)
train_scores, test_scores = validation_curve(
    model,
    X,
    y,
    param_name="smote__k_neighbors",
    param_range=param_range,
    cv=3,
    scoring=scorer,
)
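
# %% [markdown]
# `validation_curve` returns two arrays of shape `(n_params, n_cv_folds)`,
# here `(10, 3)`. Averaging over `axis=1` below therefore aggregates the
# cross-validation folds for each `k_neighbors` value (this check is a small
# addition to the original example).

# %%
print(train_scores.shape, test_scores.shape)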

# %%
train_scores_mean = train_scores.mean(axis=1)
train_scores_std = train_scores.std(axis=1)
test_scores_mean = test_scores.mean(axis=1)
test_scores_std = test_scores.std(axis=1)

# %% [markdown]
# We can now plot the results of the cross-validation for the different
# parameter values that we tried.

# %%
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 7))
ax.plot(param_range, test_scores_mean, label="SMOTE")
ax.fill_between(
    param_range,
    test_scores_mean + test_scores_std,
    test_scores_mean - test_scores_std,
    alpha=0.2,
)
idx_max = test_scores_mean.argmax()
ax.scatter(
    param_range[idx_max],
    test_scores_mean[idx_max],
    label=r"Cohen Kappa: ${:.2f}\pm{:.2f}$".format(
        test_scores_mean[idx_max], test_scores_std[idx_max]
    ),
)

fig.suptitle("Validation Curve with SMOTE-CART")
ax.set_xlabel("Number of neighbors")
ax.set_ylabel("Cohen's kappa")

# tidy up the plot
sns.despine(ax=ax, offset=10)
ax.set_xlim([1, 10])
ax.set_ylim([0.4, 0.8])
ax.legend(loc="lower right", fontsize=16)
plt.tight_layout()
plt.show()