File: plot_tree_regression.py

"""
========================
Decision Tree Regression
========================
In this example, we demonstrate the effect of changing the maximum depth of a
decision tree on how it fits the data. We perform this once on a 1D regression
task and once on a multi-output regression task.
"""

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

# %%
# Decision Tree on a 1D Regression Task
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Here we fit a tree on a 1D regression task.
#
# A :ref:`decision tree <tree>` is
# used to fit a sine curve with additional noisy observations. As a result, it
# learns local linear regressions approximating the sine curve.
#
# We can see that if the maximum depth of the tree (controlled by the
# `max_depth` parameter) is set too high, the tree learns overly fine
# details of the training data, fitting the noise, i.e. it overfits.
#
# Create a random 1D dataset
# --------------------------
import numpy as np

rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

# %%
# Fit regression model
# --------------------
# Here we fit two models with different maximum depths.
from sklearn.tree import DecisionTreeRegressor

regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)

# %%
# Predict
# -------
# Get predictions on a dense grid of test inputs spanning the data range
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

# %%
# Plot the results
# ----------------
import matplotlib.pyplot as plt

plt.figure()
plt.scatter(X, y, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

# %%
# As you can see, the model with a depth of 5 (yellow) learns the details of the
# training data to the point that it overfits to the noise. On the other hand,
# the model with a depth of 2 (blue) learns the major tendencies in the data well
# and does not overfit. In real use cases, you need to make sure that the tree
# is not overfitting the training data, which can be done using cross-validation.
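
# %%
# For instance, a minimal sketch of such a model selection step (not part of the
# original example) is to cross-validate a small grid of candidate depths and keep
# the best one; the candidate values and scoring metric below are assumptions
# chosen for this toy dataset.
from sklearn.model_selection import GridSearchCV

# Search over a few candidate depths using 5-fold cross-validation on the
# 1D dataset (X, y) created above.
param_grid = {"max_depth": [2, 3, 5, 8, None]}
grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
grid_search.fit(X, y)
print(
    "Best max_depth found by cross-validation:",
    grid_search.best_params_["max_depth"],
)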

# %%
# Decision Tree Regression with Multi-Output Targets
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Here a :ref:`decision tree <tree>`
# is used to simultaneously predict the noisy `x` and `y` observations of a circle
# given a single underlying feature. As a result, it learns local linear
# regressions approximating the circle.
#
# We can see that if the maximum depth of the tree (controlled by the
# `max_depth` parameter) is set too high, the tree learns overly fine
# details of the training data, fitting the noise, i.e. it overfits.

# %%
# Create a random dataset
# -----------------------
rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y[::5, :] += 0.5 - rng.rand(20, 2)

# %%
# Fit regression model
# --------------------
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_3 = DecisionTreeRegressor(max_depth=8)
regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)

# %%
# Predict
# -------
# Get predictions on a dense grid of test inputs spanning the data range
X_test = np.arange(-100.0, 100.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)

# %%
# Plot the results
# ----------------
plt.figure()
s = 25
plt.scatter(y[:, 0], y[:, 1], c="yellow", s=s, edgecolor="black", label="data")
plt.scatter(
    y_1[:, 0],
    y_1[:, 1],
    c="cornflowerblue",
    s=s,
    edgecolor="black",
    label="max_depth=2",
)
plt.scatter(y_2[:, 0], y_2[:, 1], c="red", s=s, edgecolor="black", label="max_depth=5")
plt.scatter(y_3[:, 0], y_3[:, 1], c="blue", s=s, edgecolor="black", label="max_depth=8")
plt.xlim([-6, 6])
plt.ylim([-6, 6])
plt.xlabel("target 1")
plt.ylabel("target 2")
plt.title("Multi-output Decision Tree Regression")
plt.legend(loc="best")
plt.show()

# %%
# As you can see, the higher the value of `max_depth`, the more details of the data
# the model captures. However, deeper trees also overfit the training data and are
# increasingly influenced by the noise.
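
# %%
# To quantify this beyond the visual impression, a minimal sketch (not part of the
# original example) is to hold out part of the data and compare the R^2 score of
# each tree on the training and held-out samples; the split size and random seed
# below are assumptions.
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
for depth in (2, 5, 8):
    tree = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    # `score` returns the coefficient of determination (R^2), averaged over the
    # two output dimensions for this multi-output task.
    print(
        f"max_depth={depth}: "
        f"train R^2 = {tree.score(X_train, y_train):.2f}, "
        f"held-out R^2 = {tree.score(X_val, y_val):.2f}"
    )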