File: kernel_density_estimation.rst

Kernel density estimation
-------------------------

A common task in statistics is to estimate the probability density function
(PDF) of a random variable from a set of data samples. This task is called
density estimation. The most well-known tool to do this is the histogram.
A histogram is a useful tool for visualization (mainly because everyone
understands it), but doesn't use the available data very efficiently. Kernel
density estimation (KDE) is a more efficient tool for the same task. The
:class:`scipy.stats.gaussian_kde` estimator can be used to estimate the PDF of
univariate as well as multivariate data. It works best if the data is unimodal.


Univariate estimation
^^^^^^^^^^^^^^^^^^^^^

We start with a minimal amount of data in order to see how
:class:`scipy.stats.gaussian_kde` works and what the different options for
bandwidth selection do. The data sampled from the PDF are shown as blue plus
signs at the bottom of the figure (this is called a rug plot):

.. plot::
    :alt: " "

    >>> import numpy as np
    >>> from scipy import stats
    >>> import matplotlib.pyplot as plt

    >>> x1 = np.array([-7, -5, 1, 4, 5], dtype=np.float64)
    >>> kde1 = stats.gaussian_kde(x1)
    >>> kde2 = stats.gaussian_kde(x1, bw_method='silverman')

    >>> fig = plt.figure()
    >>> ax = fig.add_subplot(111)

    >>> ax.plot(x1, np.zeros(x1.shape), 'b+', ms=20)  # rug plot
    >>> x_eval = np.linspace(-10, 10, num=200)
    >>> ax.plot(x_eval, kde1(x_eval), 'k-', label="Scott's Rule")
    >>> ax.plot(x_eval, kde2(x_eval), 'r-', label="Silverman's Rule")

    >>> ax.legend()
    >>> plt.show()

We see that there is very little difference between Scott's Rule and
Silverman's Rule, and that with such a limited amount of data the selected
bandwidth is probably a bit too wide.
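
That the two curves nearly coincide is no accident. Both rules compute a
simple scaling factor (the kernel covariance is the data covariance
multiplied by the square of this factor); the formulas, matching the
``scotts_factor`` and ``silverman_factor`` methods of
:class:`scipy.stats.gaussian_kde`, are

.. math::

    h_\text{Scott} = n^{-1/(d+4)}, \qquad
    h_\text{Silverman} = \left( \frac{n (d + 2)}{4} \right)^{-1/(d+4)},

where :math:`n` is the number of data points and :math:`d` the number of
dimensions. For :math:`d = 1` the two factors differ only by the constant
:math:`(3/4)^{-1/5} \approx 1.06`. We can define our own bandwidth function
to get a less smoothed-out result: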

    >>> def my_kde_bandwidth(obj, fac=1./5):
    ...     """We use Scott's Rule, multiplied by a constant factor."""
    ...     return np.power(obj.n, -1./(obj.d+4)) * fac

    >>> fig = plt.figure()
    >>> ax = fig.add_subplot(111)

    >>> ax.plot(x1, np.zeros(x1.shape), 'b+', ms=20)  # rug plot
    >>> kde3 = stats.gaussian_kde(x1, bw_method=my_kde_bandwidth)
    >>> ax.plot(x_eval, kde3(x_eval), 'g-', label="With smaller BW")

    >>> ax.legend()
    >>> plt.show()

.. plot:: tutorial/stats/plots/kde_plot2.py
   :align: center
   :alt: " "
   :include-source: 0

We see that if we set the bandwidth to be very narrow, the obtained estimate
of the probability density function (PDF) is simply a sum of narrow Gaussians
centered on each data point.
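
To make this claim concrete, here is a small numerical check (not part of the
original figure code): a :class:`scipy.stats.gaussian_kde` built with a fixed
scalar bandwidth factor equals an explicit average of Gaussians centered on
the data points. The factor ``0.1`` is an arbitrary illustrative choice.

    >>> kde_narrow = stats.gaussian_kde(x1, bw_method=0.1)
    >>> h = kde_narrow.factor * x1.std(ddof=1)   # kernel standard deviation
    >>> manual = stats.norm.pdf(x_eval[:, None], loc=x1, scale=h).mean(axis=1)
    >>> np.allclose(kde_narrow(x_eval), manual)
    True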

We now take a more realistic example and look at the difference between the
two available bandwidth selection rules. Those rules are known to work well
for (close to) normal distributions, but they also work reasonably well for
unimodal distributions that are quite strongly non-normal. As a non-normal
distribution we take a Student's t distribution with 5 degrees of freedom.

.. plot:: tutorial/stats/plots/kde_plot3.py
   :align: center
   :alt: " "
   :include-source: 1
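
For readers running the code outside the rendered documentation, a condensed
version of the same comparison might look like the following; the sample size
``500`` and the seed are illustrative choices and need not match the plotting
script above.

    >>> rng = np.random.default_rng(123456)   # illustrative seed
    >>> x_t = stats.t.rvs(df=5, size=500, random_state=rng)
    >>> grid = np.linspace(-8, 8, 400)

    >>> kde_scott = stats.gaussian_kde(x_t)   # default: Scott's Rule
    >>> kde_silver = stats.gaussian_kde(x_t, bw_method='silverman')

    >>> fig = plt.figure()
    >>> ax = fig.add_subplot(111)
    >>> ax.plot(grid, stats.t.pdf(grid, df=5), 'r--', label="True PDF")
    >>> ax.plot(grid, kde_scott(grid), 'k-', label="Scott's Rule")
    >>> ax.plot(grid, kde_silver(grid), 'b-', label="Silverman's Rule")
    >>> ax.legend(loc=2)
    >>> plt.show()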

We now take a look at a bimodal distribution with one wider and one narrower
Gaussian feature. We expect that this will be a more difficult density to
approximate, due to the different bandwidths required to accurately resolve
each feature.

    >>> from functools import partial

    >>> loc1, scale1, size1 = (-2, 1, 175)
    >>> loc2, scale2, size2 = (2, 0.2, 50)
    >>> x2 = np.concatenate([np.random.normal(loc=loc1, scale=scale1, size=size1),
    ...                      np.random.normal(loc=loc2, scale=scale2, size=size2)])

    >>> x_eval = np.linspace(x2.min() - 1, x2.max() + 1, 500)

    >>> kde = stats.gaussian_kde(x2)
    >>> kde2 = stats.gaussian_kde(x2, bw_method='silverman')
    >>> kde3 = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=0.2))
    >>> kde4 = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=0.5))

    >>> pdf = stats.norm.pdf
    >>> bimodal_pdf = pdf(x_eval, loc=loc1, scale=scale1) * float(size1) / x2.size + \
    ...               pdf(x_eval, loc=loc2, scale=scale2) * float(size2) / x2.size

    >>> fig = plt.figure(figsize=(8, 6))
    >>> ax = fig.add_subplot(111)

    >>> ax.plot(x2, np.zeros(x2.shape), 'b+', ms=12)
    >>> ax.plot(x_eval, kde(x_eval), 'k-', label="Scott's Rule")
    >>> ax.plot(x_eval, kde2(x_eval), 'b-', label="Silverman's Rule")
    >>> ax.plot(x_eval, kde3(x_eval), 'g-', label="Scott * 0.2")
    >>> ax.plot(x_eval, kde4(x_eval), 'c-', label="Scott * 0.5")
    >>> ax.plot(x_eval, bimodal_pdf, 'r--', label="Actual PDF")

    >>> ax.set_xlim([x_eval.min(), x_eval.max()])
    >>> ax.legend(loc=2)
    >>> ax.set_xlabel('x')
    >>> ax.set_ylabel('Density')
    >>> plt.show()

.. plot:: tutorial/stats/plots/kde_plot4.py
   :align: center
   :alt: " "
   :include-source: 0

As expected, the KDE is not as close to the true PDF as we would like, due to
the different characteristic sizes of the two features of the bimodal
distribution. By halving the default bandwidth (``Scott * 0.5``) we can do
somewhat better, while a bandwidth 5 times smaller than the default
(``Scott * 0.2``) doesn't smooth enough. What we really need in this case,
though, is a non-uniform (adaptive) bandwidth.
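
:class:`scipy.stats.gaussian_kde` itself supports only a single, global
bandwidth, but the idea is easy to sketch by hand. Below is a minimal
illustration of Abramson's square-root rule, in which each data point
receives a bandwidth inversely proportional to the square root of a pilot
density estimate; the helper ``adaptive_kde`` is our own illustrative code,
not a scipy API.

    >>> def adaptive_kde(data, grid):
    ...     """Variable-bandwidth KDE following Abramson's square-root rule."""
    ...     pilot = stats.gaussian_kde(data)     # fixed-bandwidth pilot estimate
    ...     f = pilot(data)                      # pilot density at the data points
    ...     g = np.exp(np.log(f).mean())         # geometric mean of the pilot
    ...     local = np.sqrt(g / f)               # per-point bandwidth scaling
    ...     h = pilot.factor * data.std(ddof=1)  # global kernel width
    ...     # average of Gaussians with point-dependent widths
    ...     return stats.norm.pdf(grid[:, None], loc=data,
    ...                           scale=h * local).mean(axis=1)

    >>> fig = plt.figure(figsize=(8, 6))
    >>> ax = fig.add_subplot(111)
    >>> ax.plot(x_eval, adaptive_kde(x2, x_eval), 'm-', label="Adaptive (sketch)")
    >>> ax.plot(x_eval, bimodal_pdf, 'r--', label="Actual PDF")
    >>> ax.legend(loc=2)
    >>> plt.show()

With per-point bandwidths, the narrow peak can typically be resolved without
undersmoothing the broad one.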


Multivariate estimation
^^^^^^^^^^^^^^^^^^^^^^^

With :class:`scipy.stats.gaussian_kde` we can perform multivariate as well
as univariate estimation. We demonstrate the bivariate case. First, we
generate some random data with a model in which the two variates are
correlated.

    >>> def measure(n):
    ...     """Measurement model, return two coupled measurements."""
    ...     m1 = np.random.normal(size=n)
    ...     m2 = np.random.normal(scale=0.5, size=n)
    ...     return m1+m2, m1-m2

    >>> m1, m2 = measure(2000)
    >>> xmin = m1.min()
    >>> xmax = m1.max()
    >>> ymin = m2.min()
    >>> ymax = m2.max()

Then we apply the KDE to the data:

    >>> X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]      # regular 100x100 grid
    >>> positions = np.vstack([X.ravel(), Y.ravel()])        # grid as (2, 10000) points
    >>> values = np.vstack([m1, m2])                         # data as (2, 2000) points
    >>> kernel = stats.gaussian_kde(values)
    >>> Z = np.reshape(kernel.evaluate(positions), X.shape)  # density back on the grid
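
The fitted ``kernel`` is a full density model, not just a plotting aid. As a
quick illustration going beyond the original example, the ``integrate_box``
and ``resample`` methods of :class:`scipy.stats.gaussian_kde` compute the
probability mass the estimate assigns to a box and draw new samples from the
estimated density:

    >>> # probability mass of the estimate inside [-1, 1] x [-1, 1]
    >>> mass = kernel.integrate_box([-1, -1], [1, 1])

    >>> # draw 500 new points from the estimated density; shape is (2, 500)
    >>> new_points = kernel.resample(500)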

Finally, we show the estimated bivariate distribution as a color map and
overlay the individual data points.

    >>> fig = plt.figure(figsize=(8, 6))
    >>> ax = fig.add_subplot(111)

    >>> ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
    ...           extent=[xmin, xmax, ymin, ymax])
    >>> ax.plot(m1, m2, 'k.', markersize=2)

    >>> ax.set_xlim([xmin, xmax])
    >>> ax.set_ylim([ymin, ymax])

    >>> plt.show()

.. plot:: tutorial/stats/plots/kde_plot5.py
   :align: center
   :alt: "An X-Y plot showing a random scattering of points around a 2-D gaussian. The distribution has a semi-major axis at 45 degrees with a semi-minor axis about half as large. Each point in the plot is highlighted with the outer region in red, then yellow, then green, with the center in blue. "
   :include-source: 0