Kernel density estimation
-------------------------
A common task in statistics is to estimate the probability density function
(PDF) of a random variable from a set of data samples. This task is called
density estimation. The best-known tool for it is the histogram. A histogram
is useful for visualization (mainly because everyone understands it), but it
doesn't use the available data very efficiently. Kernel density estimation
(KDE) is a more efficient tool for the same task. The
:class:`scipy.stats.gaussian_kde` estimator can be used to estimate the PDF of
univariate as well as multivariate data. It works best for unimodal
distributions; bimodal or multimodal distributions tend to be oversmoothed.

Univariate estimation
^^^^^^^^^^^^^^^^^^^^^
We start with a minimal amount of data in order to see how
:class:`scipy.stats.gaussian_kde` works and what the different options for
bandwidth selection do. The data sampled from the PDF are shown as blue plus
signs at the bottom of the figure (this is called a rug plot):

.. plot::
   :alt: " "

   >>> import numpy as np
   >>> from scipy import stats
   >>> import matplotlib.pyplot as plt
   >>> x1 = np.array([-7, -5, 1, 4, 5], dtype=np.float64)
   >>> kde1 = stats.gaussian_kde(x1)
   >>> kde2 = stats.gaussian_kde(x1, bw_method='silverman')
   >>> fig = plt.figure()
   >>> ax = fig.add_subplot(111)
   >>> ax.plot(x1, np.zeros(x1.shape), 'b+', ms=20)  # rug plot
   >>> x_eval = np.linspace(-10, 10, num=200)
   >>> ax.plot(x_eval, kde1(x_eval), 'k-', label="Scott's Rule")
   >>> ax.plot(x_eval, kde2(x_eval), 'r-', label="Silverman's Rule")
   >>> plt.show()

We see that there is very little difference between Scott's Rule and
Silverman's Rule, and that the bandwidth selected with such a limited amount
of data is probably a bit too wide. We can define our own bandwidth function
to get a less smoothed-out result.

>>> def my_kde_bandwidth(obj, fac=1./5):
...     """We use Scott's Rule, multiplied by a constant factor."""
...     return np.power(obj.n, -1./(obj.d+4)) * fac

>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.plot(x1, np.zeros(x1.shape), 'b+', ms=20)  # rug plot
>>> kde3 = stats.gaussian_kde(x1, bw_method=my_kde_bandwidth)
>>> ax.plot(x_eval, kde3(x_eval), 'g-', label="With smaller BW")
>>> plt.show()

.. plot:: tutorial/stats/plots/kde_plot2.py
   :align: center
   :alt: " "
   :include-source: 0

We see that if we set the bandwidth to be very narrow, the obtained PDF
estimate is simply a sum of Gaussians placed at each data point.

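As a quick check (an aside, not part of the original example), we can
reproduce this by hand: ``gaussian_kde`` centers a Gaussian of standard
deviation ``kde.factor * std(data, ddof=1)`` on every data point and averages
them. The sketch below reuses the ``x1``, ``x_eval``, and ``kde3`` objects
defined above.

>>> h = kde3.factor * x1.std(ddof=1)  # kernel standard deviation
>>> by_hand = stats.norm.pdf(x_eval[:, np.newaxis], loc=x1, scale=h).mean(axis=1)
>>> np.allclose(by_hand, kde3(x_eval))
True
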
We now take a more realistic example and look at the difference between the
two available bandwidth selection rules. These rules are known to work well
for (close to) normal distributions, but they also work reasonably well for
unimodal distributions that are strongly non-normal. As a non-normal
distribution we take a Student's t distribution with 5 degrees of freedom.

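For reference, the two rules reduce to simple closed-form bandwidth factors
(``n`` is the number of data points and ``d`` the number of dimensions); these
are what ``gaussian_kde`` computes in its ``scotts_factor`` and
``silverman_factor`` methods:

>>> n, d = 100, 1
>>> scott = n ** (-1. / (d + 4))                       # Scott's factor
>>> silverman = (n * (d + 2) / 4.) ** (-1. / (d + 4))  # Silverman's factor
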
.. plot:: tutorial/stats/plots/kde_plot3.py
   :align: center
   :alt: " "
   :include-source: 1

We now take a look at a bimodal distribution with one wider and one narrower
Gaussian feature. We expect that this will be a more difficult density to
approximate, due to the different bandwidths required to accurately resolve
each feature.

>>> from functools import partial
>>> loc1, scale1, size1 = (-2, 1, 175)
>>> loc2, scale2, size2 = (2, 0.2, 50)
>>> x2 = np.concatenate([np.random.normal(loc=loc1, scale=scale1, size=size1),
...                      np.random.normal(loc=loc2, scale=scale2, size=size2)])
>>> x_eval = np.linspace(x2.min() - 1, x2.max() + 1, 500)
>>> kde = stats.gaussian_kde(x2)
>>> kde2 = stats.gaussian_kde(x2, bw_method='silverman')
>>> kde3 = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=0.2))
>>> kde4 = stats.gaussian_kde(x2, bw_method=partial(my_kde_bandwidth, fac=0.5))
>>> pdf = stats.norm.pdf
>>> bimodal_pdf = pdf(x_eval, loc=loc1, scale=scale1) * float(size1) / x2.size + \
...               pdf(x_eval, loc=loc2, scale=scale2) * float(size2) / x2.size
>>> fig = plt.figure(figsize=(8, 6))
>>> ax = fig.add_subplot(111)
>>> ax.plot(x2, np.zeros(x2.shape), 'b+', ms=12)
>>> ax.plot(x_eval, kde(x_eval), 'k-', label="Scott's Rule")
>>> ax.plot(x_eval, kde2(x_eval), 'b-', label="Silverman's Rule")
>>> ax.plot(x_eval, kde3(x_eval), 'g-', label="Scott * 0.2")
>>> ax.plot(x_eval, kde4(x_eval), 'c-', label="Scott * 0.5")
>>> ax.plot(x_eval, bimodal_pdf, 'r--', label="Actual PDF")
>>> ax.set_xlim([x_eval.min(), x_eval.max()])
>>> ax.legend(loc=2)
>>> ax.set_xlabel('x')
>>> ax.set_ylabel('Density')
>>> plt.show()

.. plot:: tutorial/stats/plots/kde_plot4.py
   :align: center
   :alt: " "
   :include-source: 0

As expected, the KDE is not as close to the true PDF as we would like, due to
the different characteristic sizes of the two features of the bimodal
distribution. By halving the default bandwidth (``Scott * 0.5``) we can do
somewhat better, while using a bandwidth five times smaller than the default
doesn't smooth enough. What we really need in this case, though, is a
non-uniform (adaptive) bandwidth.

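``gaussian_kde`` only supports a single global bandwidth, so an adaptive
estimator has to be written by hand. Below is a minimal sketch of an
Abramson-style variable-bandwidth estimator; the ``adaptive_kde`` helper is
purely illustrative and not part of SciPy:

>>> def adaptive_kde(data, x_eval, alpha=0.5):
...     """Variable-bandwidth KDE: narrower kernels where the density is high."""
...     pilot = stats.gaussian_kde(data)        # fixed-bandwidth pilot estimate
...     f_pilot = pilot(data)                   # pilot density at the data points
...     g = np.exp(np.log(f_pilot).mean())      # geometric mean of pilot densities
...     h = pilot.factor * data.std(ddof=1)     # global kernel standard deviation
...     h_i = h * (f_pilot / g) ** (-alpha)     # per-point local bandwidths
...     return stats.norm.pdf(x_eval[:, np.newaxis],
...                           loc=data, scale=h_i).mean(axis=1)

>>> pdf_adaptive = adaptive_kde(x2, x_eval)

Each kernel still integrates to one, so the average remains a valid density;
points under the narrow peak get small local bandwidths, while points in the
tails get wide ones.
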
Multivariate estimation
^^^^^^^^^^^^^^^^^^^^^^^
With :class:`scipy.stats.gaussian_kde` we can perform multivariate as well as
univariate estimation. We demonstrate the bivariate case. First, we generate
some random data with a model in which the two variates are correlated.

>>> def measure(n):
...     """Measurement model, return two coupled measurements."""
...     m1 = np.random.normal(size=n)
...     m2 = np.random.normal(scale=0.5, size=n)
...     return m1+m2, m1-m2
>>> m1, m2 = measure(2000)
>>> xmin = m1.min()
>>> xmax = m1.max()
>>> ymin = m2.min()
>>> ymax = m2.max()

Then we apply the KDE to the data:

>>> X, Y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
>>> positions = np.vstack([X.ravel(), Y.ravel()])
>>> values = np.vstack([m1, m2])
>>> kernel = stats.gaussian_kde(values)
>>> Z = np.reshape(kernel.evaluate(positions).T, X.shape)
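
As a sanity check (an aside, not part of the original example), most of the
probability mass of the estimate should fall inside the data's bounding box;
``gaussian_kde`` exposes this integral through its ``integrate_box`` method:

>>> mass = kernel.integrate_box([xmin, ymin], [xmax, ymax])  # close to 1.0
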
Finally, we plot the estimated bivariate distribution as a colormap and
overlay the individual data points.

>>> fig = plt.figure(figsize=(8, 6))
>>> ax = fig.add_subplot(111)
>>> ax.imshow(np.rot90(Z), cmap=plt.cm.gist_earth_r,
... extent=[xmin, xmax, ymin, ymax])
>>> ax.plot(m1, m2, 'k.', markersize=2)
>>> ax.set_xlim([xmin, xmax])
>>> ax.set_ylim([ymin, ymax])
>>> plt.show()

.. plot:: tutorial/stats/plots/kde_plot5.py
   :align: center
   :alt: "An X-Y plot showing a random scattering of points around a 2-D gaussian. The distribution has a semi-major axis at 45 degrees with a semi-minor axis about half as large. Each point in the plot is highlighted with the outer region in red, then yellow, then green, with the center in blue."
   :include-source: 0
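
Beyond evaluating the density, the fitted kernel can also draw new samples
from the estimated distribution via its ``resample`` method (an addition to
the original example; the ``seed`` argument is accepted by recent SciPy
versions):

>>> new_samples = kernel.resample(size=500, seed=12345)
>>> new_samples.shape
(2, 500)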