1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186
|
.. _sampling-hinv:
Hermite interpolation based INVersion of CDF (HINV)
===================================================
.. currentmodule:: scipy.stats.sampling
* Required: CDF
* Optional: PDF, dPDF
* Speed:
* Set-up: (very) slow
* Sampling: (very) fast
HINV is a variant of numerical inversion, where the inverse CDF is approximated using
Hermite interpolation, i.e., the interval [0,1] is split into several intervals and
in each interval the inverse CDF is approximated by polynomials constructed by means
of values of the CDF and PDF at interval boundaries. This makes it possible to improve
the accuracy by splitting a particular interval without recomputations in unaffected
intervals. Three types of splines are implemented: linear, cubic, and quintic
interpolation. For linear interpolation only the CDF is required. Cubic interpolation
also requires PDF and quintic interpolation PDF and its derivative.
These splines have to be computed in a setup step. However, it only works for
distributions with bounded domain; for distributions with unbounded domain the tails
are chopped off such that the probability for the tail regions is small compared to
the given u-resolution.
The method is not exact, as it only produces random variates of the approximated
distribution. Nevertheless, the maximal numerical error in "u-direction" (i.e.
``|U - CDF(X)|`` where ``X`` is the approximate percentile corresponding to the
quantile ``U`` i.e. ``X = approx_ppf(U)``) can be set to the
required resolution (within machine precision). Notice that very small values of
the u-resolution are possible but may increase the cost for the setup step.
`NumericalInverseHermite` approximates the inverse of a continuous
statistical distribution's CDF with a Hermite spline. Order of the
hermite spline can be specified by passing the `order` parameter.
As described in [1]_, it begins by evaluating the distribution's PDF and
CDF at a mesh of quantiles ``x`` within the distribution's support.
It uses the results to fit a Hermite spline ``H`` such that
``H(p) == x``, where ``p`` is the array of percentiles corresponding
with the quantiles ``x``. Therefore, the spline approximates the inverse
of the distribution's CDF to machine precision at the percentiles ``p``,
but typically, the spline will not be as accurate at the midpoints between
the percentile points::
p_mid = (p[:-1] + p[1:])/2
so the mesh of quantiles is refined as needed to reduce the maximum
"u-error"::
u_error = np.max(np.abs(dist.cdf(H(p_mid)) - p_mid))
below the specified tolerance `u_resolution`. Refinement stops when the required
tolerance is achieved or when the number of mesh intervals after the next
refinement could exceed the maximum allowed number of intervals (100000).
>>> import numpy as np
>>> from scipy.stats.sampling import NumericalInverseHermite
>>> from scipy.stats import norm, genexpon
>>> from scipy.special import ndtr
To create a generator to sample from the standard normal distribution, do:
>>> class StandardNormal:
... def pdf(self, x):
... return 1/np.sqrt(2*np.pi) * np.exp(-x**2 / 2)
... def cdf(self, x):
... return ndtr(x)
...
>>> dist = StandardNormal()
>>> urng = np.random.default_rng()
>>> rng = NumericalInverseHermite(dist, random_state=urng)
The `NumericalInverseHermite` has a method that approximates the PPF of the
distribution.
>>> rng = NumericalInverseHermite(dist)
>>> p = np.linspace(0.01, 0.99, 99) # percentiles from 1% to 99%
>>> np.allclose(rng.ppf(p), norm.ppf(p))
True
Depending on the implementation of the distribution's random sampling
method, the random variates generated may be nearly identical, given
the same random state.
>>> dist = genexpon(9, 16, 3)
>>> rng = NumericalInverseHermite(dist)
>>> # `seed` ensures identical random streams are used by each `rvs` method
>>> seed = 500072020
>>> rvs1 = dist.rvs(size=100, random_state=np.random.default_rng(seed))
>>> rvs2 = rng.rvs(size=100, random_state=np.random.default_rng(seed))
>>> np.allclose(rvs1, rvs2)
True
To check that the random variates closely follow the given distribution, we can
look at its histogram:
>>> import matplotlib.pyplot as plt
>>> dist = StandardNormal()
>>> urng = np.random.default_rng()
>>> rng = NumericalInverseHermite(dist, random_state=urng)
>>> rvs = rng.rvs(10000)
>>> x = np.linspace(rvs.min()-0.1, rvs.max()+0.1, 1000)
>>> fx = norm.pdf(x)
>>> plt.plot(x, fx, 'r-', lw=2, label='true distribution')
>>> plt.hist(rvs, bins=20, density=True, alpha=0.8, label='random variates')
>>> plt.xlabel('x')
>>> plt.ylabel('PDF(x)')
>>> plt.title('Numerical Inverse Hermite Samples')
>>> plt.legend()
>>> plt.show()
.. plot:: tutorial/stats/plots/hinv_plot.py
:align: center
:alt: " "
:include-source: 0
Given the derivative of the PDF w.r.t the variate (i.e. ``x``), we can use
quintic Hermite interpolation to approximate the PPF by passing the `order`
parameter:
>>> class StandardNormal:
... def pdf(self, x):
... return 1/np.sqrt(2*np.pi) * np.exp(-x**2 / 2)
... def dpdf(self, x):
... return -1/np.sqrt(2*np.pi) * x * np.exp(-x**2 / 2)
... def cdf(self, x):
... return ndtr(x)
...
>>> dist = StandardNormal()
>>> urng = np.random.default_rng()
>>> rng = NumericalInverseHermite(dist, order=5, random_state=urng)
Higher orders result in a fewer number of intervals:
>>> rng3 = NumericalInverseHermite(dist, order=3)
>>> rng5 = NumericalInverseHermite(dist, order=5)
>>> rng3.intervals, rng5.intervals
(3000, 522)
The u-error can be estimated by calling the `u_error` method. It runs a small
Monte-Carlo simulation to estimate the u-error. By default, 100,000 samples are
used. This can be changed by passing the `sample_size` argument:
>>> rng1 = NumericalInverseHermite(dist, u_resolution=1e-10)
>>> rng1.u_error(sample_size=1000000) # uses one million samples
UError(max_error=9.53167544892608e-11, mean_absolute_error=2.2450136432146864e-11)
This returns a namedtuple which contains the maximum u-error and the mean
absolute u-error.
The u-error can be reduced by decreasing the u-resolution (maximum allowed u-error):
>>> rng2 = NumericalInverseHermite(dist, u_resolution=1e-13)
>>> rng2.u_error(sample_size=1000000)
UError(max_error=9.32027892364129e-14, mean_absolute_error=1.5194172675685075e-14)
Note that this comes at the cost of computation time as a result of the increased
setup time and number of intervals.
>>> rng1.intervals
1022
>>> rng2.intervals
5687
>>> from timeit import timeit
>>> f = lambda: NumericalInverseHermite(dist, u_resolution=1e-10)
>>> timeit(f, number=1)
0.017409582000254886 # may vary
>>> f = lambda: NumericalInverseHermite(dist, u_resolution=1e-13)
>>> timeit(f, number=1)
0.08671202100003939 # may vary
See [1]_ and [2]_ for more details on this method.
References
----------
.. [1] Hörmann, Wolfgang, and Josef Leydold. "Continuous random variate
generation by fast numerical inversion." ACM Transactions on
Modeling and Computer Simulation (TOMACS) 13.4 (2003): 347-362.
.. [2] UNU.RAN reference manual, Section 5.3.5,
"HINV - Hermite interpolation based INVersion of CDF",
https://statmath.wu.ac.at/software/unuran/doc/unuran.html#HINV
|