1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182
|
.. _metropolis_hastings:
The Metropolis-Hastings Algorithm
---------------------------------
**Markov chain.** Considering a :math:`\sigma`-algebra :math:`\cA` on
:math:`\Omega`, a Markov chain is a process
:math:`{(X_k)}_{k\in\Nset}` such that
.. math::
\begin{aligned}
\forall{}(A,x_0,\ldots,x_{k-1})\in\cA\times\Omega^k
\quad \Prob{X_k\in A \,|\, X_0=x_0, \ldots, X_{k-1}=x_{k-1}}
= \Prob{X_k\in A \,|\, X_{k-1}=x_{k-1}}.
\end{aligned}
An example is the *random walk* for which
:math:`X_k = X_{k-1} + \varepsilon_k` where the steps
:math:`\varepsilon_k` are independent and identically distributed.
| **Transition kernel.** A transition kernel on :math:`(\Omega, \cA)` is
a mapping :math:`K: (\Omega, \cA) \rightarrow [0, 1]` such that
- For all :math:`A\in\cA, \quad K(., A)` is measurable;
- For all :math:`x\in\Omega, \quad K(x, .)` is a probability distribution on :math:`(\Omega, \cA)`.
The kernel :math:`K` has density :math:`k` if
:math:`\forall(x,A)\in\Omega\times\cA \quad K(x, A) = \displaystyle\int_A \: k(x, y) \mbox{d}y`.
| :math:`{(X_k)}_{k\in\Nset}` is a homogeneous Markov Chain of
transition :math:`K` if
:math:`\forall(A,x)\in\cA\times\Omega \quad \Prob{X_k\in{}A | X_{k-1}=x} = K(x, A)`.
| **Some Notations.** Let :math:`{(X_k)}_{k\in\Nset}` be a homogeneous
Markov Chain of transition :math:`K` on :math:`(\Omega, \cA)` with
initial distribution :math:`\nu` (that is :math:`X_0 \sim \nu`):
- :math:`K_\nu` denotes the probability distribution of the Markov
Chain :math:`{(X_k)}_{k\in\Nset}`;
- :math:`\nu{}K^k` denotes the probability distribution of :math:`X_k`
(:math:`X_k \sim \nu{}K^k`);
- | :math:`K^k` denotes the mapping defined by
:math:`K^k(x,A) = \Prob{X_k\in{}A|X_0=x}` for all
:math:`(x,A)\in\Omega\times\cA`.
| **Total variation convergence.** A Markov Chain of distribution
:math:`K_\nu` is said to converge in total variation distance towards
the distribution :math:`t` if
.. math::
\begin{aligned}
\lim_{k\to+\infty} \sup_{A\in\cA} \left|
\nu{}K^k(A) - t(A)
\right| = 0.
\end{aligned}
Then the notation used here is :math:`\nu{}K^k \rightarrow_{TV} t`.
| **Some interesting properties.** Let :math:`t` be a (target)
distribution on :math:`(\Omega, \cA)`, then a transition kernel
:math:`K` is said to be:
- :math:`t`-invariant if :math:`t{}K = t`;
- :math:`t`-irreducible if, :math:`\forall(A,x)\in\Omega\times\cA` such
that :math:`t(A)>0`, :math:`\exists{}k\in\cN^* \quad {}K^k(x, A) > 0`
holds.
Markov Chain Monte-Carlo techniques allows one to sample and integrate
according to a distribution :math:`t` which is only known up to a
multiplicative constant. This situation is common in Bayesian statistics
where the “target” distribution, the posterior one
:math:`t(\vect{\theta})=\pi(\vect{\theta} | \mat{y})`, is proportional
to the product of prior and likelihood: see equation :eq:`postdistribution`.
In particular, given a “target” distribution :math:`t` and a
:math:`t`-irreducible kernel transition :math:`Q`, the
Metropolis-Hastings algorithm produces a Markov chain
:math:`{(X_k)}_{k\in\Nset}` of distribution :math:`K_\nu` with the
following properties:
- the transition kernel of the Markov chain is :math:`t`-invariant;
- :math:`\nu{}K^k \rightarrow_{TV} t`;
- the Markov chain satisfies the *ergodic theorem*: let :math:`\phi` be
a real-valued function such that
:math:`\mathbb{E}_{X\sim{}t}\left[ |\phi(X)| \right] <+\infty`, then, whatever the
initial distribution :math:`\nu` is:
.. math::
\begin{aligned}
\displaystyle\frac{1}{n} \sum_{k=1}^n \: \phi(X_k) \tendto{k}{+\infty} \mathbb{E}_{X\sim{}t}\left[ |\phi(X)| \right]
\mbox{ almost surely}.
\end{aligned}
In that sense, simulating :math:`{(X_k)}_{k\in\Nset}` amounts to
sampling according to :math:`t` and can be used to integrate relatively
to the probability measure :math:`t`. Let us remark that the ergodic
theorem implies that
:math:`\displaystyle\frac{1}{n} \sum_{k=1}^n \: \fcar{A}{X_k} \tendto{k}{+\infty} \ProbCond{X\sim{}t}{X\in{}A}` almost surely.
By abusing the notation, :math:`t(x)` represents, in the remainder of
this section, a function of :math:`x` which is proportional to the PDF
of the target distribution :math:`t`. Given a transition kernel
:math:`Q` of density :math:`q`, the scheme of the Metropolis-Hastings
algorithm is the following (lower case letters are used hereafter for
both random variables and realizations as usual in the Bayesian
literature):
0)
Draw :math:`x_0 \sim \nu` and set :math:`k = 1`.
1)
Draw a candidate for :math:`x_k` according to the given transition
kernel :math:`Q`: :math:`\tilde{x} \sim Q(x_{k-1}, .)`.
2)
Compute the ratio
:math:`\rho = \displaystyle\frac{t(\tilde{x})/q(x_{k-1},\tilde{x})} {t(x_{k-1})/q(\tilde{x},x_{k-1})}`.
3)
Draw :math:`u \sim \cU([0, 1])`; if :math:`u \leq \rho` then set
:math:`x_k = \tilde{x}`, otherwise set :math:`x_k = x_{k-1}`.
4)
Set :math:`k=k+1` and go back to 1).
Of course, if :math:`t` is replaced by a different function of :math:`x`
which is proportional to it, the algorithm keeps unchanged, since
:math:`t` only takes part in the latter in the ratio
:math:`\frac{t(\tilde{x})}{t(x_{k-1})}`. Moreover, if :math:`Q` proposes
some candidates in a uniform manner (constant density :math:`q`), the
candidate :math:`\tilde{x}` is accepted according to a ratio
:math:`\rho` which reduces to the previous “natural” ratio
:math:`\frac{t(\tilde{x})}{t(x_{k-1})}` of PDF. The introduction of
:math:`q` in the ratio :math:`\rho` prevents from the bias of a
non-uniform proposition of candidates which would favor some areas of
:math:`\Omega`.
The :math:`t`-invariance is ensured by the symmetry of the expression of
:math:`\rho` (:math:`t`-reversibility).
In practice, :math:`Q` is specified as a random walk
(:math:`\exists{}q_{RW}` such that :math:`q(x,y)=q_{RW}(x-y)`) or as a
independent sampling (:math:`\exists{}q_{IS}` such that
:math:`q(x,y)=q_{IS}(y)`), or as a mixture of random walk and
independent sampling.
| The important property the practitioner have to keep in mind when
choosing the transition kernel :math:`Q` is the
:math:`t`-irreducibility. Moreover, for efficient convergence,
:math:`Q` has to be chosen so as to explore quickly the whole support
of :math:`t` without conducting to a too small acceptance ratio (the
ratio of accepted candidates :math:`\tilde{x}` ). It is usually
recommended that this latter ratio is about :math:`0.2` but such a
ratio is neither a warranty of efficiency, nor a substitute to a
convergence diagnosis.
.. topic:: API:
- See :class:`~openturns.MetropolisHastings`
- See :class:`~openturns.Gibbs`
- See :class:`~openturns.RandomWalkMetropolisHastings`
- See :class:`~openturns.IndependentMetropolisHastings`
.. topic:: Examples:
- See :doc:`/auto_calibration/bayesian_calibration/plot_bayesian_calibration`
- See :doc:`/auto_calibration/bayesian_calibration/plot_bayesian_calibration_flooding`
- See :doc:`/auto_calibration/bayesian_calibration/plot_rwmh_python_distribution`
- See :doc:`/auto_calibration/bayesian_calibration/plot_imh_python_distribution`
.. topic:: References:
- Robert, C.P. and Casella, G. (2004). *Monte Carlo Statistical Methods* (Second Edition), Springer.
- Meyn, S. and Tweedie R.L. (2009). *Markov Chains and Stochastic Stability* (Second Edition), Cambridge University Press.
|