# Fitting Poisson data
# ====================
#
# Data from poisson processes, such as the number of counts per unit time
# or counts per unit area, do not have the same pattern of uncertainties
# as data from gaussian processes. Poisson data consists of natural
# numbers occurring at some underlying rate. The fitting process checks
# if the number of counts observed is consistent with the proposed rate
# for each point in the dataset, much like the fitting process for gaussian
# data checks if the observed value is consistent with the proposed value
# within the measurement uncertainty.
#
# Using :class:`bumps.curve.PoissonCurve` instead of :class:`bumps.curve.Curve`,
# we can fit a set of *counts* at conditions *x* using a function
# *f(x, p1, p2, ...)* to propose rates for the various *x* values given the
# parameters, yielding parameter values *p1, p2, ...* that are most consistent
# with the *counts* at *x*. When measuring poisson processes, the underlying
# rate is not known, so the measurement variance, which is a property of the
# rate, is not associated with the data but instead associated with the
# theory function which predicts the rates. This is opposite from what we
# have with gaussian data, in which the uncertainty is associated with the
# measurement device, and explains why the call to PoissonCurve only accepts
# *x* and *counts*, not *x*, *y*, and *dy*.
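# To make the distinction concrete, here is a rough sketch (not the actual
# bumps internals) of the two negative log likelihoods being minimized.
# Note that the Poisson version takes only the counts, while the gaussian
# version needs a separate *dy*:

```python
import numpy as np
from math import lgamma  # log(k!) via the log-gamma function

def poisson_nllf(rate, counts):
    # -log P(counts | rate), summed over points. The variance is a
    # property of the proposed rate, so only counts are needed from
    # the data; there is no dy.
    return sum(r - k * np.log(r) + lgamma(k + 1) for r, k in zip(rate, counts))

def gaussian_nllf(theory, y, dy):
    # Half chi-squared, the gaussian analogue. Here the uncertainty
    # dy is a fixed property of the measured data.
    return 0.5 * np.sum(((np.asarray(y) - theory) / dy) ** 2)
```

# In both cases the minimum over the theory values occurs when the theory
# reproduces the data exactly.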
#
# One property of the Poisson distribution is that it is well approximated
# by a gaussian distribution when the rate is above about 10. The match will
# never be perfect since counts from a poisson distribution can never be
# negative, whereas gaussian values can always be negative, albeit with
# vanishingly small probability. Below 10, there are several ways you can
# approximate the poisson distribution with a gaussian. This example
# explores some of the options.
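# A quick numerical check of this claim, using hand-rolled densities rather
# than any particular library: the worst-case difference between the poisson
# pmf and the matching gaussian pdf shrinks as the rate grows.

```python
import numpy as np
from math import lgamma, pi

def poisson_pmf(k, lam):
    # exp(k log lam - lam - log k!), evaluated in log space for stability
    return np.exp(k * np.log(lam) - lam - np.array([lgamma(ki + 1.0) for ki in k]))

def gauss_pdf(k, lam):
    # gaussian with mean and variance both equal to the rate lam
    return np.exp(-0.5 * (k - lam) ** 2 / lam) / np.sqrt(2 * pi * lam)

for lam in (3, 10, 30):
    k = np.arange(int(lam + 5 * np.sqrt(lam)))
    print(lam, np.abs(poisson_pmf(k, lam) - gauss_pdf(k, lam)).max())
```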
#
# In particular, the handling of zero counts can be problematic when treating
# the measurement as gaussian. You cannot simply drop the points with zero
# counts. Once you've done various reduction steps, the resulting non-zero
# value for the uncertainty will carry meaning. The longer you count,
# the smaller the uncertainty should be, once you've normalized for counting
# time or monitor. Being off by a factor of 2 on the residuals is much
# better than being off by a factor of infinity using uncertainty = zero,
# and better than dropping the point altogether.
#
# There are a few things you can do with zero counts without being
# completely arbitrary:
#
# 1) $\lambda = (k+1) \pm \sqrt{k+1}$ for all $k$
# 2) $\lambda = (k+1/2) \pm \sqrt{k+1/4}$ for all $k$
# 3) $\lambda = k \pm \sqrt{k+1}$ for all $k$
# 4) $\lambda = k \pm \sqrt{k}$ for $k>0$, $1/2 \pm 1/2$ for $k = 0$
# 5) $\lambda = k \pm \sqrt{k}$ for $k>0$, $0 \pm 1$ for $k = 0$
#
# See the notes from the CDF Statistics Committee for details at
# `<https://www-cdf.fnal.gov/physics/statistics/notes/pois_eb.txt>`_.
#
# Of these, option 5 works slightly better for fitting, giving the best
# estimate of the background.
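# For reference, the five schemes can be written as a small helper. This is
# an illustrative sketch using the same condition names as the fit script
# below; it is not part of bumps itself:

```python
import numpy as np

def gaussian_approx(k, method):
    # Return (lambda_hat, dlambda) for observed counts k under each scheme.
    k = np.asarray(k, "d")
    if method == "expected":       # option 1: (k+1) +/- sqrt(k+1)
        return k + 1, np.sqrt(k + 1)
    if method == "pearson":        # option 2: (k+1/2) +/- sqrt(k+1/4)
        return k + 0.5, np.sqrt(k + 0.25)
    if method == "expected_mle":   # option 3: k +/- sqrt(k+1)
        return k.copy(), np.sqrt(k + 1)
    if method == "pearson_zero":   # option 4: k +/- sqrt(k); 1/2 +/- 1/2 at k=0
        lam, dlam, zero = k.copy(), np.sqrt(k), k == 0
        lam[zero], dlam[zero] = 0.5, 0.5
        return lam, dlam
    if method == "expected_zero":  # option 5: k +/- sqrt(k); 0 +/- 1 at k=0
        dlam = np.sqrt(k)
        dlam[k == 0] = 1.0
        return k.copy(), dlam
    raise ValueError("unknown method %r" % method)
```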
#
# The ideal case is to have your model produce an expected number of counts
# on the detector. It is then trivial to compute the probability of
# seeing the observed counts from the expected counts and fit the parameters
# using PoissonCurve. Unfortunately, this means incorporating all
# instrumental effects when modelling the measurement rather than correcting
# for instrumental effects in a data reduction program, and using a common
# sample model independent of instrument.
#
# Setting $\lambda = k$ is good since that is the maximum likelihood value
# for $\lambda$ given observed $k$, but this breaks down at $k=0$, giving zero
# uncertainty regardless of how long we measured.
#
# Since the Poisson distribution is slightly skew, a good estimate is
# $\lambda = k+1$ (option 1 above). This follows from the formula for the
# expected value of a distribution:
#
# .. math::
#
#    E[x] = \int_{-\infty}^\infty x P(x) dx
#
# For the poisson distribution, this is:
#
# .. math::
#
#    E[\lambda] = \int_0^\infty \lambda \frac{\lambda^k e^{-\lambda}}{k!} d\lambda
#
# Running some simulations (see `sim.py <sim.html>`_), we find
# $\hat\lambda=(k+1)\pm\sqrt{k+1}$: this is the best fit RMS value to the
# distribution of possible $\lambda$ values that could give rise to the
# observed $k$.
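# The simulation is easy to reproduce. Assuming a flat prior on the rate,
# the rates that give rise to a particular observed $k$ average to $k+1$,
# with an RMS spread of $\sqrt{k+1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.0, 50.0, size=1_000_000)  # flat prior on the rate
k = rng.poisson(lam)                          # one observation per rate
for k0 in (0, 1, 5):
    sel = lam[k == k0]
    # mean(sel) is close to k0 + 1, and std(sel) to sqrt(k0 + 1)
    print(k0, round(sel.mean(), 2), round(sel.std(), 2))
```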
#
# The current practice is to use $\hat\lambda=k\pm\sqrt{k}$. Convincing the
# world to accept $\lambda = k+1$ would be challenging since the expected
# value is not the most likely value. As a compromise, one can use $0 \pm 1$
# for zero counts, and $k \pm \sqrt{k}$ for other values. This provides a
# reasonable estimate for the uncertainty on zero counts, which after
# normalization becomes smaller for longer counting times or higher incident
# flux.
#
# Another option is to choose the center and bounds so that the
# uncertainty covers $1-\sigma$ from the distribution (68%). A simple
# approximation which does this is $(n+1/2) \pm \sqrt{n+1/4}$.
# Again, hard to convince the world to do, so one could compromise and
# choose $1/2 \pm 1/2$ for $k=0$ and $k \pm \sqrt{k}$ otherwise.
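# One way to see where this interval comes from: its endpoints are exactly
# the rates $\lambda$ at which the observed count $k$ sits one standard
# deviation $\sqrt\lambda$ away, i.e. the roots of $(k-\lambda)^2 = \lambda$:

```python
import numpy as np

k = np.arange(20)
for sign in (+1, -1):
    lam = k + 0.5 + sign * np.sqrt(k + 0.25)
    # each endpoint satisfies (k - lam)**2 == lam exactly
    assert np.allclose((k - lam) ** 2, lam)
```

# At $k=0$ the interval is $[0, 1]$, which is the $1/2 \pm 1/2$ value used
# in the compromise above.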
#
# What follows is a model which allows us to fit a simulated peak using
# these various definitions of $\lambda$ and see which version best recovers
# the true parameters which generated the peak.
import sys

import numpy as np

from bumps.names import *
# Define the peak shape. We are using a simple gaussian with center, width,
# scale and background.
def peak(x, scale, center, width, background):
    return scale * np.exp(-0.5 * (x - center) ** 2 / width**2) + background
# Generate simulated peak data with poisson noise. When running the fit,
# you can choose various values for the peak intensity. We are using a
# large number of points so that the peak is highly constrained by the
# data, and the returned parameters are consistent from run to run. Real
# data is likely not so heavily sampled.
x = np.linspace(5, 20, 345)
# y = np.random.poisson(peak(x, 1000, 12, 1.0, 1))
# y = np.random.poisson(peak(x, 300, 12, 1.5, 1))
y = np.random.poisson(peak(x, 3, 12, 1.5, 1))
# Define the various conditions. These can be selected on the command
# line by listing the condition name after the model file. Note that
# bumps will make any option not preceded by "-" available to the model
# file as elements of *sys.argv*. *sys.argv[0]* is the model file itself.
#
# The options correspond to the five options listed above, with an additional
# option "poisson" which is used to select PoissonCurve rather than Curve
# in the fit.
cond = sys.argv[1] if len(sys.argv) > 1 else "pearson"
if cond == "poisson":  # option 0: use PoissonCurve rather than Curve to fit
    pass
elif cond == "expected":  # option 1: L = (y+1) +/- sqrt(y+1)
    y += 1
    dy = np.sqrt(y)
elif cond == "pearson":  # option 2: L = (y + 1/2) +/- sqrt(y + 1/4)
    dy = np.sqrt(y + 0.25)
    y = y + 0.5
elif cond == "expected_mle":  # option 3: L = y +/- sqrt(y+1)
    dy = np.sqrt(y + 1)
elif cond == "pearson_zero":  # option 4: L = y +/- sqrt(y); L[0] = 0.5 +/- 0.5
    y = np.asarray(y, "d")
    dy = np.sqrt(y)
    zero = y == 0  # find the zero-count points before modifying y
    y[zero] = 0.5
    dy[zero] = 0.5
elif cond == "expected_zero":  # option 5: L = y +/- sqrt(y); L[0] = 0 +/- 1
    dy = np.sqrt(y)
    dy[y == 0] = 1.0
else:
    raise RuntimeError(
        "Need to select uncertainty: pearson, pearson_zero, expected, expected_zero, expected_mle, poisson"
    )
# Build the fitter, and set the range on the fit parameters.
if cond == "poisson":
    M = PoissonCurve(peak, x, y, scale=1, center=2, width=2, background=0)
else:
    M = Curve(peak, x, y, dy, scale=1, center=2, width=2, background=0)
dx = max(x) - min(x)
M.scale.range(0, max(y) * 1.5)
M.center.range(min(x) - 0.2 * dx, max(x) + 0.2 * dx)
M.width.range(0, 0.7 * dx)
M.background.range(0, max(y))
# Set the fit problem as usual.
problem = FitProblem(M)
# We can now load and run the fit. Be sure to substitute COND for one of the
# conditions defined above:
#
# .. parsed-literal::
#
# $ bumps.py poisson.py --fit=dream --burn=600 --store=/tmp/T1 COND
#
# Comparing the results for the various conditions, we can see that all methods
# yield a good fit to the underlying center, scale and width. It is only the
# background that causes problems. Using poisson statistics for the fit gives
# the proper background estimate, and using the traditional method of
# $\lambda = k \pm \sqrt{k}$ for $k>0$, and $0 \pm 1$ for $k=0$ gives the
# best gaussian approximation.
#
# .. table:: Fit results
#
# = ================= ==========
# # method background
# = ================= ==========
# 0 poisson 1.0
# 1 expected 1.55
# 2 pearson 0.16
# 3 expected_mle 0.55
# 4 pearson_zero 0.34
# 5 expected_zero 0.75
# = ================= ==========