File: poisson.py

package info (click to toggle)
python-bumps 1.0.3-3
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 6,200 kB
  • sloc: python: 24,517; xml: 493; ansic: 373; makefile: 211; javascript: 99; sh: 94
file content (200 lines) | stat: -rw-r--r-- 8,914 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
# Fitting Poisson data
# ====================
#
# Data from poisson processes, such as the number of counts per unit time
# or counts per unit area, do not have the same pattern of uncertainties
# as data from gaussian processes.  Poisson data consists of natural
# numbers occurring at some underlying rate.  The fitting process checks
# if the number of counts observed is consistent with the proposed rate
# for each point in the dataset, much like the fitting process for gaussian
# data checks if the observed value is consistent with the proposed value
# within the measurement uncertainty.
#
# Using :class:`bumps.curve.PoissonCurve` instead of :class:`bumps.curve.Curve`,
# we can fit a set of *counts* at conditions *x* using a function
# *f(x, p1, p2, ...)* to propose rates for the various *x* values given the
# parameters, yielding parameter values *p1, p2, ...* that are most consistent
# with the *counts* at *x*. When measuring poisson processes, the underlying
# rate is not known, so the measurement variance, which is a property of the
# rate, is not associated with the data but instead associated with the
# theory function which predicts the rates.  This is opposite from what we
# have with gaussian data, in which the uncertainty is associated with the
# measurement device, and explains why the call to PoissonCurve only accepts
# *x* and *counts*, not *x*, *y*, and *dy*.
#
# One property of the Poisson distribution is that it is well approximated
# by a gaussian distribution for values above about 10.   It will never be
# perfect match since numbers from a poisson distribution can never be
# negative, whereas gaussian numbers can always be negative, albeit with
# vanishingly small probability some of the time.  Below 10, there are
# various ways you can approximate the poisson distribution with a gaussian.
# This example explores some of the options.
#
# In particular, the handling of zero counts can be problematic when treating
# the measurement as gaussian.  You cannot simply drop the points with zero
# counts. Once you've done various reduction steps, the resulting non-zero
# value for the uncertainty will carry meaning.  The longer you count,
# the smaller the uncertainty should be, once you've normalized for counting
# time or monitor.  Being off by a factor of 2 on the residuals is much
# better than being off by a factor of infinity using uncertainty = zero,
# and better than dropping the point altogether.
#
# There are a few things you can do with zero counts without being
# completely arbitrary:
#
#   1) $\lambda = (k+1) \pm \sqrt{k+1}$ for all $k$
#   2) $\lambda = (k+1/2) \pm \sqrt{k+1/4}$ for all k
#   3) $\lambda = k \pm \sqrt{k+1}$ for all k
#   4) $\lambda = k \pm \sqrt{k}$ for $k>0$, $1/2 \pm 1/2$ for $k = 0$
#   5) $\lambda = k \pm \sqrt{k}$ for $k>0$, $0 \pm 1$ for $k = 0$
#
# See the notes from the CDF Statistics Committee for details at
# `<https://www-cdf.fnal.gov/physics/statistics/notes/pois_eb.txt>`_.
#
# Of these, option 5 works slightly better for fitting, giving the best
# estimate of the background.
#
# The ideal case is to have your model produce an expected number of counts
# on the detector.  It is then trivial to compute the probability of
# seeing the observed counts from the expected counts and fit the parameters
# using PoissonCurve.  Unfortunately, this means incorporating all
# instrumental effects when modelling the measurement rather than correcting
# for instrumental effects in a data reduction program, and using a common
# sample model independent of instrument.
#
# Setting $\lambda = k$ is good since that is the maximum likelihood value
# for $\lambda$ given observed $k$, but this breaks down at $k=0$, giving zero
# uncertainty regardless of how long we measured.
#
# Since the Poisson distribution is slightly skew, a good estimate is
# $\lambda = k+1$ (option 1 above).  This follows from the formula for the
# expected value of a distribution:
#
# .. math::
#
#    E[x] = \int_{-\infty}^\infty x P(x) dx
#
# For the poisson distribution, this is:
#
# .. math::
#
#    E[\lambda] = \int_0^\infty \lambda \frac{\lambda^k e^{-\lambda}}{k!} d\lambda
#
# Running some simulations, we can see that $\hat\lambda=(k+1)\pm\sqrt{k+1}$
# (see `sim.py <sim.html>`_). This is the best fit RMS value to the distribution
# of possible $\lambda$ values that could give rise to the observed $k$.
#
# The current practice is to use $\hat\lambda=k\pm\sqrt{k}$. Convincing the
# world to accept $\lambda = k+1$ would be challenging since the expected
# value is not the most likely value.  As a compromise, one can use $0 \pm 1$
# for zero counts, and $k \pm \sqrt{k}$ for other values. This provides a
# reasonable estimate for the uncertainty on zero counts, which after
# normalization becomes smaller for longer counting times or higher incident
# flux.
#
# Another option is to choose the center and bounds so that the
# uncertainty covers $1-\sigma$ from the distribution (68%).  A simple
# approximation which does this is $(n+1/2) \pm \sqrt{n+1/4}$.
# Again, hard to convince the world to do, so one could compromise and
# choose $1/2 \pm 1/2$ for $k=0$ and $k \pm \sqrt{k}$ otherwise.
#
# What follows is a model which allows us to fit a simulated peak using
# these various definitions of $\lambda$ and see which version best recovers
# the true parameters which generated the peak.

from bumps.names import *

# Define the peak shape.  We are using a simple gaussian with center, width,
# scale and background.


def peak(x, scale, center, width, background):
    return scale * np.exp(-0.5 * (x - center) ** 2 / width**2) + background


# Generate simulated peak data with poisson noise.  When running the fit,
# you can choose various values for the peak intensity.  We are using a
# large number of points so that the peak is highly constrained by the
# data, and the returned parameters are consistent from run to run.  Real
# data is likely not so heavily sampled.

x = np.linspace(5, 20, 345)
# y = np.random.poisson(peak(x, 1000, 12, 1.0, 1))
# y = np.random.poisson(peak(x, 300, 12, 1.5, 1))
y = np.random.poisson(peak(x, 3, 12, 1.5, 1))

# Define the various conditions.  These can be selected on the command
# line by listing the condition name after the model file.  Note that
# bumps will make any option not preceded by "-" available to the model
# file as elements of *sys.argv*.  *sys.argv[0]* is the model file itself.
#
# The options correspond to the five options listed above, with an additional
# option "poisson" which is used to select PoissonCurve rather than Curve
# in the fit.

cond = sys.argv[1] if len(sys.argv) > 1 else "pearson"
if cond == "poisson":  # option 0: use PoissonCurve rather than Curve to fit
    pass
elif cond == "expected":  # option 1: L = (y+1) +/- sqrt(y+1)
    y += 1
    dy = np.sqrt(y)
elif cond == "pearson":  # option 2: L = (y + 0.5)  +/- sqrt(y + 1/4)
    dy = np.sqrt(y + 0.25)
    y = y + 0.5
elif cond == "expected_mle":  # option 3: L = y +/- sqrt(y+1)
    dy = np.sqrt(y + 1)
elif cond == "pearson_zero":  # option 4: L = y +/- sqrt(y); L[0] = 0.5 +/- 0.5
    dy = np.sqrt(y)
    y = np.asarray(y, "d")
    y[y == 0] = 0.5
    dy[y == 0] = 0.5
elif cond == "expected_zero":  # option 5: L = y +/- sqrt(y);  L[0] = 0 +/- 1
    dy = np.sqrt(y)
    dy[y == 0] = 1.0
else:
    raise RuntimeError(
        "Need to select uncertainty: pearson, pearson_zero, expected, expected_zero, expected_mle, poisson"
    )

# Build the fitter, and set the range on the fit parameters.

if cond == "poisson":
    M = PoissonCurve(peak, x, y, scale=1, center=2, width=2, background=0)
else:
    M = Curve(peak, x, y, dy, scale=1, center=2, width=2, background=0)
dx = max(x) - min(x)
M.scale.range(0, max(y) * 1.5)
M.center.range(min(x) - 0.2 * dx, max(x) + 0.2 * dx)
M.width.range(0, 0.7 * dx)
M.background.range(0, max(y))

# Set the fit problem as usual.

problem = FitProblem(M)

# We can now load and run the fit.  Be sure to substitute COND for one of the
# conditions defined above:
#
# .. parsed-literal::
#
#    $ bumps.py poisson.py --fit=dream --burn=600 --store=/tmp/T1 COND
#
# Comparing the results for the various conditions, we can see that all methods
# yield a good fit to the underlying center, scale and width.  It is only the
# background that causes problems.  Using poisson statistics for the fit gives
# the proper background estimate, and using the traditional method of
# $\lambda = k \pm \sqrt{k}$ for $k>0$, and $0 \pm 1$ for $k=1$ gives the
# best gaussian approximation.
#
# .. table:: Fit results
#
#     = ================= ==========
#     # method            background
#     = ================= ==========
#     0 poisson           1.0
#     1 expected          1.55
#     2 pearson           0.16
#     3 expected_mle      0.55
#     4 pearson_zero      0.34
#     5 expected_zero     0.75
#     = ================= ==========