File: postestimation_poisson.py

package info (click to toggle)
statsmodels 0.14.6%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 49,956 kB
  • sloc: python: 254,365; f90: 612; sh: 560; javascript: 337; asm: 156; makefile: 145; ansic: 32; xml: 9
file content (378 lines) | stat: -rw-r--r-- 13,168 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
#!/usr/bin/env python
# coding: utf-8

# DO NOT EDIT
# Autogenerated from the notebook postestimation_poisson.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT

# # Post-estimation Overview - Poisson
#
# This notebook provides an overview of post-estimation results that are
# available in several models, illustrated for the Poisson Model.
#
# see also https://github.com/statsmodels/statsmodels/issues/7707
#
# Traditionally the results classes for the models provided Wald inference
# and prediction. Several models now have additional methods for
# postestimation results, for inference, prediction and specification or
# diagnostic tests.
#
# The following is based on the current pattern for maximum likelihood
# models outside tsa, mainly for the discrete models. Other models still
# follow to some extend a different API pattern. Linear models like OLS and
# WLS have their special implementation, for example OLS influence. GLM also
# still has some features that are model specific.
#
# The main post-estimation features are
#
# - Inference - Wald tests [section](#Inference---Wald)
# - Inference - score tests [section](#Inference---score_test)
# - `get_prediction` prediction with inferential statistics
# [section](#Prediction)
# - `get_distribution` distribution class based on estimated parameters
# [section](#Distribution)
# - `get_diagnostic` diagnostic and specification tests, measures and
# plots [section](#Diagnostic)
# - `get_influence` outlier and influence diagnostics [section](#Outliers-
# and-Influence)
#
# **Warning** Recently added features are not stable.
# The main features have been unit tested and verified against other
# statistical packages. However, not every option is fully tested. The API,
# options, defaults and return types may still change as more features are
# added.
# (The current emphasis is on adding features and not on finding a
# convenient and futureproof interface.)
#
#

# ## A simulated example
#
# For the illustration we simulate data for the Poisson regression, that
# is correctly specified and has a relatively large sample. One regressor is
# categorical with two levels, The second regressor is uniformly distributed
# on the unit interval.
#

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

from statsmodels.discrete.discrete_model import Poisson
from statsmodels.discrete.diagnostic import PoissonDiagnostic

np.random.seed(983154356)

nr = 10
n_groups = 2
labels = np.arange(n_groups)
x = np.repeat(labels, np.array([40, 60]) * nr)
nobs = x.shape[0]
exog = (x[:, None] == labels).astype(np.float64)
xc = np.random.rand(len(x))
exog = np.column_stack((exog, xc))
# reparameterize to explicit constant
# exog[:, 1] = 1
beta = np.array([0.2, 0.3, 0.5], np.float64)

linpred = exog @ beta
mean = np.exp(linpred)
y = np.random.poisson(mean)
len(y), y.mean(), (y == 0).mean()

res = Poisson(y, exog).fit(disp=0)
print(res.summary())

# ## Inference - Wald
#
# Wald tests and other inferential statistics like confidence intervals
# based on Wald test have been a feature of the models since the beginning.
# Wald inference is based on the Hessian or expected information matrix
# evaluted at the estimated parameters.
# The covariance matrix of the parameter is optionally of the sandwich
# form which is robust to unspecified heteroscedasticity or serial or
# cluster correlation (`cov_type` option for `fit`).
#
# The currently available methods, aside from the statistics in the
# parmeter table, are
#
# - t_test
# - wald_test
# - t_test_pairwise
# - wald_test_terms
#
# `f_test` is available as legacy method. It is the same as `wald_test`
# with keyword option `use_f=True`.

res.t_test("x1=x2")

res.wald_test("x1=x2, x3", scalar=True)

# ## Inference - score_test
#
# new in statsmodels 0.14 for most discrete models and for GLM.
#
# Score or lagrange multiplier (LM) tests are based on the model estimated
# under the null hypothesis. A common example are variable addition tests
# for which we estimate the model parameters under null restrictions but
# evaluate the score and hessian under for the full model to test whether an
# additional variable is statistically significant.
#
#
# **Note:** Similar to the Wald tests, the score test implemented in the
# discrete models and GLM also has the option to use a heteroscedasticity or
# correlation robust covariance type.
# It currently uses the same implementation and defaults for the robust
# covariance matrix as in the Wald tests. In some cases the small sample
# corrections included in the `cov_type` for Wald tests will not be
# appropriate for score tests. In many cases Wald tests overjects but score
# tests can underreject. Using the Wald small sample corrections for score
# tests might leads then to more conservative p-values.
# (The defaults for small sample corrections might change in future. There
# is currently only little general information available about small sample
# corrections for heteroscedasticity and correlation robust score tests.
# Other statistical packages only implement it for a few special cases.)
#
# We can use the variable addition score_test for specification testing.
# In the following example we test whether there is some misspecified
# nonlinearity in the model by adding quadratic or polynomial tersm.
#
# In our example we can expect that these specification tests do not
# reject the null hypotheses because the model is correctly specified and
# the sample size is large,

res.score_test(exog_extra=xc**2)

# A reset test is a test for the correct specification of the link
# function. The standard form of the test adds polynomial terms of the
# linear predictor as extra regressors and test for their significance.
#
# Here we use the variable addition score test for the reset test with
# powers 2 and 3.

linpred = res.predict(which="linear")
res.score_test(exog_extra=linpred[:, None]**[2, 3])

# ## Prediction
#
# The model and results classes have `predict` method which only returns
# the predicted values. The `get_prediction` method adds inferential
# statistics for the prediction, standard errors, pvalues and confidence
# intervals.
#
#
# For the following example, we create new sets of explanatory variables
# that is split by the categorical level and over a uniform grid of the
# continuous variable.

n = 11
exc = np.linspace(0, 1, n)
ex1 = np.column_stack((np.ones(n), np.zeros(n), exc))
ex2 = np.column_stack((np.zeros(n), np.ones(n), exc))

m1 = res.get_prediction(ex1)
m2 = res.get_prediction(ex2)

# The available methods and attributes of the prediction results class are

[i for i in dir(m1) if not i.startswith("_")]

plt.plot(exc, np.column_stack([m1.predicted, m2.predicted]))
ci = m1.conf_int()
plt.fill_between(exc, ci[:, 0], ci[:, 1], color='b', alpha=.1)
ci = m2.conf_int()
plt.fill_between(exc, ci[:, 0], ci[:, 1], color='r', alpha=.1)
# to add observed points:
# y1 = y[x == 0]
# plt.plot(xc[x == 0], y1, ".", color="b", alpha=.3)
# y2 = y[x == 1]
# plt.plot(xc[x == 1], y2, ".", color="r", alpha=.3)

y.max()

# One of the available statistics that we can predict, specified by the
# "which" keyword, is the expected frequencies or probabilities of the
# predictive distribution. This shows us what the predicted probability of
# obsering count = 1, 2, 3, ... is for a given set of explanatory variables.

y_max = 5
f1 = res.get_prediction(ex1, which="prob", y_values=np.arange(y_max + 1))
f2 = res.get_prediction(ex2, which="prob", y_values=np.arange(y_max + 1))
f1.predicted.mean(0), f2.predicted.mean(0)

# We can also get the confidence intervals for the predicted
# probabilities.
# However, if we want the confidence interval for the average predicted
# probabilities, then we need to aggregate inside the predict function. The
# relevant keyword is "average" which computes the average of the
# predictions over the observations given by the `exog` array.

f1 = res.get_prediction(ex1,
                        which="prob",
                        y_values=np.arange(y_max + 1),
                        average=True)
f2 = res.get_prediction(ex2,
                        which="prob",
                        y_values=np.arange(y_max + 1),
                        average=True)
f1.predicted, f2.predicted

f1.conf_int()

f2.conf_int()

# To get more information about the predict methods and the available
# options, see
# `help(res.get_prediction)`
# `help(res.model.predict)`

# ## Distribution
#
# For given parameters we can create an instance of a scipy or scipy-
# compatible distribution class. This provides us with access to any of the
# methods in the distribution, pmf/pdf, cdf, stats.
#
# The `get_distribution` method of the results class uses the provided
# array of explanatory variables and the estimated parameters to specify a
# vectorized distribution. The `get_prediction` method of the model can be
# used for user specified parameters `params`.

distr = res.get_distribution()
distr

distr.pmf(0)[:10]

# The mean of the conditional distribution is the same as the predicted
# mean from the model.

distr.mean()[:10]

res.predict()[:10]

# We can also obtain the distribution for a new set of explanatory
# variables. Explanatory variables can be provided in the same way as for
# the predict method.
#
# We use again the grid of explanatory variables from the prediction
# section. As example for its usage we can compute the probability that a
# count (strictly) larger than 5 will be observed conditional on the values
# of the explanatory variables.

distr1 = res.get_distribution(ex1)
distr2 = res.get_distribution(ex2)

distr1.sf(5), distr2.sf(5)

plt.plot(exc, np.column_stack([distr1.sf(5), distr2.sf(5)]))

# We can also use the distribution to find an upper confidence limit on a
# new observation. The following plot and table show the upper limit counts
# for given explanatory variables. The probability of observing this count
# or less is at least 0.99.
#
# Note, this takes parameters as fixed and does not take parameter
# uncertainty into account.

plt.plot(exc, np.column_stack([distr1.ppf(0.99), distr2.ppf(0.99)]))

[distr1.ppf(0.99), distr2.ppf(0.99)]

# ## Diagnostic
#
# Poisson is the first model that has a diagnostic class that can be
# obtained from the results using `get_diagnostic`. Other count models have
# a generic count diagnostic class that has currently only a limited number
# of methods.
#
# The Poisson model in our example is correctly specified. Additionally we
# have a large sample size. So, in this case none of the diagnostic tests
# reject the null hypothesis of correct specification.

dia = res.get_diagnostic()
[i for i in dir(dia) if not i.startswith("_")]

dia.plot_probs()

# **test for excess dispersion**
#
# There are several dispersion tests available for Poisson. Currently all
# of them are returned.
# The DispersionResults class has a summary_frame method. The returned
# dataframe provides an overview of the results that is  easier to read.

td = dia.test_dispersion()
td

df = td.summary_frame()
df

# **test for zero-inflation**

dia.test_poisson_zeroinflation()

# chisquare test for zero-inflation

dia.test_chisquare_prob(bin_edges=np.arange(3))

# **goodness of fit test for predicted frequencies**
#
# This is a chisquare test that takes into account that parameters are
# estimated.
# Counts larger than the largest bin edge will be added to the last bin,
# so that the sum over bins is one.
#
# For example using 5 bins

dt = dia.test_chisquare_prob(bin_edges=np.arange(6))
dt

dt.diff1.mean(0)

vars(dia)

# ## Outliers and Influence
#
# Statsmodels provides a general MLEInfluence class for nonlinear models
# (models with nonlinear expected mean) that for the discrete models and
# other maximum likelihood based models such as the Beta regression model.
# The provided measures are based on general definitions, for example
# generalized leverage instead of the diagonal of the hat matrix in linear
# models.
#
# The results method `get_influence` returns and instance of the
# MLEInfluence class which has various methods for outlier and influence
# measures.
#

infl = res.get_influence()
[i for i in dir(infl) if not i.startswith("_")]

# The influence class has two plot methods. However, the plots are too
# crowded in this case because of the large sample size.

infl.plot_influence()

infl.plot_index(y_var="resid_studentized")

# A `summary_frame` shows the main influence and outlier measures for each
# observations.
#
# We have 1000 observations in our example which is too many to easily
# display. We can sort the summary dataframe by one of the columns and list
# the observations with the largest outlier or influence measure. In the
# example below, we sort by Cook's distance and by `standard_resid` which is
# the Pearson residual in the generic case.
#
# Because we simulated a "nice" model, there are no observations with
# large influence or that are large outliers.

df_infl = infl.summary_frame()
df_infl.head()

df_infl.sort_values("cooks_d", ascending=False)[:10]

df_infl.sort_values("standard_resid", ascending=False)[:10]