#!/usr/bin/env python
# coding: utf-8

# DO NOT EDIT
# Autogenerated from the notebook postestimation_poisson.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT

# # Post-estimation Overview - Poisson
#
# This notebook provides an overview of post-estimation results that are
# available in several models, illustrated for the Poisson Model.
#
# see also https://github.com/statsmodels/statsmodels/issues/7707
#
# Traditionally the results classes for the models provided Wald inference
# and prediction. Several models now have additional methods for
# postestimation results, for inference, prediction and specification or
# diagnostic tests.
#
# The following is based on the current pattern for maximum likelihood
# models outside tsa, mainly for the discrete models. Other models still
# follow, to some extent, a different API pattern. Linear models like OLS
# and WLS have their own special implementations, for example OLS
# influence. GLM also still has some features that are model specific.
#
# The main post-estimation features are
#
# - Inference - Wald tests [section](#Inference---Wald)
# - Inference - score tests [section](#Inference---score_test)
# - `get_prediction` prediction with inferential statistics
# [section](#Prediction)
# - `get_distribution` distribution class based on estimated parameters
# [section](#Distribution)
# - `get_diagnostic` diagnostic and specification tests, measures and
# plots [section](#Diagnostic)
# - `get_influence` outlier and influence diagnostics
# [section](#Outliers-and-Influence)
#
# **Warning** Recently added features are not stable.
# The main features have been unit tested and verified against other
# statistical packages. However, not every option is fully tested. The API,
# options, defaults and return types may still change as more features are
# added.
# (The current emphasis is on adding features and not on finding a
# convenient and futureproof interface.)
#
#

# ## A simulated example
#
# For the illustration we simulate data for a Poisson regression that is
# correctly specified and has a relatively large sample. The first
# regressor is categorical with two levels. The second regressor is
# uniformly distributed on the unit interval.
#

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

from statsmodels.discrete.discrete_model import Poisson
from statsmodels.discrete.diagnostic import PoissonDiagnostic

np.random.seed(983154356)

nr = 10
n_groups = 2
labels = np.arange(n_groups)
x = np.repeat(labels, np.array([40, 60]) * nr)
nobs = x.shape[0]
exog = (x[:, None] == labels).astype(np.float64)
xc = np.random.rand(len(x))
exog = np.column_stack((exog, xc))
# reparameterize to explicit constant
# exog[:, 1] = 1
beta = np.array([0.2, 0.3, 0.5], np.float64)

linpred = exog @ beta
mean = np.exp(linpred)
y = np.random.poisson(mean)
len(y), y.mean(), (y == 0).mean()

res = Poisson(y, exog).fit(disp=0)
print(res.summary())

# ## Inference - Wald
#
# Wald tests and other inferential statistics based on them, such as
# confidence intervals, have been a feature of the models since the
# beginning. Wald inference is based on the Hessian or expected information
# matrix evaluated at the estimated parameters.
# The covariance matrix of the parameters is optionally of the sandwich
# form, which is robust to unspecified heteroscedasticity or serial or
# cluster correlation (`cov_type` option for `fit`).
#
# The currently available methods, aside from the statistics in the
# parameter table, are
#
# - t_test
# - wald_test
# - t_test_pairwise
# - wald_test_terms
#
# `f_test` is available as a legacy method. It is the same as `wald_test`
# with keyword option `use_f=True`.

res.t_test("x1=x2")

res.wald_test("x1=x2, x3", scalar=True)
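
# As a rough illustration of what a Wald test of a single linear
# restriction computes, the sketch below rebuilds the statistic for
# x1 = x2 from the parameter estimates and their covariance matrix and
# prints it next to the `t_test` output. The data are re-simulated with
# NumPy's newer Generator API and a different seed than above, and all
# `*_demo` names are hypothetical and used only for this illustration.

```python
import numpy as np
from scipy import stats
from statsmodels.discrete.discrete_model import Poisson

# Small simulated sample; the *_demo names are illustrative only.
rng = np.random.default_rng(0)
n_demo = 1000
group = (np.arange(n_demo) >= 400).astype(float)
exog_demo = np.column_stack((1 - group, group, rng.uniform(size=n_demo)))
y_demo = rng.poisson(np.exp(exog_demo @ np.array([0.2, 0.3, 0.5])))
res_demo = Poisson(y_demo, exog_demo).fit(disp=0)

# The restriction x1 - x2 = 0 written as a contrast vector.
contrast = np.array([1.0, -1.0, 0.0])
estimate = contrast @ res_demo.params
std_err = np.sqrt(contrast @ res_demo.cov_params() @ contrast)

# Wald z statistic and two-sided p-value from the normal distribution.
z_manual = estimate / std_err
p_manual = 2 * stats.norm.sf(abs(z_manual))

# Same restriction passed to t_test as an array instead of a string.
tt_demo = res_demo.t_test(contrast)
print(z_manual, p_manual)
print(np.squeeze(tt_demo.tvalue), np.squeeze(tt_demo.pvalue))
```

# The manual statistic only needs `params` and `cov_params()`, so a robust
# `cov_type` chosen in `fit` would carry over to it in the same way.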

# ## Inference - score_test
#
# New in statsmodels 0.14 for most discrete models and for GLM.
#
# Score or Lagrange multiplier (LM) tests are based on the model estimated
# under the null hypothesis. A common example is the variable addition
# test, for which we estimate the model parameters under the null
# restrictions but evaluate the score and Hessian for the full model to
# test whether an additional variable is statistically significant.
#
#
# **Note:** Similar to the Wald tests, the score test implemented in the
# discrete models and GLM also has the option to use a heteroscedasticity or
# correlation robust covariance type.
# It currently uses the same implementation and defaults for the robust
# covariance matrix as the Wald tests. In some cases the small sample
# corrections included in the `cov_type` for Wald tests will not be
# appropriate for score tests. In many cases Wald tests overreject while
# score tests can underreject. Using the Wald small sample corrections for
# score tests might then lead to more conservative p-values.
# (The defaults for small sample corrections might change in the future.
# There is currently little general information available about small
# sample corrections for heteroscedasticity and correlation robust score
# tests. Other statistical packages only implement them for a few special
# cases.)
#
# We can use the variable addition score_test for specification testing.
# In the following example we test whether there is some misspecified
# nonlinearity in the model by adding quadratic or polynomial terms.
#
# In our example we can expect that these specification tests do not
# reject the null hypotheses because the model is correctly specified and
# the sample size is large.

res.score_test(exog_extra=xc**2)

# A RESET test is a test for the correct specification of the link
# function. The standard form of the test adds polynomial terms of the
# linear predictor as extra regressors and tests for their significance.
#
# Here we use the variable addition score test for the RESET test with
# powers 2 and 3.

linpred = res.predict(which="linear")
res.score_test(exog_extra=linpred[:, None]**[2, 3])
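
# The idea behind a variable addition score test can be sketched by hand
# for the Poisson model with log link. Only the null model is fitted; the
# score and information matrix of the larger model are then evaluated at
# the null estimates. This is a minimal sketch of the nonrobust LM
# statistic under those assumptions, not a description of the library's
# internal implementation. The data are re-simulated and the `*_demo` and
# `*_null` names are hypothetical.

```python
import numpy as np
from scipy import stats
from statsmodels.discrete.discrete_model import Poisson

rng = np.random.default_rng(0)
n_demo = 1000
group = (np.arange(n_demo) >= 400).astype(float)
xc_demo = rng.uniform(size=n_demo)
exog_demo = np.column_stack((1 - group, group, xc_demo))
y_demo = rng.poisson(np.exp(exog_demo @ np.array([0.2, 0.3, 0.5])))

# Fit only the restricted (null) model.
res_null = Poisson(y_demo, exog_demo).fit(disp=0)
mu_null = res_null.predict()

# Score and information of the full model, evaluated at the null estimates.
# For Poisson with log link: score = X'(y - mu), information = X' diag(mu) X.
exog_full = np.column_stack((exog_demo, xc_demo**2))
score = exog_full.T @ (y_demo - mu_null)
info = exog_full.T @ (exog_full * mu_null[:, None])

# LM statistic: quadratic form of the score in the inverse information,
# compared with a chi-square distribution with one degree of freedom
# because one variable is added.
lm_stat = score @ np.linalg.solve(info, score)
lm_pvalue = stats.chi2.sf(lm_stat, df=1)
print(lm_stat, lm_pvalue)
```

# The first three elements of `score` are numerically zero at the null
# estimates, so only the added column contributes to the statistic.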

# ## Prediction
#
# The model and results classes have a `predict` method which only returns
# the predicted values. The `get_prediction` method adds inferential
# statistics for the prediction: standard errors, p-values and confidence
# intervals.
#
#
# For the following example, we create new sets of explanatory variables,
# one for each categorical level, over a uniform grid of the continuous
# variable.

n = 11
exc = np.linspace(0, 1, n)
ex1 = np.column_stack((np.ones(n), np.zeros(n), exc))
ex2 = np.column_stack((np.zeros(n), np.ones(n), exc))

m1 = res.get_prediction(ex1)
m2 = res.get_prediction(ex2)

# The available methods and attributes of the prediction results class are

[i for i in dir(m1) if not i.startswith("_")]

plt.plot(exc, np.column_stack([m1.predicted, m2.predicted]))
ci = m1.conf_int()
plt.fill_between(exc, ci[:, 0], ci[:, 1], color='b', alpha=.1)
ci = m2.conf_int()
plt.fill_between(exc, ci[:, 0], ci[:, 1], color='r', alpha=.1)
# to add observed points:
# y1 = y[x == 0]
# plt.plot(xc[x == 0], y1, ".", color="b", alpha=.3)
# y2 = y[x == 1]
# plt.plot(xc[x == 1], y2, ".", color="r", alpha=.3)

y.max()

# One of the available statistics that we can predict, specified by the
# "which" keyword, is the expected frequencies or probabilities of the
# predictive distribution. This shows us the predicted probability of
# observing count = 0, 1, 2, ... for a given set of explanatory variables.

y_max = 5
f1 = res.get_prediction(ex1, which="prob", y_values=np.arange(y_max + 1))
f2 = res.get_prediction(ex2, which="prob", y_values=np.arange(y_max + 1))
f1.predicted.mean(0), f2.predicted.mean(0)
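
# For a Poisson model, these predicted probabilities are the Poisson pmf
# evaluated at the predicted mean. The sketch below shows the shape of
# such a table using SciPy directly. As a simplifying assumption it uses
# the true simulation parameters (0.2, 0.3, 0.5) in place of estimates,
# and the `*_demo` names are hypothetical.

```python
import numpy as np
from scipy import stats

# True parameters of the simulation above, used here instead of estimates.
beta_true = np.array([0.2, 0.3, 0.5])
grid = np.linspace(0, 1, 11)
ex_demo = np.column_stack((np.ones(11), np.zeros(11), grid))
mu_demo = np.exp(ex_demo @ beta_true)

# One row per grid point, one column per count 0, 1, ..., 5.
counts = np.arange(6)
probs_demo = stats.poisson.pmf(counts, mu_demo[:, None])

# Average probability of each count over the grid.
avg_probs = probs_demo.mean(axis=0)

# Probability mass on counts larger than 5, which the table leaves out.
tail_mass = 1 - probs_demo.sum(axis=1)
print(avg_probs)
print(tail_mass)
```

# The rows do not sum to one because the counts are truncated at 5; the
# remainder is the probability of observing a larger count.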

# We can also get the confidence intervals for the predicted
# probabilities.
# However, if we want the confidence interval for the average predicted
# probabilities, then we need to aggregate inside the predict function. The
# relevant keyword is "average" which computes the average of the
# predictions over the observations given by the `exog` array.

f1 = res.get_prediction(ex1,
                        which="prob",
                        y_values=np.arange(y_max + 1),
                        average=True)
f2 = res.get_prediction(ex2,
                        which="prob",
                        y_values=np.arange(y_max + 1),
                        average=True)
f1.predicted, f2.predicted

f1.conf_int()

f2.conf_int()

# To get more information about the predict methods and the available
# options, see
# `help(res.get_prediction)`
# `help(res.model.predict)`

# ## Distribution
#
# For given parameters we can create an instance of a scipy or scipy-
# compatible distribution class. This provides us with access to any of the
# methods in the distribution, pmf/pdf, cdf, stats.
#
# The `get_distribution` method of the results class uses the provided
# array of explanatory variables and the estimated parameters to specify a
# vectorized distribution. The `get_distribution` method of the model can
# be used for user specified parameters `params`.

distr = res.get_distribution()
distr

distr.pmf(0)[:10]

# The mean of the conditional distribution is the same as the predicted
# mean from the model.

distr.mean()[:10]

res.predict()[:10]

# We can also obtain the distribution for a new set of explanatory
# variables. Explanatory variables can be provided in the same way as for
# the predict method.
#
# We again use the grid of explanatory variables from the prediction
# section. As an example of its usage, we can compute the probability that
# a count (strictly) larger than 5 will be observed conditional on the
# values of the explanatory variables.

distr1 = res.get_distribution(ex1)
distr2 = res.get_distribution(ex2)

distr1.sf(5), distr2.sf(5)

plt.plot(exc, np.column_stack([distr1.sf(5), distr2.sf(5)]))

# We can also use the distribution to find an upper confidence limit on a
# new observation. The following plot and table show the upper limit counts
# for given explanatory variables. The probability of observing this count
# or less is at least 0.99.
#
# Note, this takes parameters as fixed and does not take parameter
# uncertainty into account.

plt.plot(exc, np.column_stack([distr1.ppf(0.99), distr2.ppf(0.99)]))

[distr1.ppf(0.99), distr2.ppf(0.99)]
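
# The meaning of `sf` and `ppf` for a count distribution can be checked
# with a frozen `scipy.stats.poisson`, which stands in here for the object
# returned by `get_distribution`. As a simplifying assumption the means
# are computed from the true simulation parameters rather than from
# estimates, and the `*_demo` names are hypothetical.

```python
import numpy as np
from scipy import stats

# True parameters of the simulation above, used here instead of estimates.
beta_true = np.array([0.2, 0.3, 0.5])
grid = np.linspace(0, 1, 11)
ex_demo = np.column_stack((np.ones(11), np.zeros(11), grid))
distr_demo = stats.poisson(np.exp(ex_demo @ beta_true))

# sf(5) is P(Y > 5), the complement of the cdf at 5.
tail_prob = distr_demo.sf(5)

# ppf(0.99) is the smallest count k with P(Y <= k) >= 0.99.
upper = distr_demo.ppf(0.99)
coverage = distr_demo.cdf(upper)
coverage_below = distr_demo.cdf(upper - 1)

print(tail_prob)
print(upper)
print(coverage)
```

# Because counts are discrete, the coverage at the upper limit is usually
# larger than 0.99 rather than exactly equal to it.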

# ## Diagnostic
#
# Poisson is the first model that has a diagnostic class that can be
# obtained from the results using `get_diagnostic`. Other count models have
# a generic count diagnostic class that has currently only a limited number
# of methods.
#
# The Poisson model in our example is correctly specified. Additionally we
# have a large sample size. So, in this case none of the diagnostic tests
# reject the null hypothesis of correct specification.

dia = res.get_diagnostic()
[i for i in dir(dia) if not i.startswith("_")]

dia.plot_probs()

# **test for excess dispersion**
#
# There are several dispersion tests available for Poisson. Currently all
# of them are returned.
# The DispersionResults class has a summary_frame method. The returned
# dataframe provides an overview of the results that is easier to read.

td = dia.test_dispersion()
td

df = td.summary_frame()
df
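
# As a simple descriptive counterpart to these tests, the mean squared
# Pearson residual should be close to one when the variance equals the
# mean. The sketch below is not one of the tests returned by
# `test_dispersion`. It evaluates residuals at the true simulation mean
# instead of fitted values, and contrasts Poisson data with a
# gamma-mixed, overdispersed sample. The function name and the `*_demo`
# names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_demo = 1000
group = (np.arange(n_demo) >= 400).astype(float)
exog_demo = np.column_stack((1 - group, group, rng.uniform(size=n_demo)))
mu_demo = np.exp(exog_demo @ np.array([0.2, 0.3, 0.5]))


def pearson_dispersion(y, mu):
    """Mean squared Pearson residual for a Poisson variance function."""
    return np.mean((y - mu) ** 2 / mu)


# Correctly specified Poisson counts.
y_poisson = rng.poisson(mu_demo)

# Gamma mixing with mean one keeps the mean at mu_demo but inflates the
# variance to mu + mu**2 / 2.
y_over = rng.poisson(mu_demo * rng.gamma(shape=2.0, scale=0.5, size=n_demo))

disp_poisson = pearson_dispersion(y_poisson, mu_demo)
disp_over = pearson_dispersion(y_over, mu_demo)
print(disp_poisson, disp_over)
```

# With fitted instead of true means, the sum of squared Pearson residuals
# is usually divided by the residual degrees of freedom instead of the
# number of observations.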

# **test for zero-inflation**

dia.test_poisson_zeroinflation()

# chisquare test for zero-inflation

dia.test_chisquare_prob(bin_edges=np.arange(3))

# **goodness of fit test for predicted frequencies**
#
# This is a chisquare test that takes into account that parameters are
# estimated.
# Counts larger than the largest bin edge will be added to the last bin,
# so that the sum over bins is one.
#
# For example using 5 bins

dt = dia.test_chisquare_prob(bin_edges=np.arange(6))
dt

dt.diff1.mean(0)
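
# The comparison behind such a test can be sketched as observed versus
# expected frequencies, with the tail pooled into the last cell so that
# both sets of frequencies sum to one. The binning below (counts 0 to 4
# and "5 or more") is an assumption of this sketch and is not necessarily
# the convention used by `test_chisquare_prob`. Means are taken from the
# true simulation parameters, and the `*_demo` names are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_demo = 1000
group = (np.arange(n_demo) >= 400).astype(float)
exog_demo = np.column_stack((1 - group, group, rng.uniform(size=n_demo)))
mu_demo = np.exp(exog_demo @ np.array([0.2, 0.3, 0.5]))
y_demo = rng.poisson(mu_demo)

# Observed frequencies: counts of 5 or more are pooled into the last cell.
top = 5
y_binned = np.minimum(y_demo, top)
observed = np.bincount(y_binned, minlength=top + 1) / n_demo

# Expected frequencies: average pmf for 0..4 plus the tail P(Y >= 5).
probs = stats.poisson.pmf(np.arange(top), mu_demo[:, None])
tail = stats.poisson.sf(top - 1, mu_demo)
expected = np.column_stack((probs, tail)).mean(axis=0)

print(observed)
print(expected)
print(observed - expected)
```

# A formal test additionally has to account for the sampling variation of
# the estimated parameters, which this sketch ignores.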

vars(dia)

# ## Outliers and Influence
#
# Statsmodels provides a general MLEInfluence class for nonlinear models
# (models with nonlinear expected mean) that is used for the discrete
# models and other maximum likelihood based models such as the Beta
# regression model.
# The provided measures are based on general definitions, for example
# generalized leverage instead of the diagonal of the hat matrix in linear
# models.
#
# The results method `get_influence` returns an instance of the
# MLEInfluence class which has various methods for outlier and influence
# measures.
#

infl = res.get_influence()
[i for i in dir(infl) if not i.startswith("_")]
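
# To give a feel for leverage outside the linear model, the sketch below
# computes the diagonal of the weighted hat matrix that is commonly used
# for a Poisson model with log link, h_i = mu_i * x_i' (X' W X)^{-1} x_i
# with W = diag(mu). Whether MLEInfluence computes exactly this quantity
# is not verified here. For simplicity the means come from the true
# simulation parameters instead of estimates, and the `*_demo` names are
# hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demo = 1000
group = (np.arange(n_demo) >= 400).astype(float)
exog_demo = np.column_stack((1 - group, group, rng.uniform(size=n_demo)))
mu_demo = np.exp(exog_demo @ np.array([0.2, 0.3, 0.5]))

# Information matrix X' W X with Poisson weights W = diag(mu).
info = exog_demo.T @ (exog_demo * mu_demo[:, None])
inv_info = np.linalg.inv(info)

# Leverage of each observation: mu_i * x_i' (X' W X)^{-1} x_i.
leverage = mu_demo * np.einsum("ij,jk,ik->i", exog_demo, inv_info, exog_demo)

print(leverage.sum())
print(leverage.max(), leverage.mean())
```

# As in the linear model, these leverage values sum to the number of
# parameters, so the average leverage is k / nobs.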

# The influence class has two plot methods. However, the plots are too
# crowded in this case because of the large sample size.

infl.plot_influence()

infl.plot_index(y_var="resid_studentized")

# A `summary_frame` shows the main influence and outlier measures for each
# observation.
#
# We have 1000 observations in our example which is too many to easily
# display. We can sort the summary dataframe by one of the columns and list
# the observations with the largest outlier or influence measure. In the
# example below, we sort by Cook's distance and by `standard_resid` which is
# the Pearson residual in the generic case.
#
# Because we simulated a "nice" model, there are no observations with
# large influence or that are large outliers.

df_infl = infl.summary_frame()
df_infl.head()

df_infl.sort_values("cooks_d", ascending=False)[:10]

df_infl.sort_values("standard_resid", ascending=False)[:10]