File: count_hurdle.py

#!/usr/bin/env python
# coding: utf-8

# DO NOT EDIT
# Autogenerated from the notebook count_hurdle.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT

# ## Hurdle and truncated count models
#
# Author: Josef Perktold
#
# Statsmodels now has hurdle and truncated count models, added in version
# 0.14.
#
# A hurdle model is composed of a model for zeros and a model for the
# distribution for counts larger than zero. The zero model is a binary model
# for a count of zero versus larger than zero. The count model for nonzero
# counts is a zero truncated count model.
#
# Statsmodels currently supports hurdle models with Poisson and Negative
# Binomial distributions as zero model and as count model. Binary models
# like Logit, Probit or GLM-Binomial are not yet supported as zero model.
# The advantage of Poisson-Poisson hurdle is that the standard Poisson
# model is a special case with equal parameters in both models. This
# provides a simple Wald test for the hurdle model against the Poisson
# model.
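
# As a quick, hedged check of the special-case claim above (an added
# illustration, not part of the original notebook): when the data are pure
# Poisson, the fitted hurdle model's predicted mean is very close to the
# fitted Poisson model's predicted mean.

import numpy as np
from statsmodels.discrete.discrete_model import Poisson
from statsmodels.discrete.truncated_model import HurdleCountModel

rng_demo = np.random.default_rng(12345)
nobs_demo = 5000
x_demo = np.column_stack((np.ones(nobs_demo), np.linspace(0, 1, nobs_demo)))
y_demo = rng_demo.poisson(np.exp(x_demo @ np.array([0.5, 0.5])))

res_pois_demo = Poisson(y_demo, x_demo).fit(disp=False)
res_hurdle_demo = HurdleCountModel(y_demo, x_demo).fit(disp=False)
# both models estimate the same conditional mean E(y | x)
max_rel_diff = np.max(np.abs(res_hurdle_demo.predict(x_demo)
                             - res_pois_demo.predict(x_demo))
                      / res_pois_demo.predict(x_demo))
print(max_rel_diff)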
#
# The implemented binary model is a censored model where observations are
# right censored at one. That means that only 0 or 1 counts are observed.
#
# The hurdle model can be estimated by separately estimating the zero
# model and the count model for the zero truncated data assuming that
# observations are independently distributed (no correlation across
# observations). The resulting covariance matrix of the parameter estimates
# is block diagonal with diagonal blocks given by the submodels.
# Joint estimation is not yet implemented.
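
# The separate-estimation point can be sketched directly (an added
# illustration with simulated data): the count submodel is just a
# zero-truncated model fit to the strictly positive observations, so
# `TruncatedLFPoisson` on the positive subsample recovers the parameters
# of the underlying count part.

import numpy as np
from statsmodels.discrete.truncated_model import TruncatedLFPoisson

rng_tr = np.random.default_rng(2468)
n_tr = 5000
x_tr = np.column_stack((np.ones(n_tr), np.linspace(0, 1, n_tr)))
beta_tr = np.array([0.5, 0.5])
y_tr = rng_tr.poisson(np.exp(x_tr @ beta_tr))
keep = y_tr > 0  # the zero-truncated sample seen by the count submodel
res_tr = TruncatedLFPoisson(y_tr[keep], x_tr[keep]).fit(disp=False)
print(res_tr.params)  # close to beta_tr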
#
# The censored and truncated count models were developed mainly to support
# the hurdle model. However, the left truncated count models have
# applications beyond the hurdle model. The right censored models are not
# of separate interest because they only support binary observations that
# can be modeled by GLM-Binomial, Logit or Probit.
#
# For the hurdle model there is a single class, `HurdleCountModel`, which
# includes the distributions of the submodels as options.
# Classes for truncated models are currently `TruncatedLFPoisson` and
# `TruncatedLFNegativeBinomialP`, where "LF" stands for left truncation at a
# fixed, observation independent truncation point.

import numpy as np
import pandas as pd

import statsmodels.discrete.truncated_model as smtc

from statsmodels.discrete.discrete_model import (Poisson, NegativeBinomial,
                                                 NegativeBinomialP,
                                                 GeneralizedPoisson)
from statsmodels.discrete.count_model import (ZeroInflatedPoisson,
                                              ZeroInflatedGeneralizedPoisson,
                                              ZeroInflatedNegativeBinomialP)

from statsmodels.discrete.truncated_model import (
    TruncatedLFPoisson,
    TruncatedLFNegativeBinomialP,
    _RCensoredPoisson,
    HurdleCountModel,
)

# ## Simulating a hurdle model
#
# We are simulating a Poisson-Poisson hurdle model explicitly because
# there are not yet any distribution helper functions for it.
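
# There is no distribution helper yet, so here is a minimal sketch of the
# Poisson-Poisson hurdle pmf (the function name `hurdle_poisson_pmf` is
# hypothetical, not a statsmodels API): the zero probability comes from the
# censored Poisson zero model, and the positive counts come from the
# zero-truncated Poisson count model.

import numpy as np
from scipy import stats

def hurdle_poisson_pmf(k, mu_zero, mu_count):
    """pmf of a Poisson-Poisson hurdle model at integer k >= 0."""
    p0 = np.exp(-mu_zero)  # P(y = 0) under the censored zero model
    if k == 0:
        return p0
    # zero-truncated Poisson pmf for the count part, rescaled by P(y > 0)
    return (1 - p0) * stats.poisson.pmf(k, mu_count) / (1 - np.exp(-mu_count))

# sanity check: the probabilities sum to one
total_prob = sum(hurdle_poisson_pmf(k, 1.5, 2.0) for k in range(200))
print(total_prob)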

np.random.seed(987456348)
# large sample to get strong results
nobs = 5000
x = np.column_stack((np.ones(nobs), np.linspace(0, 1, nobs)))

mu0 = np.exp(0.5 * 2 * x.sum(1))
y = np.random.poisson(mu0, size=nobs)
print(np.bincount(y))
mask = mask0 = y > 0
for _ in range(10):
    print(mask.sum())
    if not np.any(mask):
        break
    # redraw the still-positive observations from the count-part Poisson;
    # repeating until none of them is zero performs rejection sampling of
    # the zero-truncated distribution
    mu_ = np.exp(0.5 * x[mask].sum(1))
    y[mask] = np.random.poisson(mu_, size=len(mu_))
    mask = np.logical_and(mask0, y == 0)

np.bincount(y)

# ## Estimating misspecified Poisson Model
#
# The data that we generated has zero deflation, that is, we observe fewer
# zeros than we would expect in a Poisson model.
#
# After fitting the model, we can use the plot function in the Poisson
# diagnostic class to compare the expected predictive distribution with the
# realized frequencies. The plot shows that the Poisson model overestimates
# the number of zeros and underestimates counts of one and two.

mod_p = Poisson(y, x)
res_p = mod_p.fit()
print(res_p.summary())

dia_p = res_p.get_diagnostic()
dia_p.plot_probs()

# ## Estimating the Hurdle Model
#
# Next, we estimate the correctly specified Poisson-Poisson hurdle model.
#
# The signature and options of `HurdleCountModel` show that Poisson-Poisson
# is the default, so we do not need to specify any options when creating
# this model.
#
# `HurdleCountModel(endog, exog, offset=None, dist='poisson',
#                   zerodist='poisson', p=2, pzero=2, exposure=None,
#                   missing='none', **kwargs)`
#
# The results class of the HurdleCountModel has a `get_diagnostic` method.
# However, only some of the diagnostic methods are currently available. The
# plot of the predictive distribution shows very close agreement with the
# data.

mod_h = HurdleCountModel(y, x)
res_h = mod_h.fit(disp=False)
print(res_h.summary())

dia_h = res_h.get_diagnostic()
dia_h.plot_probs()

# We can use the Wald test to test whether the parameters of the zero
# model are the same as the parameters of the zero-truncated count model.
# The p-value is very small and correctly rejects that the model is just
# Poisson. We are using a large sample size, so the power of the test will
# be large in this case.

res_h.wald_test("zm_const = const, zm_x1 = x1", scalar=True)

# ## Prediction
#
# The hurdle model can be used to predict statistics of the overall model
# and of the two submodels. The statistic to predict is specified using the
# `which` keyword.
#
# The following is taken from the docstring of `predict` and lists the
# available options.
#
#         which : str (optional)
#             Statistic to predict. Default is 'mean'.
#
#             - 'mean' : the conditional expectation of endog E(y | x)
#             - 'mean-main' : mean parameter of truncated count model.
#               Note, this is not the mean of the truncated distribution.
#             - 'linear' : the linear predictor of the truncated count
#               model.
#             - 'var' : returns the estimated variance of endog implied by
#               the model.
#             - 'prob-main' : probability of selecting the main model, which
#               is the probability of observing a nonzero count P(y > 0 | x).
#             - 'prob-zero' : probability of observing a zero count,
#               P(y = 0 | x). This is equal to ``1 - prob-main``.
#             - 'prob-trunc' : probability of truncation of the truncated
#               count model. This is the probability of observing a zero
#               count implied by the truncation model.
#             - 'mean-nonzero' : expected value conditional on having an
#               observation larger than zero, E(y | X, y > 0).
#             - 'prob' : probabilities of each count from 0 to max(endog),
#               or for y_values if those are provided. This is a
#               multivariate return (2-dim when predicting for several
#               observations).
#
# These options are available in the `predict` and the `get_prediction`
# methods of the results class.
#
# For the following example, we create a set of explanatory variables that
# are taken from the original data at equally spaced intervals. Then we can
# predict the available statistics conditional on these explanatory
# variables.

which_options = [
    "mean", "mean-main", "linear", "mean-nonzero", "prob-zero", "prob-main",
    "prob-trunc", "var", "prob"
]
ex = x[slice(None, None, nobs // 5), :]
ex

for w in which_options:
    print(w)
    pred = res_h.predict(ex, which=w)
    print("    ", pred)

for w in which_options[:-1]:
    print(w)
    pred = res_h.get_prediction(ex, which=w)
    print("    ", pred.predicted)
    print("  se", pred.se)

# The option `which="prob"` returns an array of predicted probabilities
# for each row of the predict `exog`.
# We are often interested in the mean probabilities averaged over all
# exog. The prediction methods have an option `average=True` to compute the
# average of the predicted values across observations and the corresponding
# standard errors and confidence intervals for those averaged predictions.

pred = res_h.get_prediction(ex, which="prob", average=True)
print("    ", pred.predicted)
print("  se", pred.se)

# We use a pandas DataFrame to get a display that is easier to read. The
# "predicted" column shows the probability mass function of the predicted
# distribution of response values, averaged over our 5 grid points of exog.
# The probabilities do not add up to one because counts larger than those
# observed have positive probability and are missing from the table,
# although in this example that probability is small.

dfp_h = pred.summary_frame()
dfp_h

prob_larger9 = pred.predicted.sum()
prob_larger9, 1 - prob_larger9

# In this case, `get_prediction` returns an instance of the base
# `PredictionResultsDelta` class.
#
# Inferential statistics like standard errors, p-values and confidence
# interval for nonlinear functions that depend on several distribution
# parameters are computed using the delta method. Inference for predictions
# is based on the normal distribution.
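
# A hedged sketch of the delta method itself (generic numpy, not the
# statsmodels internals): the variance of a smooth scalar function
# f(params) is approximated by g' Cov g, where g is the gradient of f at
# the estimate, here computed by central differences.

import numpy as np

def delta_method_se(f, params, cov, eps=1e-6):
    """Standard error of scalar f(params) via a numerical gradient."""
    params = np.asarray(params, dtype=float)
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return np.sqrt(grad @ cov @ grad)

# example with an assumed estimate and covariance: se of exp(b0 + b1)
b_hat = np.array([0.5, 1.0])
v_hat = np.array([[0.01, 0.002], [0.002, 0.02]])
se_delta = delta_method_se(lambda p: np.exp(p[0] + p[1]), b_hat, v_hat)
print(se_delta)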

pred

pred.dist, pred.dist_args

# We can compare the distribution predicted by the hurdle model with the
# one predicted by the Poisson model that we estimated earlier. The last
# column, "diff", shows that the Poisson model overestimates the number of
# zeros by around 8% of observations and underestimates the counts of 1 and
# 2 by 7% and 3.7%, respectively, at the average over the `exog` grid.

pred_p = res_p.get_prediction(ex, which="prob", average=True)
dfp_p = pred_p.summary_frame()
dfp_h["poisson"] = dfp_p["predicted"]
dfp_h["diff"] = dfp_h["poisson"] - dfp_h["predicted"]
dfp_h

# ## Other post-estimation
#
# The estimated hurdle model can be used for Wald tests of parameters and
# for prediction. Other maximum likelihood statistics such as the
# loglikelihood value and information criteria are also available.
#
# However, some post-estimation methods that require helper functions not
# needed for estimation, parameter inference, and prediction are not yet
# available. The main methods that are not supported yet are `score_test`,
# `get_distribution`, and `get_influence`. Diagnostic measures in
# `get_diagnostic` are only available for statistics that are based on
# prediction.
#
#

res_h.llf, res_h.df_resid, res_h.aic, res_h.bic

# Is there excess dispersion? We can use the Pearson residuals to compute
# a Pearson chi-squared statistic, which, divided by the residual degrees
# of freedom, should be close to 1 if the model is correctly specified.
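
# As a hedged check of this rule of thumb (an added illustration with
# simulated, correctly specified Poisson data, computing the Pearson
# residuals by hand): the statistic should then be close to 1.

import numpy as np
from statsmodels.discrete.discrete_model import Poisson

rng_chk = np.random.default_rng(1357)
n_chk = 5000
x_chk = np.column_stack((np.ones(n_chk), np.linspace(0, 1, n_chk)))
y_chk = rng_chk.poisson(np.exp(x_chk @ np.array([0.5, 0.5])))
res_chk = Poisson(y_chk, x_chk).fit(disp=False)
mu_chk = res_chk.predict(x_chk)
# Pearson chi2 divided by the residual degrees of freedom
stat_chk = ((y_chk - mu_chk) ** 2 / mu_chk).sum() / res_chk.df_resid
print(stat_chk)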

(res_h.resid_pearson**2).sum() / res_h.df_resid

# The diagnostic class also has the predictive distribution, which is used
# in the diagnostic plots. No other statistics or tests are currently
# available.

dia_h.probs_predicted.mean(0)

res_h.resid[:10]