#!/usr/bin/env python
# coding: utf-8
# DO NOT EDIT
# Autogenerated from the notebook formulas.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT
# # Formulas: Fitting models using R-style formulas
# Since version 0.5.0, ``statsmodels`` allows users to fit statistical
# models using R-style formulas. Internally, ``statsmodels`` uses the
# [patsy](http://patsy.readthedocs.org/) package to convert formulas and
# data to the matrices that are used in model fitting. The formula framework
# is quite powerful; this tutorial only scratches the surface. A full
# description of the formula language can be found in the ``patsy`` docs:
#
# * [Patsy formula language description](http://patsy.readthedocs.org/)
#
# ## Loading modules and functions
import numpy as np # noqa:F401 needed in namespace for patsy
import statsmodels.api as sm
# #### Import convention
# You can import explicitly from statsmodels.formula.api
from statsmodels.formula.api import ols
# Alternatively, you can just use the `formula` namespace of the main
# `statsmodels.api`.
sm.formula.ols
# Or you can use the following convention
import statsmodels.formula.api as smf
# These names are just a convenient way to get access to each model's
# `from_formula` classmethod. See, for instance
sm.OLS.from_formula
# All of the lower case models accept ``formula`` and ``data`` arguments,
# whereas upper case ones take ``endog`` and ``exog`` design matrices.
# ``formula`` accepts a string which describes the model in terms of a
# ``patsy`` formula. ``data`` takes a [pandas](https://pandas.pydata.org/)
# data frame or any other data structure that defines a ``__getitem__`` for
# variable names like a structured array or a dictionary of variables.
#
# ``dir(sm.formula)`` lists the available models.
#
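# For instance (a small illustrative aside, not part of the original
# notebook), the public formula constructors can be listed like this:
print([name for name in dir(sm.formula) if not name.startswith("_")])
#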
# Formula-compatible models have the following generic call signature:
# ``(formula, data, subset=None, *args, **kwargs)``
#
# ## OLS regression using formulas
#
# To begin, we fit the linear model described on the [Getting
# Started](./regression_diagnostics.html) page. Download the data, subset
# the columns of interest, and listwise-delete rows to remove missing
# observations:
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = dta.data[["Lottery", "Literacy", "Wealth", "Region"]].dropna()
df.head()
# Fit the model:
mod = ols(formula="Lottery ~ Literacy + Wealth + Region", data=df)
res = mod.fit()
print(res.summary())
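# As an aside (not part of the original notebook), the ``subset`` argument
# from the generic call signature above restricts the rows used in the
# fit; here, only departments with above-median literacy are used:
res_subset = ols(
    formula="Lottery ~ Literacy + Wealth + Region",
    data=df,
    subset=df["Literacy"] > df["Literacy"].median(),
).fit()
print(res_subset.nobs, res.nobs)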
# ## Categorical variables
#
# Looking at the summary printed above, notice that ``patsy`` determined
# that elements of *Region* were text strings, so it treated *Region* as a
# categorical variable. ``patsy``'s default is also to include an
# intercept, so one of the *Region* categories was automatically dropped.
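# To confirm (an illustrative check, not in the original notebook) that
# *Region* indeed holds strings:
print(df["Region"].dtype, sorted(df["Region"].unique()))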
#
# If *Region* had been an integer variable that we wanted to treat
# explicitly as categorical, we could have done so by using the ``C()``
# operator:
res = ols(formula="Lottery ~ Literacy + Wealth + C(Region)", data=df).fit()
print(res.params)
# Patsy's more advanced features for categorical variables are discussed
# in: [Patsy: Contrast Coding Systems for categorical
# variables](./contrasts.html)
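# For example (a sketch of one such feature), the reference level of the
# treatment coding can be chosen inside the formula itself:
res = ols(
    formula="Lottery ~ Literacy + Wealth + C(Region, Treatment(reference='E'))",
    data=df,
).fit()
print(res.params)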
# ## Operators
#
# We have already seen that "~" separates the left-hand side of the model
# from the right-hand side, and that "+" adds new columns to the design
# matrix.
#
# ## Removing variables
#
# The "-" sign can be used to remove columns/variables. For instance, we
# can remove the intercept from a model by:
res = ols(formula="Lottery ~ Literacy + Wealth + C(Region) -1 ", data=df).fit()
print(res.params)
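# A quick check (not in the original notebook): no ``Intercept`` term
# remains, so every *Region* level now gets its own coefficient:
print("Intercept" in res.params.index)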
# ## Multiplicative interactions
#
# ":" adds a new column to the design matrix with the interaction of the
# other two columns. "*" will also include the individual columns that were
# multiplied together:
res1 = ols(formula="Lottery ~ Literacy : Wealth - 1", data=df).fit()
res2 = ols(formula="Lottery ~ Literacy * Wealth - 1", data=df).fit()
print(res1.params, "\n")
print(res2.params)
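# To see exactly which columns each operator creates (an illustrative
# aside using ``patsy`` directly; ``patsy`` is introduced formally below):
import patsy
print(patsy.dmatrix("Literacy : Wealth - 1", df).design_info.column_names)
print(patsy.dmatrix("Literacy * Wealth - 1", df).design_info.column_names)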
# Many other things are possible with operators. Please consult the [patsy
# docs](https://patsy.readthedocs.org/en/latest/formulas.html) to learn
# more.
# ## Functions
#
# You can apply vectorized functions to the variables in your model:
res = smf.ols(formula="Lottery ~ np.log(Literacy)", data=df).fit()
print(res.params)
# Define a custom function:
def log_plus_1(x):
    return np.log(x) + 1.0
res = smf.ols(formula="Lottery ~ log_plus_1(Literacy)", data=df).fit()
print(res.params)
# Any function that is in the calling namespace is available to the
# formula.
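# For example (an added illustration, not from the original notebook), a
# de-meaning helper defined in this module is found the same way:
def demean(x):
    return x - x.mean()

res = smf.ols(formula="Lottery ~ demean(Literacy)", data=df).fit()
print(res.params)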
# ## Using formulas with models that do not (yet) support them
#
# Even if a given `statsmodels` function does not support formulas, you
# can still use `patsy`'s formula language to produce design matrices.
# Those matrices can then be fed to the fitting function as `endog` and
# `exog` arguments.
#
# To generate ``numpy`` arrays:
import patsy
f = "Lottery ~ Literacy * Wealth"
y, X = patsy.dmatrices(f, df, return_type="matrix")
print(y[:5])
print(X[:5])
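# The objects returned with ``return_type="matrix"`` are ``patsy``
# ``DesignMatrix`` instances (``numpy`` ndarray subclasses); plain arrays
# can be obtained with ``np.asarray`` if needed (an aside, not in the
# original notebook):
print(type(X), np.asarray(X).shape)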
# To generate pandas data frames:
f = "Lottery ~ Literacy * Wealth"
y, X = patsy.dmatrices(f, df, return_type="dataframe")
print(y[:5])
print(X[:5])
print(sm.OLS(y, X).fit().summary())
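# The same matrices can be fed to any model's ``endog``/``exog``
# interface; for example (illustrative), a GLM with a Gaussian family
# reproduces the OLS coefficients:
print(sm.GLM(y, X, family=sm.families.Gaussian()).fit().params)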