File: formulas.py

#!/usr/bin/env python
# coding: utf-8

# DO NOT EDIT
# Autogenerated from the notebook formulas.ipynb.
# Edit the notebook and then sync the output with this file.
#
# flake8: noqa
# DO NOT EDIT

# # Formulas: Fitting models using R-style formulas

# Since version 0.5.0, ``statsmodels`` allows users to fit statistical
# models using R-style formulas. Internally, ``statsmodels`` uses the
# [patsy](http://patsy.readthedocs.org/) package to convert formulas and
# data to the matrices that are used in model fitting. The formula framework
# is quite powerful; this tutorial only scratches the surface. A full
# description of the formula language can be found in the ``patsy`` docs:
#
# * [Patsy formula language description](http://patsy.readthedocs.org/)
#
# ## Loading modules and functions

import numpy as np  # noqa:F401  needed in namespace for patsy
import statsmodels.api as sm

# #### Import convention

# You can import explicitly from `statsmodels.formula.api`:

from statsmodels.formula.api import ols

# Alternatively, you can just use the `formula` namespace of the main
# `statsmodels.api`.

sm.formula.ols

# Or you can use the following convention:

import statsmodels.formula.api as smf

# These names are just a convenient way to get access to each model's
# `from_formula` classmethod. See, for instance:

sm.OLS.from_formula

# All of the lower case models accept ``formula`` and ``data`` arguments,
# whereas the upper case ones take ``endog`` and ``exog`` design matrices.
# ``formula`` accepts a string that describes the model in terms of a
# ``patsy`` formula. ``data`` takes a [pandas](https://pandas.pydata.org/)
# data frame or any other data structure that defines a ``__getitem__`` for
# variable names, such as a structured array or a dictionary of variables.
#
# ``dir(sm.formula)`` returns the names available in that namespace,
# including the formula-compatible models.
#
# Formula-compatible models have the following generic call signature:
# ``(formula, data, subset=None, *args, **kwargs)``
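#
# A small sketch of this call signature, assuming a synthetic pandas data
# frame `toy` (hypothetical, not the data used in this tutorial). It passes a
# boolean mask as ``subset``, which is expected to restrict the rows of the
# data frame before the model is built.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Ten synthetic observations; `toy`, `x` and `y` are illustrative names.
toy = pd.DataFrame({"x": np.arange(10.0)})
toy["y"] = 1.0 + 3.0 * toy["x"] + np.sin(toy["x"])

# Fit once on all rows and once on the rows where x >= 5.
res_all = smf.ols("y ~ x", data=toy).fit()
res_sub = smf.ols("y ~ x", data=toy, subset=toy["x"] >= 5).fit()

# Number of observations used by each fit.
print(res_all.nobs, res_sub.nobs)
```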

#
# ## OLS regression using formulas
#
# To begin, we fit the linear model described on the [Getting
# Started](./regression_diagnostics.html) page. Download the data, subset
# columns, and list-wise delete to remove missing observations:

dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)

df = dta.data[["Lottery", "Literacy", "Wealth", "Region"]].dropna()
df.head()

# Fit the model:

mod = ols(formula="Lottery ~ Literacy + Wealth + Region", data=df)
res = mod.fit()
print(res.summary())

# ## Categorical variables
#
# Looking at the summary printed above, notice that ``patsy`` determined
# that elements of *Region* were text strings, so it treated *Region* as a
# categorical variable. ``patsy``'s default is also to include an intercept,
# so one of the *Region* categories was automatically dropped to keep the
# design matrix full rank.
#
# If *Region* had been an integer variable that we wanted to treat
# explicitly as categorical, we could have done so by using the ``C()``
# operator:

res = ols(formula="Lottery ~ Literacy + Wealth + C(Region)", data=df).fit()
print(res.params)
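
# Because *Region* is already a string column, the effect of ``C()`` is
# easier to see on an integer column. The sketch below assumes a synthetic
# frame `toy` with a hypothetical integer-coded column `group`.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: `group` holds the integer codes 1, 2 and 3.
toy = pd.DataFrame({
    "group": [1, 2, 3] * 4,
    "y": [1.0, 2.5, 4.1, 0.9, 2.4, 3.8, 1.2, 2.6, 4.0, 1.1, 2.7, 3.9],
})

# Without C(), the integer column enters as a single numeric regressor.
numeric = smf.ols("y ~ group", data=toy).fit()
# With C(), every level except the reference level gets its own column.
categorical = smf.ols("y ~ C(group)", data=toy).fit()

print(list(numeric.params.index))
# ['Intercept', 'group']
print(list(categorical.params.index))
# ['Intercept', 'C(group)[T.2]', 'C(group)[T.3]']
```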

# Patsy's more advanced features for categorical variables are discussed
# in: [Patsy: Contrast Coding Systems for categorical
# variables](./contrasts.html)

# ## Operators
#
# We have already seen that "~" separates the left-hand side of the model
# from the right-hand side, and that "+" adds new columns to the design
# matrix.
#
# ## Removing variables
#
# The "-" sign can be used to remove columns/variables. For instance, we
# can remove the intercept from a model by:

res = ols(formula="Lottery ~ Literacy + Wealth + C(Region) - 1", data=df).fit()
print(res.params)
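
# To see what removing the intercept does to a categorical variable, the
# sketch below builds two design matrices directly with ``patsy.dmatrix``.
# The dictionary `toy` and its key `g` are hypothetical stand-ins for a
# string-valued column such as *Region*.

```python
import patsy

# A dictionary of variables works as `data`; `g` has three levels.
toy = {"g": ["a", "b", "c", "a", "b", "c"]}

with_intercept = patsy.dmatrix("C(g)", toy)
without_intercept = patsy.dmatrix("C(g) - 1", toy)

# With an intercept, the first level is absorbed into the intercept.
print(with_intercept.design_info.column_names)
# ['Intercept', 'C(g)[T.b]', 'C(g)[T.c]']

# Without an intercept, every level gets its own indicator column.
print(without_intercept.design_info.column_names)
# ['C(g)[a]', 'C(g)[b]', 'C(g)[c]']
```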

# ## Multiplicative interactions
#
# ":" adds a new column to the design matrix containing the product of the
# two columns it joins. "*" also includes the individual columns that were
# multiplied together, so ``a * b`` is shorthand for ``a + b + a:b``:

res1 = ols(formula="Lottery ~ Literacy : Wealth - 1", data=df).fit()
res2 = ols(formula="Lottery ~ Literacy * Wealth - 1", data=df).fit()
print(res1.params, "\n")
print(res2.params)
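
# The relationship between ":" and "*" can be checked on the design
# matrices themselves. This is a sketch on synthetic numeric inputs; the
# names `toy`, `a` and `b` are illustrative assumptions.

```python
import numpy as np
import patsy

toy = {
    "a": np.array([1.0, 2.0, 3.0, 4.0]),
    "b": np.array([0.5, 1.5, 2.5, 4.0]),
}

star = patsy.dmatrix("a * b", toy)
expanded = patsy.dmatrix("a + b + a:b", toy)

print(star.design_info.column_names)
# ['Intercept', 'a', 'b', 'a:b']

# "a * b" and "a + b + a:b" produce the same matrix.
print(np.allclose(np.asarray(star), np.asarray(expanded)))

# The interaction column is the elementwise product of the two inputs.
idx = star.design_info.column_names.index("a:b")
print(np.allclose(np.asarray(star)[:, idx], toy["a"] * toy["b"]))
```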

# Many other things are possible with operators. Please consult the [patsy
# docs](https://patsy.readthedocs.org/en/latest/formulas.html) to learn
# more.
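
# One example from the patsy formula language is ``I()``, which lets an
# expression be evaluated as ordinary Python arithmetic instead of being
# read as formula operators. A minimal sketch, assuming a synthetic
# variable `x` that is not part of the tutorial data:

```python
import numpy as np
import patsy

toy = {"x": np.array([1.0, 2.0, 3.0])}

# Inside I(), "**" means numeric exponentiation, so the last column is x squared.
squared = patsy.dmatrix("x + I(x ** 2)", toy)

# Columns: intercept, x, x squared.
print(np.asarray(squared))
```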

# ## Functions
#
# You can apply vectorized functions to the variables in your model:

res = smf.ols(formula="Lottery ~ np.log(Literacy)", data=df).fit()
print(res.params)

# Define a custom function:


def log_plus_1(x):
    return np.log(x) + 1.0


res = smf.ols(formula="Lottery ~ log_plus_1(Literacy)", data=df).fit()
print(res.params)

# Any function that is in the calling namespace is available to the
# formula.

# ## Using formulas with models that do not (yet) support them
#
# Even if a given `statsmodels` function does not support formulas, you
# can still use `patsy`'s formula language to produce design matrices. Those
# matrices can then be fed to the fitting function as `endog` and `exog`
# arguments.
#
# To generate ``numpy`` arrays:

import patsy

f = "Lottery ~ Literacy * Wealth"
y, X = patsy.dmatrices(f, df, return_type="matrix")
print(y[:5])
print(X[:5])

# To generate pandas data frames:

f = "Lottery ~ Literacy * Wealth"
y, X = patsy.dmatrices(f, df, return_type="dataframe")
print(y[:5])
print(X[:5])

print(sm.OLS(y, X).fit().summary())