.. _expert-model-specification:
Model specification for experts and computers
=============================================
.. currentmodule:: patsy
While the formula language is great for interactive model-fitting and
exploratory data analysis, there are times when we want a different or
more systematic interface for creating design matrices. If you ever
find yourself writing code that pastes together bits of strings to
create a formula, then stop! And read this chapter.
Our first option, of course, is that we can go ahead and write some
code to construct our design matrices directly, just like we did in
the old days. Since this is supported directly by :func:`dmatrix` and
:func:`dmatrices`, it also works with any third-party library
functions that use Patsy internally. Just pass in an array_like or
a tuple ``(y_array_like, X_array_like)`` in place of the formula.
.. ipython:: python
from patsy import dmatrix
X = [[1, 10], [1, 20], [1, -2]]
dmatrix(X)
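
The tuple form works the same way with :func:`dmatrices`. Here is a minimal sketch; the arrays ``y`` and ``X`` are made-up illustration data, not anything from a real model:

```python
from patsy import dmatrices

# Made-up outcome and predictor arrays, purely for illustration.
y = [1.0, 2.0, 3.0]
X = [[1, 10], [1, 20], [1, -2]]

# The tuple (y_array_like, X_array_like) takes the place of the formula.
lhs, rhs = dmatrices((y, X))
print(lhs.shape, rhs.shape)
```

Both results come back as two-dimensional design matrices, so a one-dimensional ``y`` ends up as a single column.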
By using a :class:`DesignMatrix` with :class:`DesignInfo` attached, we
can also specify custom names for our custom matrix (or even term
slices and so forth), so that we still get the nice output and such
that Patsy would otherwise provide:
.. ipython:: python
from patsy import DesignMatrix, DesignInfo
design_info = DesignInfo(["Intercept!", "Not intercept!"])
X_dm = DesignMatrix(X, design_info)
dmatrix(X_dm)
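
As a rough sketch of what the attached names buy us (using float input data, which is an assumption made here to keep the example safe), the names survive the trip through :func:`dmatrix`, and we can use them to pick out columns:

```python
import numpy as np
from patsy import DesignInfo, DesignMatrix, dmatrix

info = DesignInfo(["Intercept!", "Not intercept!"])
X_dm = DesignMatrix([[1.0, 10.0], [1.0, 20.0], [1.0, -2.0]], info)

result = dmatrix(X_dm)
names = result.design_info.column_names
# Translate a column name into a slice, then index the plain array with it.
col_slice = result.design_info.slice("Not intercept!")
column = np.asarray(result)[:, col_slice]
print(names, column.ravel())
```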
Or if all we want to do is to specify column names, we could also just
use a :class:`pandas.DataFrame`:
.. ipython:: python
import pandas
df = pandas.DataFrame([[1, 10], [1, 20], [1, -2]],
columns=["Intercept!", "Not intercept!"])
dmatrix(df)
However, there is also a middle ground between pasting together
strings and going back to putting together design matrices out of
string and baling wire. Patsy has a straightforward Python
interface for representing the result of parsing formulas, and you can
use it directly. This lets you keep Patsy's normal advantages --
handling of categorical data and interactions, predictions, term
tracking, etc. -- while using a nice high-level Python API. This might
be useful if, say, you had a GUI with a tick box next to each variable
in your data set, and wanted to construct a formula containing all the
variables that had been checked, while letting Patsy deal with
categorical data handling. This is also the approach you'd take for
stepwise regression, where you need to programmatically add and remove
terms.
Whatever your particular situation, the strategy is this:
#. Construct some factor objects (probably using :class:`LookupFactor` or
:class:`EvalFactor`),
#. Put them into some :class:`Term` objects,
#. Put the :class:`Term` objects into two lists, representing the
left- and right-hand side of your formula,
#. And then wrap the whole thing up in a :class:`ModelDesc`.
(See :ref:`formulas` if you need a refresher on what each of these
things is.)
.. ipython:: python
import numpy as np
from patsy import (ModelDesc, EvalEnvironment, Term, EvalFactor,
LookupFactor, demo_data, dmatrix)
data = demo_data("a", "x")
# LookupFactor takes a dictionary key:
a_lookup = LookupFactor("a")
# EvalFactor takes arbitrary Python code:
x_transform = EvalFactor("np.log(x ** 2)")
# First argument is empty list for dmatrix; we would need to put
# something there if we were calling dmatrices.
desc = ModelDesc([],
[Term([a_lookup]),
Term([x_transform]),
# An interaction:
Term([a_lookup, x_transform])])
# Create the matrix (or pass 'desc' to any statistical library
# function that uses patsy.dmatrix internally):
dmatrix(desc, data)
Notice that no intercept term is included. Implicit intercepts are a
feature of the formula parser, not the underlying representation. If you
want an intercept, include the constant :const:`INTERCEPT` in your
list of terms (which is just sugar for ``Term([])``).
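
Here is a small sketch of the tick-box scenario with an explicit intercept. The data and the ``checked`` list are made up for illustration; only the Patsy names come from the library:

```python
import numpy as np
from patsy import INTERCEPT, LookupFactor, ModelDesc, Term, dmatrix

# Made-up data and a made-up list of "ticked" variable names.
data = {"x1": np.array([1.0, 2.0, 3.0]),
        "x2": np.array([0.5, 0.1, 0.9]),
        "x3": np.array([7.0, 8.0, 9.0])}
checked = ["x1", "x3"]

# One main-effect term per ticked variable, plus an explicit intercept.
terms = [INTERCEPT] + [Term([LookupFactor(name)]) for name in checked]
desc = ModelDesc([], terms)
mat = dmatrix(desc, data)
print(mat.design_info.column_names)

# Dropping a term again, as a stepwise procedure might do:
unwanted = Term([LookupFactor("x3")])
smaller = ModelDesc([], [t for t in terms if t != unwanted])
print(smaller.describe())
```

Because terms compare by the factors they contain, a freshly built ``Term`` is enough to find and remove an existing one.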
.. note::
Another option is to just pass your term lists directly to
:func:`design_matrix_builders`, and skip the :class:`ModelDesc`
entirely -- all of the high-level API functions like :func:`dmatrix`
accept :class:`DesignMatrixBuilder` objects as well as
:class:`ModelDesc` objects.
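
A minimal sketch of that route, assuming a Patsy version whose :func:`design_matrix_builders` accepts an ``eval_env`` argument (the data here is made up, and the returned builder objects are treated as opaque):

```python
import numpy as np
from patsy import (EvalEnvironment, INTERCEPT, LookupFactor, Term,
                   build_design_matrices, design_matrix_builders)

data = {"x": np.array([1.0, 2.0, 3.0])}
rhs_terms = [INTERCEPT, Term([LookupFactor("x")])]

# design_matrix_builders takes a list of term lists and a zero-argument
# callable that returns a fresh iterator over chunks of data.
builders = design_matrix_builders([rhs_terms], lambda: iter([data]),
                                  eval_env=EvalEnvironment.capture())

# One builder comes back per term list; build the matrices from them.
(mat,) = build_design_matrices(builders, data)
print(np.asarray(mat))
```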
Example: say our data has 100 different numerical columns that we want
to include in our design -- and we also have a few categorical
variables with a more complex interaction structure. Here's one
solution:
.. literalinclude:: _examples/add_predictors.py
.. ipython:: python
:suppress:
with open("_examples/add_predictors.py") as f:
exec(f.read())
.. ipython:: python
extra_predictors = ["x%s" % (i,) for i in range(10)]
desc = add_predictors("np.log(y) ~ a*b + c:d", extra_predictors)
desc.describe()
The factor protocol
-------------------
If :class:`LookupFactor` and :class:`EvalFactor` aren't enough for
you, then you can define your own factor class.
The full interface looks like this:
.. class:: factor_protocol
.. method:: name()
This must return a short string describing this factor. It will
be used to create column names, among other things.
.. attribute:: origin
A :class:`patsy.Origin` if this factor has one; otherwise, just
set it to None.
.. method:: __eq__(obj)
__ne__(obj)
__hash__()
If your factor will ever contain categorical data or
participate in interactions, then it's important to make sure
you've defined :meth:`~object.__eq__` and
:meth:`~object.__ne__` and that your type is
:term:`hashable`. These methods will determine which factors
Patsy considers equal for purposes of redundancy elimination.
.. method:: memorize_passes_needed(state, eval_env)
Return the number of passes through the data that this factor
will need in order to set up any :ref:`stateful-transforms`.
If you don't want to support stateful transforms, just return
0. In this case, :meth:`memorize_chunk` and
:meth:`memorize_finish` will never be called.
`state` is an (initially) empty dict that is maintained by the
builder machinery and that we can use however we like. It
will be passed back in to all memorization and evaluation
methods.
`eval_env` is an :class:`EvalEnvironment` object, describing
the Python environment where the factor is being evaluated.
.. method:: memorize_chunk(state, which_pass, data)
Called repeatedly with each 'chunk' of data produced by the
`data_iter_maker` passed to :func:`design_matrix_builders`.
`state` is the state dictionary. `which_pass` will be zero on
the first pass through the data, and eventually reach the
value you returned from :meth:`memorize_passes_needed`, minus
one.
Return value is ignored.
.. method:: memorize_finish(state, which_pass)
Called once after each pass through the data.
Return value is ignored.
.. method:: eval(state, data)
Evaluate this factor on the given `data`. Return value should
ideally be a 1-d or 2-d array or :func:`Categorical` object,
but this will be checked and converted as needed.
In addition, factor objects should be pickleable/unpickleable, so as
to allow models containing them to be pickled/unpickled. (Or, if for
some reason your factor objects are *not* safely pickleable, you
should consider giving them a `__getstate__` method which raises an
error, so that any users who attempt to pickle a model containing
your factors will get a clear failure immediately, instead of only
later when they try to unpickle.)
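
One way to get that early failure, sketched with a hypothetical ``UnpicklableFactor`` class (only the ``__getstate__`` trick matters here; the rest of the factor protocol is left out):

```python
import pickle


class UnpicklableFactor(object):
    """Hypothetical factor holding something that cannot be pickled safely."""

    def __init__(self, name):
        self._name = name

    def name(self):
        return self._name

    def __getstate__(self):
        # Fail at pickling time rather than later at unpickling time.
        raise NotImplementedError("UnpicklableFactor cannot be pickled")


try:
    pickle.dumps(UnpicklableFactor("x"))
    pickling_failed = False
except NotImplementedError:
    pickling_failed = True
print(pickling_failed)
```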
.. warning:: Do not store evaluation-related state in
attributes of your factor object! The same factor object may
appear in two totally different formulas, or if you have two
factor objects which compare equal, then only one may be
executed, and which one this is may vary randomly depending
on how :func:`build_design_matrices` is called! Use only the
`state` dictionary for storing state.
The lifecycle of a factor object therefore looks like:
#. Initialized.
#. :meth:`memorize_passes_needed` is called.
#. ``for i in range(passes_needed):``
#. :meth:`memorize_chunk` is called one or more times
#. :meth:`memorize_finish` is called
#. :meth:`eval` is called zero or more times.
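
To make the lifecycle concrete, here is a sketch of a custom factor that needs one memorization pass. ``CenteredFactor`` is a hypothetical name invented for this example, and the data is made up; it keeps all of its state in the `state` dictionary, as the warning above requires:

```python
import numpy as np
from patsy import ModelDesc, Term, dmatrix


class CenteredFactor(object):
    """Hypothetical factor: looks up a column and subtracts its mean."""

    origin = None

    def __init__(self, varname):
        self._varname = varname

    def name(self):
        return "centered(%s)" % (self._varname,)

    def __eq__(self, other):
        return (isinstance(other, CenteredFactor)
                and self._varname == other._varname)

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        return hash((CenteredFactor, self._varname))

    def memorize_passes_needed(self, state, eval_env):
        # One pass over the data is enough to compute a mean.
        state["sum"] = 0.0
        state["count"] = 0
        return 1

    def memorize_chunk(self, state, which_pass, data):
        values = np.asarray(data[self._varname], dtype=float)
        state["sum"] += values.sum()
        state["count"] += values.size

    def memorize_finish(self, state, which_pass):
        state["mean"] = state["sum"] / state["count"]

    def eval(self, state, data):
        values = np.asarray(data[self._varname], dtype=float)
        return values - state["mean"]


data = {"x": [1.0, 2.0, 6.0]}
mat = dmatrix(ModelDesc([], [Term([CenteredFactor("x")])]), data)
print(mat.design_info.column_names, np.asarray(mat).ravel())
```

This is only meant to show the protocol in action; for real centering, a stateful transform inside an :class:`EvalFactor` is probably the simpler choice.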
Alternative formula implementations
-----------------------------------
Even if you hate Patsy's formulas altogether, to the extent that
you're going to go and implement your own competing mechanism for
defining formulas, you can still use Patsy-based
interfaces. Unfortunately, this isn't *quite* as clean as we'd like,
because for now there's no way to define a custom
:class:`DesignMatrixBuilder`. So you do still have to go through
Patsy's formula-building machinery. But, this machinery simply
passes numerical data through unchanged, so in extremis you can:
* Define a special factor object that simply defers to your existing
machinery
* Define the magic ``__patsy_get_model_desc__`` method on your
formula object. :func:`dmatrix` and friends check for the presence
of this method on any object that is passed in, and if found, it is
called (passing in the :class:`EvalEnvironment`), and expected to
return a :class:`ModelDesc`. And your :class:`ModelDesc` can, of
course, include your special factor object(s).
Put together, it looks something like this:
.. code-block:: python

  from patsy import ModelDesc, Term, dmatrices

  class MyAlternativeFactor(object):
      # A factor object that simply returns the design matrix computed
      # by the alternative formula machinery
      def __init__(self, alternative_formula, side):
          self.alternative_formula = alternative_formula
          self.side = side

      def name(self):
          return self.side

      def memorize_passes_needed(self, state, eval_env):
          return 0

      def eval(self, state, data):
          return self.alternative_formula.get_matrix(self.side, data)

  class MyAlternativeFormula(object):
      ...

      def __patsy_get_model_desc__(self, eval_env):
          return ModelDesc([Term([MyAlternativeFactor(self, side="left")])],
                           [Term([MyAlternativeFactor(self, side="right")])])

  my_formula = MyAlternativeFormula(...)
  # This description has a left-hand side, so use dmatrices, not dmatrix
  dmatrices(my_formula, data)

The only downside to this approach is that you can't control the names
of individual columns. (A workaround would be to create multiple terms
each with its own factor that returns a different piece of your
overall matrix.) If this is a problem for you, though, then let's talk
-- we can probably work something out.
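
The multiple-terms workaround might look something like the sketch below. Both ``ColumnFactor`` and ``ToyFormula`` (with its ``get_column`` method) are hypothetical stand-ins invented for this example, and the data is made up:

```python
import numpy as np
from patsy import ModelDesc, Term, dmatrix


class ColumnFactor(object):
    """Hypothetical factor returning one named column of the overall matrix."""

    origin = None

    def __init__(self, formula, column_name):
        self.formula = formula
        self.column_name = column_name

    def name(self):
        # Patsy uses this string as the column name.
        return self.column_name

    def __eq__(self, other):
        return (isinstance(other, ColumnFactor)
                and self.formula is other.formula
                and self.column_name == other.column_name)

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        return hash((ColumnFactor, id(self.formula), self.column_name))

    def memorize_passes_needed(self, state, eval_env):
        return 0

    def eval(self, state, data):
        return self.formula.get_column(self.column_name, data)


class ToyFormula(object):
    """Hypothetical stand-in for an alternative formula mechanism."""

    def __init__(self, column_names):
        self.column_names = column_names

    def get_column(self, column_name, data):
        # A real implementation would compute the column; we just look it up.
        return np.asarray(data[column_name], dtype=float)

    def __patsy_get_model_desc__(self, eval_env):
        # One term (and one factor) per column, so each gets its own name.
        return ModelDesc([], [Term([ColumnFactor(self, name)])
                              for name in self.column_names])


data = {"height": [1.5, 1.8], "weight": [60.0, 80.0]}
mat = dmatrix(ToyFormula(["height", "weight"]), data)
print(mat.design_info.column_names)
```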