.. _stateful-transforms:

Stateful transforms
===================

.. currentmodule:: patsy

There's a subtle problem that sometimes bites people when working with
formulas. Suppose that I have some numerical data called ``x``, and I
would like to center it before fitting. The obvious way would be to
write::

   y ~ I(x - np.mean(x))  # BROKEN! Don't do this!

or, even better, we could package it up into a function:

.. ipython:: python

   def naive_center(x):  # BROKEN! don't use!
       x = np.asarray(x)
       return x - np.mean(x)

and then write our formula like::

   y ~ naive_center(x)

Why is this a bad idea? Let's set up an example.

.. ipython:: python

   import numpy as np
   from patsy import dmatrix, build_design_matrices, incr_dbuilder
   data = {"x": [1, 2, 3, 4]}

Now we can build a design matrix and see what we get:

.. ipython:: python

   mat = dmatrix("naive_center(x)", data)
   mat

Those numbers look correct, and in fact they are correct. If all we're
going to do with this model is call :func:`dmatrix` once, then
everything is fine -- which is what makes this problem so insidious.

Often we want to do more with a model than this. For instance, we
might find some new data, and want to feed it into our model to make
predictions. To do this, though, we first need to reapply the same
transformation, like so:

.. ipython:: python

   new_data = {"x": [5, 6, 7, 8]}
   # Broken!
   build_design_matrices([mat.design_info], new_data)[0]

So it's clear what's happened here -- Patsy has centered the new
data, just like it centered the old data. But if you think about what
this means statistically, it makes no sense. According to this, the
new data point where x is 5 will behave exactly like the old data
point where x is 1, because they both produce the same input to the
actual model.

The problem is what it means to apply "the same transformation". Here,
what we really want to do is to subtract the mean *of the original
data* from the new data.
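
To make the difference concrete, here is a minimal plain-Python sketch
that does not involve Patsy at all. The helper ``mean`` and the
variable names are purely illustrative; the numbers are the same ones
used in the examples above.

```python
# Plain-Python illustration only; none of these names come from Patsy.
def mean(values):
    return sum(values) / len(values)

old_x = [1, 2, 3, 4]
new_x = [5, 6, 7, 8]

# What naive centering does: re-center the new data around its own mean.
recentered = [v - mean(new_x) for v in new_x]

# What we actually want: reuse the mean memorized from the original data.
old_mean = mean(old_x)
shifted = [v - old_mean for v in new_x]

print(recentered)  # [-1.5, -0.5, 0.5, 1.5]
print(shifted)     # [2.5, 3.5, 4.5, 5.5]
```

The first result is indistinguishable from the centered original data,
while the second correctly places the new points above the old ones.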

Patsy's solution is called a *stateful transform*. These look like
ordinary functions, but they perform a bit of magic to remember the
state of the original data, and use it in transforming new data.
Several useful stateful transforms are included out of the box,
including one called :func:`center`.

Using :func:`center` instead of :func:`naive_center` produces the same
correct result for our original matrix. It's used in exactly the same
way:

.. ipython:: python

   fixed_mat = dmatrix("center(x)", data)
   fixed_mat

But if we then feed in our new data, we also get out the correct result:

.. ipython:: python

   # Correct!
   build_design_matrices([fixed_mat.design_info], new_data)[0]

Another situation where we need some stateful transform magic is when
we are working with data that is too large to fit into memory at
once. To handle such cases, Patsy allows you to set up a design
matrix while working your way incrementally through the data. But if we
use :func:`naive_center` when building a matrix incrementally, then it
centers each *chunk* of data, not the data as a whole. (Of course,
depending on how your data is distributed, this might end up being
just similar enough for you to miss the problem until it's too late.)

.. ipython:: python

   data_chunked = [{"x": data["x"][:2]},
                   {"x": data["x"][2:]}]
   dinfo = incr_dbuilder("naive_center(x)", lambda: iter(data_chunked))
   # Broken!
   np.vstack([build_design_matrices([dinfo], chunk)[0]
              for chunk in data_chunked])

But if we use the proper stateful transform, this just works:

.. ipython:: python

   dinfo = incr_dbuilder("center(x)", lambda: iter(data_chunked))
   # Correct!
   np.vstack([build_design_matrices([dinfo], chunk)[0]
              for chunk in data_chunked])

.. note::

   Under the hood, the way this works is that :func:`incr_dbuilder`
   iterates through the data once to calculate the mean, and then we
   use :func:`build_design_matrices` to iterate through it a second
   time creating our design matrix. While taking two passes through a
   large data set may be slow, there's really no other way to
   accomplish what the user asked for. The good news is that Patsy is
   smart enough to make only the minimum number of passes
   necessary. For instance, in our example with :func:`naive_center`
   above, :func:`incr_dbuilder` would not have done a full pass
   through the data at all. And if you have multiple stateful
   transforms in the same formula, then Patsy will process them in
   parallel in a single pass.

And, of course, we can use the resulting :class:`DesignInfo` object
for prediction as well:

.. ipython:: python

   # Correct!
   build_design_matrices([dinfo], new_data)[0]
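
The single-pass behavior mentioned in the note above can be pictured
with a small plain-Python sketch. This is not Patsy's actual
implementation; ``RunningMean``, ``RunningMax`` and
``memorize_in_one_pass`` are hypothetical names invented for this
illustration. The idea is simply that every chunk is read once and
handed to each memorizing object in turn.

```python
# Hypothetical sketch of sharing one pass between several memorizers.
# None of these classes or functions are part of Patsy.
class RunningMean:
    def __init__(self):
        self.total = 0.0
        self.count = 0
        self.mean = None

    def memorize_chunk(self, x):
        self.total += sum(x)
        self.count += len(x)

    def memorize_finish(self):
        self.mean = self.total / self.count


class RunningMax:
    def __init__(self):
        self.max = None

    def memorize_chunk(self, x):
        for v in x:
            if self.max is None or v > self.max:
                self.max = v

    def memorize_finish(self):
        pass  # nothing left to compute


def memorize_in_one_pass(states, chunks):
    """Feed every chunk to every state, reading each chunk only once."""
    chunks_read = 0
    for chunk in chunks:
        chunks_read += 1
        for state in states:
            state.memorize_chunk(chunk)
    for state in states:
        state.memorize_finish()
    return chunks_read


mean_state, max_state = RunningMean(), RunningMax()
chunks_read = memorize_in_one_pass([mean_state, max_state],
                                   iter([[1, 2], [3, 4]]))
print(chunks_read, mean_state.mean, max_state.max)  # 2 2.5 4
```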

In fact, Patsy's stateful transform handling is clever enough that
it can support arbitrary mixing of stateful transforms with other
Python code. E.g., if :func:`center` and :func:`spline` were both
stateful transforms, then even a silly formula like this will be
handled 100% correctly::

  y ~ I(spline(center(x1)) + center(x2))

However, it isn't perfect -- there are two things you have to be
careful of. Let's put them in red:

.. warning::

   If you are unwise enough to ignore this section, write a function
   like `naive_center` above, and use it in a formula, then Patsy will
   not notice. If you use that formula with :func:`incr_dbuilders` or
   for predictions, then you will just silently get the wrong
   results. We have a plan to detect such cases, but it isn't
   implemented yet (and in any case can never be 100% reliable). So be
   careful!

.. warning::

   Even if you do use a "real" stateful transform like :func:`center`
   or :func:`standardize`, you still have to make sure that Patsy can
   "see" that you are using such a transform. Currently the rule is
   that you must access the stateful transform function using a
   simple, bare variable reference, without any dots or other
   lookups::

     dmatrix("y ~ center(x)", data)  # okay
     asdf = patsy.center
     dmatrix("y ~ asdf(x)", data)  # okay
     dmatrix("y ~ patsy.center(x)", data)  # BROKEN! DON'T DO THIS!
     funcs = {"center": patsy.center}
     dmatrix("y ~ funcs['center'](x)", data)  # BROKEN! DON'T DO THIS!

Builtin stateful transforms
---------------------------

There are a number of builtin stateful transforms beyond
:func:`center`; see :ref:`stateful transforms
<stateful-transforms-list>` in the API reference for a complete list.

.. _stateful-transform-protocol:

Defining a stateful transform
-----------------------------

You can also easily define your own stateful transforms. The first
step is to define a class which fulfills the stateful transform
protocol. The lifecycle of a stateful transform object is as follows:

#. An instance of your type will be constructed.
#. :meth:`memorize_chunk` will be called one or more times.
#. :meth:`memorize_finish` will be called once.
#. :meth:`transform` will be called one or more times, either on the
   same data that was initially passed to :meth:`memorize_chunk` or on
   different data. You can trust that any non-data arguments
   will be identical between calls to :meth:`memorize_chunk` and
   :meth:`transform`.

And here are the methods and call signatures you need to define:

.. class:: stateful_transform_protocol

  .. method:: __init__()
     :noindex:

     It must be possible to create an instance of the class by calling
     the constructor with no arguments.

  .. method:: memorize_chunk(*args, **kwargs)

     Update any internal state, based on the data passed into
     `memorize_chunk`.

  .. method:: memorize_finish()

     Do any housekeeping you want to do between the last call to
     :meth:`memorize_chunk` and the first call to
     :meth:`transform`. For example, if you are computing some summary
     statistic that cannot be done incrementally, then your
     :meth:`memorize_chunk` method might just store the data that's
     passed in, and then :meth:`memorize_finish` could compute the
     summary statistic and delete the stored data to free up the
     associated memory.

  .. method:: transform(*args, **kwargs)

     This method should transform the input data passed to it. It
     should be deterministic, and it should be "point-wise", in the
     sense that when passed an array it performs an independent
     transformation on each data point that is not affected by any
     other data points passed to :meth:`transform`.
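
As a rough illustration of the protocol, and in particular of the
store-then-summarize pattern described under :meth:`memorize_finish`,
here is a sketch of a median-centering transform. ``MedianCenter`` is a
hypothetical name, not something shipped with Patsy, and to keep the
sketch dependency-free it works on plain lists and is driven by hand
rather than through a formula.

```python
import statistics


class MedianCenter:
    """Hypothetical transform: subtract the median of the memorized data."""

    def __init__(self):
        self._seen = []      # raw values, kept only until memorize_finish
        self._median = None

    def memorize_chunk(self, x):
        # A median cannot be updated incrementally, so just store the data.
        self._seen.extend(x)

    def memorize_finish(self):
        self._median = statistics.median(self._seen)
        self._seen = None    # drop the stored data to free memory

    def transform(self, x):
        # Point-wise: each output depends only on its own input value and
        # the memorized median, never on the other values in x.
        return [v - self._median for v in x]


# Drive the lifecycle by hand: construct, memorize, finish, transform.
t = MedianCenter()
for chunk in ([1, 2], [3, 10]):
    t.memorize_chunk(chunk)
t.memorize_finish()

print(t.transform([5, 6]))  # [2.5, 3.5]

# Transforming values together or one at a time gives the same answer.
assert t.transform([5, 6]) == t.transform([5]) + t.transform([6])
```

A version intended for real use would presumably need to handle NumPy
arrays and other input types, which this sketch does not attempt.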

Then once you have created your class, pass it to
:func:`stateful_transform` to create a callable stateful transform
object suitable for use inside or outside formulas.

Here's a simple example of how you might implement a working version
of :func:`center` (though it's less robust and featureful than the
real builtin)::

  import numpy as np
  import patsy

  class MyExampleCenter(object):
      def __init__(self):
          self._total = 0
          self._count = 0
          self._mean = None

      def memorize_chunk(self, x):
          self._total += np.sum(x)
          self._count += len(x)

      def memorize_finish(self):
          self._mean = self._total * 1. / self._count

      def transform(self, x):
          return x - self._mean

  my_example_center = patsy.stateful_transform(MyExampleCenter)
  print(my_example_center(np.array([1, 2, 3])))

But of course, if you come up with any useful ones, please let us know
so we can incorporate them into patsy itself!