.. _R-comparison:

Differences between R and Patsy formulas
===========================================

.. currentmodule:: patsy

Patsy has a very high degree of compatibility with R. Almost any
formula you would use in R will also work in Patsy -- with a few
caveats.

.. note:: All R quirks described herein were last verified with R
   2.15.0.

Differences from R:

- Most obviously, we both support using arbitrary code to perform
variable transformations, but in Patsy this code is written in
Python, not R.
- Patsy has no ``%in%``. In R, ``a %in% b`` is identical to
``b:a``. Patsy only supports the ``b:a`` version of this syntax.
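
  For example, where R would accept ``a %in% b``, the equivalent
  Patsy formula spells the nesting out directly. A minimal sketch,
  using hypothetical two-level factors ``a`` and ``b``::

    # Python:
    from patsy import dmatrix
    a = ["a1", "a1", "a2", "a2"]
    b = ["b1", "b2", "b1", "b2"]
    dmatrix("0 + b:a")  # what R would write as 0 + a %in% b
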
- In Patsy, only ``**`` can be used for exponentiation. In R, both
``^`` and ``**`` can be used for exponentiation, i.e., you can write
either ``(a + b)^2`` or ``(a + b)**2``. In Patsy (as in Python
generally), only ``**`` indicates exponentiation. The formula parser
does not treat ``^`` specially; if it appears, the expression
containing it falls through to Python evaluation, where ``^`` is the
binary XOR operator.
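
  A minimal sketch of the Patsy spelling, using hypothetical numeric
  predictors ``x1`` and ``x2``::

    # Python:
    from patsy import dmatrix
    x1 = [1, 2, 3, 4]
    x2 = [5, 6, 7, 8]
    # ** expands to main effects plus their interaction:
    # Intercept, x1, x2, and x1:x2
    dmatrix("(x1 + x2) ** 2").design_info.column_names
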
- In Patsy, the left-hand side of a formula uses the same
evaluation rules as the right-hand side. In R, the left hand side is
treated as R code, so a formula like ``y1 + y2 ~ x1 + x2`` actually
regresses the *sum* of ``y1`` and ``y2`` onto the *set of
predictors* ``x1`` and ``x2``. In Patsy, the only difference
between the left-hand side and the right-hand side is that there is
no automatic intercept added to the left-hand side. (In this regard
Patsy is similar to the R enhanced formula package `Formula
<http://cran.r-project.org/web/packages/Formula/index.html>`_.)
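
  A minimal sketch of the difference, using hypothetical variables
  ``y1``, ``y2``, and ``x``::

    # Python:
    from patsy import dmatrices
    y1 = [1, 2, 3, 4]
    y2 = [5, 6, 7, 8]
    x = [9, 10, 11, 12]
    lhs, rhs = dmatrices("y1 + y2 ~ x")
    # lhs has two columns, y1 and y2, and no intercept; R would
    # instead regress the single summed column y1 + y2 on x.
    lhs.design_info.column_names
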
- Patsy produces a different column ordering for formulas involving
numeric predictors. In R, there are two rules for term ordering:
first, lower-order interactions are sorted before higher-order
interactions, and second, interactions of the same order are listed
in whatever order they appeared in the formula. In Patsy, we add
another rule: terms are first grouped together based on which
numeric factors they include. Then within each group, we use the
same ordering as R.
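
  A sketch of the grouping rule, using a hypothetical numeric ``x``
  and categorical ``a``::

    # Python:
    from patsy import dmatrix
    x = [1, 2, 3, 4]
    a = ["a1", "a1", "a2", "a2"]
    # Expected ordering: Intercept, a, x, a:x -- the intercept and a
    # involve no numeric factors, so they are grouped first; R would
    # instead give Intercept, x, a, a:x.
    dmatrix("x + a + a:x").design_info.term_names
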
- Patsy has more rigorous handling of the presence or absence of
the intercept term. In R, the rules for deciding whether to include
an intercept are somewhat idiosyncratic and can ignore things like
parentheses. To understand the difference, first
consider the formula ``a + (b - a)``. In both Patsy and R, we
first evaluate the ``(b - a)`` part; since there is no ``a`` term to
remove, this simplifies to just ``b``. We then evaluate ``a + b``:
the end result is a model which contains an ``a`` term in it.

Now consider the formula ``1 + (b - 1)``. In Patsy, this is
analogous to the case above: first ``(b - 1)`` is reduced to just ``b``,
and then ``1 + b`` produces a model with intercept included. In R, the
parentheses are ignored, and ``1 + (b - 1)`` gives a model that does
*not* include the intercept.
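
  This is easy to check from Python. A minimal sketch, with a
  hypothetical two-level factor ``b``::

    # Python:
    from patsy import dmatrix
    b = ["b1", "b2", "b1", "b2"]
    # The parentheses are respected: (b - 1) reduces to b, and then
    # 1 + b yields a design matrix that includes the intercept.
    dmatrix("1 + (b - 1)").design_info.column_names
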
This can be slightly more confusing when it comes to the implicit
intercept term. In Patsy, this is handled exactly as if the
right-hand side of each formula had an invisible ``"1 +"`` inserted at
the beginning. Therefore in Patsy, these formulas are different::

  # Python:
  dmatrices("y ~ b - 1")    # equivalent to 1 + b - 1: no intercept
  dmatrices("y ~ (b - 1)")  # equivalent to 1 + (b - 1): has intercept

In R, these two formulas are equivalent.

- Patsy has a more accurate algorithm for deciding whether to use a
full- or reduced-rank coding scheme for categorical factors. There
are two situations in which R's coding algorithm for categorical
variables can become confused and produce over- or under-specified
model matrices. Patsy, so far as we are aware, produces correctly
specified matrices in all cases. It's unlikely that you'll run into
these in actual usage, but they're worth mentioning. To illustrate,
let's define ``a`` and ``b`` as categorical predictors, each with 2
levels:

.. code-block:: rconsole

   # R:
   > a <- factor(c("a1", "a1", "a2", "a2"))
   > b <- factor(c("b1", "b2", "b1", "b2"))

.. ipython:: python
   :suppress:

   a = ["a1", "a1", "a2", "a2"]
   b = ["b1", "b2", "b1", "b2"]
   from patsy import dmatrix

The first problem occurs for formulas like ``1 + a:b``. This produces
a model matrix with rank 4, just like many other formulas that
include ``a:b``, such as ``0 + a:b``, ``1 + a + a:b``, and ``a*b``:

.. code-block:: rconsole

   # R:
   > qr(model.matrix(~ 1 + a:b))$rank
   [1] 4

However, the matrix produced for this formula has 5 columns, meaning
that it is over-specified: one of its columns is redundant:

.. code-block:: rconsole

   # R:
   > mat <- model.matrix(~ 1 + a:b)
   > ncol(mat)
   [1] 5

The underlying problem is that R's algorithm does not pay attention
to 'non-local' redundancies -- it will adjust its coding to avoid a
redundancy between two terms of degree-n, or a term of degree-n and
one of degree-(n+1), but it is blind to a redundancy between a term
of degree-n and one of degree-(n+2), as we have here.
Patsy's algorithm has no such limitation:

.. ipython:: python

   # Python:
   a = ["a1", "a1", "a2", "a2"]
   b = ["b1", "b2", "b1", "b2"]
   mat = dmatrix("1 + a:b")
   mat.shape[1]

To produce this result, it codes ``a:b`` using the same columns that
would be used to code ``b + a:b`` in the formula ``"1 + b + a:b"``.

The second problem occurs for formulas involving numeric
predictors. Effectively, when determining coding schemes, R assumes
that all factors are categorical. So for the formula ``0 + a:c +
a:b``, where ``c`` is a third categorical factor, R will notice that
if it used a full-rank coding for both ``c`` and ``b``, then both
terms would be collinear with ``a``, and
thus each other. Therefore, it encodes ``c`` with a full-rank
encoding, and uses a reduced-rank encoding for ``b``. (And the ``0 +``
lets it avoid the previous bug.) So far, so good.
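
  Patsy arrives at the same 6-column matrix for this all-categorical
  case. A sketch, defining ``c`` as a hypothetical third two-level
  factor::

    # Python:
    from patsy import dmatrix
    a = ["a1", "a1", "a2", "a2"]
    b = ["b1", "b2", "b1", "b2"]
    c = ["c1", "c2", "c2", "c1"]  # hypothetical third factor
    mat = dmatrix("0 + a:c + a:b")
    mat.shape[1]  # expected: 6, agreeing with R here
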
But now consider the formula ``0 + a:x + a:b``, where ``x`` is
numeric. Here, ``a:x`` and ``a:b`` will not be collinear, even if we do
use a full-rank encoding for ``b``. Therefore, we *should* use a
full-rank encoding for ``b``, and produce a model matrix with 6
columns. But in fact, R gives us only 4:

.. code-block:: rconsole

   # R:
   > x <- c(1, 2, 3, 4)
   > mat <- model.matrix(~ 0 + a:x + a:b)
   > ncol(mat)
   [1] 4

The problem is that it cannot tell the difference between ``0 + a:x +
a:b`` and ``0 + a:c + a:b``: it uses the same coding for both,
whether it's appropriate or not.

(The alert reader might wonder whether this bug could be triggered
by a simpler formula, like ``0 + x + b``. It turns out that R's
``do_modelmatrix`` function has a special case in which, for
first-order interactions only, it *will* peek at the type of the
data before deciding on a coding scheme.)

Patsy always checks whether each factor is categorical or numeric
before it makes coding decisions, and thus handles this case
correctly:

.. ipython:: python

   # Python:
   x = [1, 2, 3, 4]
   mat = dmatrix("0 + a:x + a:b")
   mat.shape[1]