File: default_formula_blueprint.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/blueprint-formula-default.R, R/mold.R
\name{default_formula_blueprint}
\alias{default_formula_blueprint}
\alias{mold.formula}
\title{Default formula blueprint}
\usage{
default_formula_blueprint(
  intercept = FALSE,
  allow_novel_levels = FALSE,
  indicators = "traditional",
  composition = "tibble"
)

\method{mold}{formula}(formula, data, ..., blueprint = NULL)
}
\arguments{
\item{intercept}{A logical. Should an intercept be included in the
processed data? This information is used by the \code{process} function
in the \code{mold} and \code{forge} function list.}

\item{allow_novel_levels}{A logical. Should novel factor levels be allowed at
prediction time? This information is used by the \code{clean} function in the
\code{forge} function list, and is passed on to \code{\link[=scream]{scream()}}.}

\item{indicators}{A single character string. Control how factors are
expanded into dummy variable indicator columns. One of:
\itemize{
\item \code{"traditional"} - The default. Create dummy variables using the
traditional \code{\link[=model.matrix]{model.matrix()}} infrastructure. Generally this creates
\code{K - 1} indicator columns for each factor, where \code{K} is the number of
levels in that factor.
\item \code{"none"} - Leave factor variables alone. No expansion is done.
\item \code{"one_hot"} - Create dummy variables using a one-hot encoding approach
that expands unordered factors into all \code{K} indicator columns, rather than
\code{K - 1}.
}}

\item{composition}{Either "tibble", "matrix", or "dgCMatrix" for the format
of the processed predictors. If "matrix" or "dgCMatrix" is chosen, all of
the predictors must be numeric after the preprocessing method has been
applied; otherwise an error is thrown.}

\item{formula}{A formula specifying the predictors and the outcomes.}

\item{data}{A data frame or matrix containing the outcomes and predictors.}

\item{...}{Not used.}

\item{blueprint}{A preprocessing \code{blueprint}. If left as \code{NULL}, then a
\code{\link[=default_formula_blueprint]{default_formula_blueprint()}} is used.}
}
\value{
For \code{default_formula_blueprint()}, a formula blueprint.
}
\description{
This page holds the details for the formula preprocessing blueprint. This
is the blueprint that \code{mold()} uses by default if \code{x} is a formula.
}
\details{
Although it matches base R, the way factors are expanded into dummy
variables when \code{indicators = "traditional"} and an intercept is \emph{not}
present is not always intuitive, so it is documented here.
\itemize{
\item When an intercept is present, factors are expanded into \code{K-1} new columns,
where \code{K} is the number of levels in the factor.
\item When an intercept is \emph{not} present, the first factor is expanded into
all \code{K} columns (one-hot encoding), and the remaining factors are expanded
into \code{K-1} columns. This behavior ensures that meaningful predictions can
be made for the reference level of the first factor, but is not the exact
"no intercept" model that was requested. Without this behavior, predictions
for the reference level of the first factor would always be forced to \code{0}
when there is no intercept.
}

Offsets can be included in the formula method through the use of the inline
function \code{\link[stats:offset]{stats::offset()}}. These are returned as a tibble with 1 column
named \code{".offset"} in the \verb{$extras$offset} slot of the return value.
}
\section{Mold}{


When \code{mold()} is used with the default formula blueprint:
\itemize{
\item Predictors
\itemize{
\item The RHS of the \code{formula} is isolated, and converted to its own
one-sided formula: \code{~ RHS}.
\item Runs \code{\link[stats:model.frame]{stats::model.frame()}} on the RHS formula and uses \code{data}.
\item If \code{indicators = "traditional"}, it then runs \code{\link[stats:model.matrix]{stats::model.matrix()}}
on the result.
\item If \code{indicators = "none"}, factors are removed before \code{model.matrix()}
is run, and then added back afterwards. No interactions or inline
functions involving factors are allowed.
\item If \code{indicators = "one_hot"}, it then runs \code{\link[stats:model.matrix]{stats::model.matrix()}} on the
result using a contrast function that creates indicator columns for all
levels of all factors.
\item If any offsets are present from using \code{offset()}, then they are
extracted with \code{\link[=model_offset]{model_offset()}}.
\item If \code{intercept = TRUE}, adds an intercept column.
\item Coerces the result of the above steps to a tibble.
}
\item Outcomes
\itemize{
\item The LHS of the \code{formula} is isolated, and converted to its own
one-sided formula: \code{~ LHS}.
\item Runs \code{\link[stats:model.frame]{stats::model.frame()}} on the LHS formula and uses \code{data}.
\item Coerces the result of the above steps to a tibble.
}
}
}

\section{Forge}{


When \code{forge()} is used with the default formula blueprint:
\itemize{
\item It calls \code{\link[=shrink]{shrink()}} to trim \code{new_data} to only the required columns and
coerce \code{new_data} to a tibble.
\item It calls \code{\link[=scream]{scream()}} to perform validation on the structure of the columns
of \code{new_data}.
\item Predictors
\itemize{
\item It runs \code{\link[stats:model.frame]{stats::model.frame()}} on \code{new_data} using the stored terms
object corresponding to the \emph{predictors}.
\item If, in the original \code{\link[=mold]{mold()}} call, \code{indicators = "traditional"} was
set, it then runs \code{\link[stats:model.matrix]{stats::model.matrix()}} on the result.
\item If, in the original \code{\link[=mold]{mold()}} call, \code{indicators = "none"} was set, it
runs \code{\link[stats:model.matrix]{stats::model.matrix()}} on the result without the factor columns,
and then adds them on afterwards.
\item If, in the original \code{\link[=mold]{mold()}} call, \code{indicators = "one_hot"} was set, it
runs \code{\link[stats:model.matrix]{stats::model.matrix()}} on the result with a contrast function that
includes indicators for all levels of all factor columns.
\item If any offsets are present from using \code{offset()} in the original call
to \code{\link[=mold]{mold()}}, then they are extracted with \code{\link[=model_offset]{model_offset()}}.
\item If \code{intercept = TRUE} in the original call to \code{\link[=mold]{mold()}}, then an
intercept column is added.
\item It coerces the result of the above steps to a tibble.
}
\item Outcomes
\itemize{
\item It runs \code{\link[stats:model.frame]{stats::model.frame()}} on \code{new_data} using the
stored terms object corresponding to the \emph{outcomes}.
\item Coerces the result to a tibble.
}
}
}

\section{Differences From Base R}{


There are a number of differences from base R regarding how formulas are
processed by \code{mold()} that require some explanation.

Multivariate outcomes can be specified on the LHS using syntax that is
similar to the RHS (i.e. \code{outcome_1 + outcome_2 ~ predictors}).
If any complex calculations are done on the LHS and they return matrices
(like \code{\link[stats:poly]{stats::poly()}}), then those matrices are flattened into multiple
columns of the tibble after the call to \code{model.frame()}. While this is
possible, it is not recommended, and if a large amount of preprocessing is
required on the outcomes, then you are better off
using a \code{\link[recipes:recipe]{recipes::recipe()}}.

Global variables are \emph{not} allowed in the formula. An error will be thrown
if they are included. All terms in the formula should come from \code{data}. If
you need to use inline functions in the formula, the safest way to do so is
to prefix them with their package name, like \code{pkg::fn()}. This ensures that
the function will always be available at \code{mold()} (fit) and \code{forge()}
(prediction) time. That said, if the package is \emph{attached}
(i.e. with \code{library()}), then you should be able to use the inline function
without the prefix.

By default, intercepts are \emph{not} included in the predictor output from the
formula. To include an intercept, set
\code{blueprint = default_formula_blueprint(intercept = TRUE)}. The rationale
for this is that many packages either always require or never allow an
intercept (for example, the \code{earth} package), and they do a large amount of
extra work to keep the user from supplying one or removing it. This
interface standardizes all of that flexibility in one place.
}

\examples{
# ---------------------------------------------------------------------------

data("hardhat-example-data")

# ---------------------------------------------------------------------------
# Formula Example

# Call mold() with the training data
processed <- mold(
  log(num_1) ~ num_2 + fac_1,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)

# Then, call forge() with the blueprint and the test data
# to have it preprocess the test data in the same way
forge(example_test, processed$blueprint)

# Use `outcomes = TRUE` to also extract the preprocessed outcome
forge(example_test, processed$blueprint, outcomes = TRUE)

# ---------------------------------------------------------------------------
# Factors without an intercept

# No intercept is added by default
processed <- mold(num_1 ~ fac_1 + fac_2, example_train)

# So, for factor columns, the first factor is completely expanded into all
# `K` columns (the number of levels), and the subsequent factors are expanded
# into `K - 1` columns.
processed$predictors

# In the above example, `fac_1` is expanded into all three columns, but
# `fac_2` is not. This behavior comes from `model.matrix()`, and is somewhat
# known in the R community, but can lead to a model that is difficult to
# interpret since the corresponding p-values are testing wildly different
# hypotheses.
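
# The same expansion can be sketched in base R alone. This is only an
# illustration: `toy`, `f1`, and `f2` are made-up names and are not part of
# the hardhat example data.
toy <- data.frame(
  f1 = factor(c("a", "b", "c", "a")),
  f2 = factor(c("x", "y", "x", "y"))
)

# With an intercept, each factor contributes `K - 1` indicator columns
with_intercept <- colnames(stats::model.matrix(~ f1 + f2, toy))
with_intercept

# With the intercept removed, `f1` gets all `K` columns, `f2` keeps `K - 1`
without_intercept <- colnames(stats::model.matrix(~ f1 + f2 + 0, toy))
without_intercept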

# To get all indicators for all columns (irrespective of the intercept),
# use the `indicators = "one_hot"` option
processed <- mold(
  num_1 ~ fac_1 + fac_2,
  example_train,
  blueprint = default_formula_blueprint(indicators = "one_hot")
)

processed$predictors

# It is not possible to construct a no-intercept model that expands all
# factors into `K - 1` columns using the formula method. If required, a
# recipe could be used to construct this model.

# ---------------------------------------------------------------------------
# Global variables

y <- rep(1, times = nrow(example_train))

# In base R, global variables are allowed in a model formula
frame <- model.frame(fac_1 ~ y + num_2, example_train)
head(frame)

# mold() does not allow them, and throws an error
try(mold(fac_1 ~ y + num_2, example_train))

# ---------------------------------------------------------------------------
# Dummy variables and interactions

# By default, factor columns are expanded
# and interactions are created, both by
# calling `model.matrix()`. Some models (like
# tree based models) can take factors directly
# but still might want to use the formula method.
# In those cases, set `indicators = "none"` to not
# run `model.matrix()` on factor columns. Interactions
# are still allowed and are run on numeric columns.

bp_no_indicators <- default_formula_blueprint(indicators = "none")

processed <- mold(
  ~ fac_1 + num_1:num_2,
  example_train,
  blueprint = bp_no_indicators
)

processed$predictors
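
# A quick check, reusing `processed` from the call above, that `fac_1` was
# left alone and is still a factor rather than a set of indicator columns
is.factor(processed$predictors$fac_1)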

# An informative error is thrown when `indicators = "none"` and
# factors are present in interaction terms or in inline functions
try(mold(num_1 ~ num_2:fac_1, example_train, blueprint = bp_no_indicators))
try(mold(num_1 ~ paste0(fac_1), example_train, blueprint = bp_no_indicators))

# ---------------------------------------------------------------------------
# Multivariate outcomes

# Multivariate formulas can be specified easily
processed <- mold(num_1 + log(num_2) ~ fac_1, example_train)
processed$outcomes

# Inline functions on the LHS are run, but any matrix
# output is flattened (like what happens in `model.matrix()`)
# (essentially this means you don't wind up with columns
# in the tibble that are matrices)
processed <- mold(poly(num_2, degree = 2) ~ fac_1, example_train)
processed$outcomes

# TRUE
ncol(processed$outcomes) == 2

# Multivariate formulas specified in mold()
# carry over into forge()
forge(example_test, processed$blueprint, outcomes = TRUE)

# ---------------------------------------------------------------------------
# Offsets

# Offsets are handled specially in base R, so they deserve special
# treatment here as well. You can add offsets using the inline function
# `offset()`
processed <- mold(num_1 ~ offset(num_2) + fac_1, example_train)

processed$extras$offset

# Multiple offsets can be included, and they get added together
processed <- mold(
  num_1 ~ offset(num_2) + offset(num_3),
  example_train
)

identical(
  processed$extras$offset$.offset,
  example_train$num_2 + example_train$num_3
)

# Forging test data will also require
# and include the offset
forge(example_test, processed$blueprint)

# ---------------------------------------------------------------------------
# Intercept only

# Because `1` and `0` are intercept-modifying terms, they are
# not allowed in the formula and are instead controlled by the
# `intercept` argument of the blueprint. To use an intercept-only
# formula, you should supply `NULL` on the RHS of the formula.
mold(
  ~NULL,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)
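
# For comparison, a small sketch of how the `intercept` argument changes the
# predictor columns for an ordinary formula. It assumes hardhat is attached
# and `example_train` is loaded, as in the rest of these examples;
# `no_intercept` and `has_intercept` are illustrative names only.
no_intercept <- mold(num_1 ~ num_2, example_train)
has_intercept <- mold(
  num_1 ~ num_2,
  example_train,
  blueprint = default_formula_blueprint(intercept = TRUE)
)

# Only the second result carries an "(Intercept)" column
names(no_intercept$predictors)
names(has_intercept$predictors)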

# ---------------------------------------------------------------------------
# Matrix output for predictors

# You can change the `composition` of the predictor data set
bp <- default_formula_blueprint(composition = "dgCMatrix")
processed <- mold(log(num_1) ~ num_2 + fac_1, example_train, blueprint = bp)
class(processed$predictors)
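
# The other non-tibble option, "matrix", can be sketched the same way. Here
# the predictors should come back as a base R matrix; `bp_matrix` and
# `processed_matrix` are illustrative names only.
bp_matrix <- default_formula_blueprint(composition = "matrix")
processed_matrix <- mold(
  log(num_1) ~ num_2 + fac_1,
  example_train,
  blueprint = bp_matrix
)
is.matrix(processed_matrix$predictors)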
}