File: roles.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/roles.R
\name{roles}
\alias{roles}
\alias{add_role}
\alias{update_role}
\alias{remove_role}
\title{Manually Alter Roles}
\usage{
add_role(recipe, ..., new_role = "predictor", new_type = NULL)

update_role(recipe, ..., new_role = "predictor", old_role = NULL)

remove_role(recipe, ..., old_role)
}
\arguments{
\item{recipe}{An existing \code{\link[=recipe]{recipe()}}.}

\item{...}{One or more selector functions to choose which variables are
being assigned a role. See \code{\link[=selections]{selections()}} for more details.}

\item{new_role}{A character string for a single role.}

\item{new_type}{A character string for the specific type that the variable should
be identified as. If left as \code{NULL}, the type is automatically identified
as the \emph{first} type you see for that variable in \code{summary(recipe)}.}

\item{old_role}{A character string for the specific role to update for the
variables selected by \code{...}. \code{update_role()} accepts a \code{NULL} as long as the
variables have only a single role.}
}
\value{
An updated recipe object.
}
\description{
\code{update_role()} alters an existing role in the recipe or assigns an initial
role to variables that do not yet have a declared role.

\code{add_role()} adds an \emph{additional} role to variables that already have a role
in the recipe. It does not overwrite old roles, as a single variable can have
multiple roles.

\code{remove_role()} eliminates a single existing role in the recipe.
}
\details{
Variables can have any arbitrary role (see the examples) but there are two
special standard roles, \code{"predictor"} and \code{"outcome"}. These two roles are
typically required when fitting a model.

\code{update_role()} should be used when a variable doesn't currently have a role
in the recipe, or to replace an \code{old_role} with a \code{new_role}. \code{add_role()}
only adds additional roles to variables that already have roles and will
throw an error when the current role is missing (i.e. \code{NA}).

When using \code{add_role()}, if a variable is selected that already has the
\code{new_role}, a warning is emitted and that variable is skipped so no duplicate
roles are added.

Adding or updating roles is a useful way to group certain variables that
don't fall in the standard \code{"predictor"} bucket. You can perform a step
on all of the variables that have a custom role with the selector
\code{\link[=has_role]{has_role()}}.
\subsection{Effects of non-standard roles}{

Recipes can label and retain column(s) of your data set that should not be treated as outcomes or predictors. A unique identifier column or some other ancillary data could be used to troubleshoot issues during model development but may not be either an outcome or predictor.

For example, the \code{modeldata::biomass} dataset has a column named \code{sample} with information about the specific sample type. We can change the role of that column:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(recipes)

data(biomass, package = "modeldata")
biomass_train <- biomass[1:100,]
biomass_test <- biomass[101:200,]

rec <- recipe(HHV ~ ., data = biomass_train) \%>\%
  update_role(sample, new_role = "id variable") \%>\%
  step_center(carbon)

rec <- prep(rec, biomass_train)
}\if{html}{\out{</div>}}

This means that \code{sample} is no longer treated as a \code{"predictor"} (the default role for columns on the right-hand side of the formula supplied to \code{recipe()}) and won't be used in model fitting or analysis, but will still be retained in the data set.

If you really aren't using \code{sample} in your recipe, we recommend that you instead remove \code{sample} from your dataset before passing it to \code{recipe()}. This is because recipes assumes that all non-standard roles are required at \code{bake()} time (or \code{predict()} time, if you are using a workflow). Since you didn't use \code{sample} in any steps of the recipe, you might think that you don't need to pass it to \code{bake()}, but this isn't true because recipes doesn't know that you didn't use it:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{biomass_test$sample <- NULL

bake(rec, biomass_test)
#> Error in `bake()`:
#> ! The following required columns are missing from `new_data`: "sample".
#> i These columns have one of the following roles, which are required at `bake()` time: "id variable".
#> i If these roles are not required at `bake()` time, use `update_role_requirements(role = "your_role", bake = FALSE)`.
}\if{html}{\out{</div>}}

As mentioned above, the best way to avoid this issue is to not use a role at all and instead remove the \code{sample} column from \code{biomass} before calling \code{recipe()}. In general, predictors and non-standard roles that are supplied to \code{recipe()} should be present at both \code{prep()} and \code{bake()} time.

If you can't remove \code{sample} for some reason, then the second best way to get around this issue is to tell recipes that the \code{"id variable"} role isn't required at \code{bake()} time. You can do that by using \code{update_role_requirements()}:

\if{html}{\out{<div class="sourceCode r">}}\preformatted{rec <- recipe(HHV ~ ., data = biomass_train) \%>\%
  update_role(sample, new_role = "id variable") \%>\%
  update_role_requirements("id variable", bake = FALSE) \%>\%
  step_center(carbon)

rec <- prep(rec, biomass_train)

# No errors!
biomass_test_baked <- bake(rec, biomass_test)
}\if{html}{\out{</div>}}

It should be very rare that you need this feature.
}
}
\examples{
\dontshow{if (rlang::is_installed("modeldata")) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
library(recipes)
data(biomass, package = "modeldata")

# Using the formula method, roles are created for any outcomes and predictors:
recipe(HHV ~ ., data = biomass) \%>\%
  summary()

# However, `sample` and `dataset` aren't predictors. Since they already have
# roles, `update_role()` can be used to change them to any arbitrary role:
recipe(HHV ~ ., data = biomass) \%>\%
  update_role(sample, new_role = "id variable") \%>\%
  update_role(dataset, new_role = "splitting variable") \%>\%
  summary()

# `update_role()` cannot set a role to NA, use `remove_role()` for that
\dontrun{
recipe(HHV ~ ., data = biomass) \%>\%
  update_role(sample, new_role = NA_character_)
}

# ------------------------------------------------------------------------------

# Variables can have more than one role. `add_role()` can be used
# if the column already has at least one role:
recipe(HHV ~ ., data = biomass) \%>\%
  add_role(carbon, sulfur, new_role = "something") \%>\%
  summary()
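
# A sketch of the skip-on-duplicate behavior described in the details section.
# `carbon` already has the "something" role when the second `add_role()` runs,
# so it should be skipped with a warning while `sulfur` still gains the role.
# (`rec_dup` is only an illustrative object name.)
rec_dup <- add_role(recipe(HHV ~ ., data = biomass), carbon, new_role = "something")
rec_dup <- add_role(rec_dup, carbon, sulfur, new_role = "something")
summary(rec_dup)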

# `update_role()` has an argument called `old_role` that is required to
# unambiguously update a role when the column currently has multiple roles.
recipe(HHV ~ ., data = biomass) \%>\%
  add_role(carbon, new_role = "something") \%>\%
  update_role(carbon, new_role = "something else", old_role = "something") \%>\%
  summary()

# `carbon` has two roles after `add_role()`, so the final `update_role()` fails
# since `old_role` was not given.
\dontrun{
recipe(HHV ~ ., data = biomass) \%>\%
  add_role(carbon, sulfur, new_role = "something") \%>\%
  update_role(carbon, new_role = "something else")
}

# ------------------------------------------------------------------------------

# To remove a role, `remove_role()` can be used to remove a single role.
recipe(HHV ~ ., data = biomass) \%>\%
  add_role(carbon, new_role = "something") \%>\%
  remove_role(carbon, old_role = "something") \%>\%
  summary()

# To remove all roles, call `remove_role()` multiple times to reset to `NA`
recipe(HHV ~ ., data = biomass) \%>\%
  add_role(carbon, new_role = "something") \%>\%
  remove_role(carbon, old_role = "something") \%>\%
  remove_role(carbon, old_role = "predictor") \%>\%
  summary()
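
# ------------------------------------------------------------------------------

# A minimal sketch of using a custom role as a selector, as mentioned in the
# details section. The role name "element" and the object name `rec_elem` are
# illustrative choices, not something defined by the package.
rec_elem <- recipe(HHV ~ ., data = biomass)
rec_elem <- add_role(rec_elem, carbon, hydrogen, new_role = "element")

# `has_role()` selects every variable carrying the custom role for the step
rec_elem <- step_center(rec_elem, has_role("element"))
rec_elem <- prep(rec_elem, training = biomass)

# Inspect which columns the centering step picked up
tidy(rec_elem, number = 1)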

# ------------------------------------------------------------------------------

# If the formula method is not used, all columns have a missing role:
recipe(biomass) \%>\%
  summary()
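
# A sketch of the difference described in the details section when roles are
# missing: `update_role()` can assign an initial role, whereas `add_role()` is
# documented to throw an error. (`rec_na` and `res` are illustrative names;
# `try()` is used so the expected error does not stop the examples.)
rec_na <- update_role(recipe(biomass), carbon, new_role = "predictor")
summary(rec_na)

res <- try(
  add_role(recipe(biomass), carbon, new_role = "something"),
  silent = TRUE
)
inherits(res, "try-error")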
\dontshow{\}) # examplesIf}
}