File: package.Rmd

package info (click to toggle)
r-cran-hardhat 1.2.0%2Bdfsg-1
  • links: PTS, VCS
  • area: main
  • in suites: bookworm
  • size: 1,656 kB
  • sloc: sh: 13; makefile: 2
file content (475 lines) | stat: -rw-r--r-- 18,784 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
---
title: "Creating Modeling Packages With hardhat"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Creating Modeling Packages With hardhat}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

options(rlang_backtrace_on_error = "none")
```

```{r setup}
library(hardhat)
library(tibble)
library(modeldata)

data(penguins)
penguins <- na.omit(penguins)
```

# Introduction

The goal of this vignette is to teach you how to use `mold()` and `forge()` in a modeling package. This is the intended use of these functions, even though they can also be called interactively. Creating a new modeling package has two main stages: creating the model fitting function, and implementing a predict method. The stages break down like this:

- Stage 1 - Model Fitting
  
  - Create a model constructor.
  
  - Create a fitting implementation function.
  
  - Create a common bridge to go between high level user facing methods and the lower level constructor and implementation function.
  
  - Create a user facing function with methods for data frame, matrix, formula, and recipe inputs.
  
I imagine that this comes together as a few internal pieces that power the user facing methods for your model.

<br></br>

```{r out.width = '100%', echo = FALSE}
knitr::include_graphics("../man/figures/Fitting.png")
```

<br></br>
  
- Stage 2 - Model Prediction

  - Create one or more prediction implementation functions, varying by the `"type"` of prediction to make.
  
  - Create a common bridge between a high level predict method and the lower level prediction implementation functions.
  
  - Create a user facing predict method.

In this case, there are 2 user facing methods. Many models have multiple internal implementation functions that you'll switch between, depending on the `"type"`.

<br></br>

```{r out.width = '100%', echo = FALSE}
knitr::include_graphics("../man/figures/Prediction.png")
```

<br></br>

The end result is a single high level modeling function that has methods for multiple different "interfaces", and a corresponding predict method to make predictions using one of these models along with new data (by "interfaces", I just mean the different types of inputs, so: data frame, matrix, formula and recipe).

There are obviously other things that you might want your modeling package to do. For instance, you might implement a `plot()` or `summary()` method. But the two stages described here are necessary for almost every model, and they involve the inputs and outputs that hardhat helps the most with.

## What's Our Model?

We will use the underlying `lm()` infrastructure, `lm.fit()`, to create our model. Linear regression should be recognizable to many, so we can focus on understanding how `mold()` and `forge()` fit in the bigger picture, rather than trying to understand how the model works.

`lm.fit()` takes `x` and `y` directly, rather than using the formula method. It will serve as the core part of our modeling implementation function. More generally, it is easiest if the core implementation function of your algorithm takes `x` and `y` in this manner, since that is how `mold()` will standardize the inputs.

We will call the model `simple_lm()`. It won't have all the features of normal linear regression (weights, offsets, etc), but will serve as a nice dummy model to show off features you get with hardhat.

# Model Fitting

## Model Constructor

The first thing we need is a modeling _constructor_. Constructors are very simple functions that creates new objects of our model class. In the arguments to the constructor you supply all of the individual pieces, and it wraps them up into a model object. The hardhat function `new_model()` can help with creating that.

A model constructor should:

- Have the name `new_<model_class>()`.

- Take the required model elements as named arguments, including a required `blueprint`.

- Validate the types of the new elements.

- Pass the named elements on to `new_model()` along with setting the class to `"<model_class>"`.

If you want to learn more about the details of constructors and creating S3 classes, take a look at the S3 section in [Advanced R](https://adv-r.hadley.nz/s3.html#s3-constructor).

```{r}
new_simple_lm <- function(coefs, coef_names, blueprint) {
  
  if (!is.numeric(coefs)) {
    stop("`coefs` should be a numeric vector.", call. = FALSE)
  }
  
  if (!is.character(coef_names)) {
    stop("`coef_names` should be a character vector.", call. = FALSE)
  }
  
  if (length(coefs) != length(coef_names)) {
    stop("`coefs` and `coef_names` must have the same length.")
  }
  
  new_model(
    coefs = coefs, 
    coef_names = coef_names,
    blueprint = blueprint, 
    class = "simple_lm"
  )
}
```

A `"simple_lm"` object has just enough information to make numeric predictions on new data, but you can store other things here as well to enable your model object to work with extra post-fitting functionality.

We can test this by manually generating a model object. Along with the custom class we provided, this object also has a `"hardhat_model"` class. There is a very simple print method for objects of this type. Specifically, it prints the name of the class at the top, and only prints out the custom elements (i.e. not the `blueprint`).

```{r}
manual_model <- new_simple_lm(1, "my_coef", default_xy_blueprint())

manual_model

names(manual_model)

manual_model$blueprint
```

## Model Fitting Implementation

The implementation function is where the hard work is done. I generally recommend naming it `<model_class>_impl()`. It should accept predictors and outcomes in whatever form is required for the algorithm, run the algorithm, and return a named list of the new elements you added to the model constructor. You might also have arguments for extra options that can be used to tweak the internal algorithm.

```{r}
simple_lm_impl <- function(predictors, outcomes) {
  lm_fit <- lm.fit(predictors, outcomes)
  
  coefs <- lm_fit$coefficients
  
  coef_names <- names(coefs)
  coefs <- unname(coefs)
  
  list(
    coefs = coefs,
    coef_names = coef_names
  )
}
```

This simple linear regression implementation just calls `lm.fit()` with `x = predictors` and `y = outcomes`. `lm.fit()` expects a matrix of predictors and a vector of outcomes (at least for univariate regression). In a moment we will discuss how to create those.

```{r}
predictors <- as.matrix(subset(penguins, select = bill_length_mm))
outcomes <- penguins$body_mass_g

simple_lm_impl(predictors, outcomes)
```

## Model Fitting Bridge

Now that we have our constructor and our implementation function, we can create a common function that will be used in all of our top level methods (for data frames, matrices, formulas, and recipes). It will call the implementation function, and then use that information along with the blueprint to create a new instance of our model. It should have an argument for the output from a call to `mold()`, here I've called that `processed`. In that object will be (at a minimum) the predictors, outcomes, and blueprint. You might also have arguments for additional options to pass on to the implementation function.

The bridge function should take the standardized predictors and outcomes and convert them to the lower level types that the implementation function requires. The `predictors` and `outcomes` that are returned from `mold()` will _always_ be data frames, so in this case we can convert them to matrices and vectors directly for use in the lower level function.

This is also a good place to use some of hardhat's validation functions. In this case, we always expect the outcome to have a single column since this is a univariate model, so we can use `validate_outcomes_is_univariate()` to enforce that.

```{r}
simple_lm_bridge <- function(processed) {
  
  validate_outcomes_are_univariate(processed$outcomes)
  
  predictors <- as.matrix(processed$predictors)
  outcomes <- processed$outcomes[[1]]
  
  fit <- simple_lm_impl(predictors, outcomes)
  
  new_simple_lm(
    coefs = fit$coefs,
    coef_names = fit$coef_names,
    blueprint = processed$blueprint
  )
}
```

At this point, we can simulate user input and pass it on to our bridge to run a model.

```{r}
# Simulate formula interface
processed_1 <- mold(bill_length_mm ~ body_mass_g + species, penguins)

# Simulate xy interface
processed_2 <- mold(x = penguins["body_mass_g"], y = penguins$bill_length_mm)

simple_lm_bridge(processed_1)

simple_lm_bridge(processed_2)
```

Multiple outcomes are an error:

```{r, error=TRUE}
multi_outcome <- mold(bill_length_mm + bill_depth_mm ~ body_mass_g + species, penguins)

simple_lm_bridge(multi_outcome)
```

## User Facing Fitting Function

With all of the pieces in place, we have everything we need to create our high level modeling interface. This should be a generic function, generally with methods for data frames, matrices, formulas, and recipes. Each method should call `mold()` with the method specific inputs to run the preprocessing, and then pass off to the bridge function to run the actual model. It is also good practice to provide a default method with a nice error message for unknown types.

```{r}
# Generic
simple_lm <- function(x, ...) {
  UseMethod("simple_lm")
}

# Default
simple_lm.default <- function(x, ...) {
  stop(
    "`simple_lm()` is not defined for a '", class(x)[1], "'.", 
    call. = FALSE
  )
}

# XY method - data frame
simple_lm.data.frame <- function(x, y, ...) {
  processed <- mold(x, y)
  simple_lm_bridge(processed)
}

# XY method - matrix
simple_lm.matrix <- function(x, y, ...) {
  processed <- mold(x, y)
  simple_lm_bridge(processed)
}

# Formula method
simple_lm.formula <- function(formula, data, ...) {
  processed <- mold(formula, data)
  simple_lm_bridge(processed)
}

# Recipe method
simple_lm.recipe <- function(x, data, ...) {
  processed <- mold(x, data)
  simple_lm_bridge(processed)
}
```

Let's give it a try:

```{r}
predictors <- penguins[c("bill_length_mm", "bill_depth_mm")]
outcomes_vec <- penguins$body_mass_g
outcomes_df <- penguins["body_mass_g"]

# Vector outcome
simple_lm(predictors, outcomes_vec)

# 1 column data frame outcome
simple_lm(predictors, outcomes_df)

# Formula interface
simple_lm(body_mass_g ~ bill_length_mm + bill_depth_mm, penguins)
```

We can use preprocessing as well, and it is handled by `mold()`.

```{r, warning=FALSE, message=FALSE}
library(recipes)

# - Log a predictor
# - Generate dummy variables for factors
simple_lm(body_mass_g ~ log(bill_length_mm) + species, penguins)

# Same, but with a recipe
rec <- recipe(body_mass_g ~ bill_length_mm + species, penguins) %>%
  step_log(bill_length_mm) %>%
  step_dummy(species, one_hot = TRUE)

simple_lm(rec, penguins)
```

### Adding an Intercept Option

You might have noticed that our linear regression isn't adding an intercept. Generally, with linear regression models we will want a default intercept added on. To accomplish this, we can add an `intercept` argument to our user facing function, and then use that to tweak the blueprint that would otherwise be created for you automatically.

```{r}
simple_lm <- function(x, ...) {
  UseMethod("simple_lm")
}

simple_lm.data.frame <- function(x, y, intercept = TRUE, ...) {
  blueprint <- default_xy_blueprint(intercept = intercept)
  processed <- mold(x, y, blueprint = blueprint)
  simple_lm_bridge(processed)
}

simple_lm.matrix <- function(x, y, intercept = TRUE,...) {
  blueprint <- default_xy_blueprint(intercept = intercept)
  processed <- mold(x, y, blueprint = blueprint)
  simple_lm_bridge(processed)
}

simple_lm.formula <- function(formula, data, intercept = TRUE, ...) {
  blueprint <- default_formula_blueprint(intercept = intercept)
  processed <- mold(formula, data, blueprint = blueprint)
  simple_lm_bridge(processed)
}

simple_lm.recipe <- function(x, data, intercept = TRUE, ...) {
  blueprint <- default_recipe_blueprint(intercept = intercept)
  processed <- mold(x, data, blueprint = blueprint)
  simple_lm_bridge(processed)
}
```

```{r}
# By default an intercept is included
simple_lm(predictors, outcomes_df)

# But the user can turn this off
simple_lm(body_mass_g ~ log(bill_length_mm) + species, penguins, intercept = FALSE)
```

Note that even the formula method respects this `intercept` argument. To recap, by default `mold()` will _not_ automatically add an intercept for any method, including the formula method.

# Model Prediction

## Prediction Implementation

On the prediction side, we need implementation functions like for fitting our model. These vary based on the `"type"` argument to `predict()`. The `"type"` might be `"numeric"` for numeric predictions as we will use here, or it could be `"class"` for hard class predictions, `"prob"` for class probabilities, and more. A set of recommended names for `"type"` can be found in the [Model Predictions](https://tidymodels.github.io/model-implementation-principles/model-predictions.html#input-data) section of the implementation principles.

For our model, we will focus on only returning numeric predictions. I generally like to name these prediction implementation functions `predict_<model_class>_<type>()`. The arguments for the implementation functions should include the model object and the predictors in the form that your prediction algorithm expects (here, a matrix).

Also used here is another hardhat function for standardizing prediction output, `spruce_numeric()`. This function tidies up the numeric response output, and automatically standardizes it to match the recommendations in the principles guide. The output is always a tibble, and for the `"numeric"` type it has 1 column, `.pred`.

```{r}
predict_simple_lm_numeric <- function(object, predictors) {
  
  coefs <- object$coefs
  
  pred <- as.vector(predictors %*% coefs)
  
  out <- spruce_numeric(pred)
  
  out
}
```

To test it, we will have to run a model and call `forge()` on the output manually. The higher level user facing function will do this automatically.

```{r}
model <- simple_lm(bill_length_mm ~ body_mass_g + species, penguins)

predictors <- forge(penguins, model$blueprint)$predictors
predictors <- as.matrix(predictors)

predict_simple_lm_numeric(model, predictors)
```

## Prediction Bridge

A prediction bridge converts the standardized `predictors` into the lower level type that the prediction implementation functions expect. `predictors` here is always a data frame, and is part of the return value of a call to `forge()`. Since the prediction implementation function takes a matrix, we convert it to that here.

Additionally, it should `switch()` on the `type` argument to decide which of the prediction implementation functions to call. Here, when `type == "numeric"`, `predict_simple_lm_numeric()` is called.

I also like using `rlang::arg_match()` to validate that `type` is one of the accepted prediction types. This has an advantage over `match.arg()` in that partial matches are not allowed, and the error messages are a bit nicer.

```{r}
predict_simple_lm_bridge <- function(type, object, predictors) {
  
  type <- rlang::arg_match(type, "numeric")
  
  predictors <- as.matrix(predictors)
  
  switch(
    type,
    numeric = predict_simple_lm_numeric(object, predictors)
  )
}
```

Let's test:

```{r, error=TRUE}
model <- simple_lm(bill_length_mm ~ body_mass_g + species, penguins)

# Pass in the data frame
predictors <- forge(penguins, model$blueprint)$predictors

predict_simple_lm_bridge("numeric", model, predictors)

# Partial matches are an error
predict_simple_lm_bridge("numer", model, predictors)
```

## User Facing Prediction Function

Finally, we can create an S3 method for the generic `predict()` function. To match the modeling principles, it should use `new_data` to accept a matrix or data frame of new predictors.

The first thing that the `predict()` method should do is call `forge()` with the `new_data` and the `blueprint` that we attached to our `simple_lm` model at fit time. This performs the required preprocessing on the new data, and checks that the _type_ of the data frame supplied to `new_data` matches the type of data frame supplied at fit time. This is one of the most valuable features of `forge()`, as it adds a large amount of robustness to your `predict()`ion function. You will see some examples of this at the end of the vignette.

After calling `forge()`, it should pass off to the bridge function to call the correct prediction function based on the type.

Finally, it is good practice to call the hardhat function, `validate_prediction_size()`, with the return value and the original `new_data` to ensure that the number of rows in the output are the same as the number of rows of the input. If a prediction cannot be made for a row of `new_data`, an `NA` value should be placed there instead. Mainly, this validation function is a check on the model developer to ensure that you always return output with a sane length.

```{r}
predict.simple_lm <- function(object, new_data, type = "numeric", ...) {
  
  # Enforces column order, type, column names, etc
  processed <- forge(new_data, object$blueprint)
  
  out <- predict_simple_lm_bridge(type, object, processed$predictors)
  
  validate_prediction_size(out, new_data)
  
  out
}
```

# Final Testing

Finally, we can test out our top level modeling function along with its corresponding `predict()` method.

```{r}
model <- simple_lm(bill_length_mm ~ log(body_mass_g) + species, penguins)

predict(model, penguins)
```

By using `forge()`, you automatically get powerful type checking to ensure that the `new_data` is in a form that you expect.

```{r, warning=TRUE, error=TRUE}
# `new_data` isn't a data frame
predict(model, penguins$species)

# Missing a required column
predict(model, subset(penguins, select = -body_mass_g))

# In this case, 'species' is a character, 
# but can be losslessy converted to a factor.
# That happens for you automatically and silently.
penguins_chr_species <- transform(penguins, species = as.character(species))

predict(model, penguins_chr_species)

# Slightly different from above. Here, 'species' is a character, 
# AND has an extra unexpected factor level. It is 
# removed with a warning, but you still get a factor 
# with the correct levels
penguins_chr_bad_species <- penguins_chr_species
penguins_chr_bad_species$species[1] <- "new_level"

predict(model, penguins_chr_bad_species)

# This case throws an error.
# Here, 'species' is a double and
# when it should have been a factor.
# You can't cast a double to a factor!
penguins_dbl_species <- transform(penguins, species = 1)

predict(model, penguins_dbl_species)
```