1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156
|
#' Scaling Numeric Data to a Specific Range
#'
#' `step_range` creates a *specification* of a recipe
#' step that will normalize numeric data to be within a pre-defined
#' range of values.
#'
#' @inheritParams step_center
#' @param ... One or more selector functions to choose which
#' variables will be scaled. See [selections()] for more
#' details. For the `tidy` method, these are not currently
#' used.
#' @param role Not used by this step since no new variables are
#' created.
#' @param min A single numeric value for the smallest value in the
#' range.
#' @param max A single numeric value for the largest value in the
#' range.
#' @param ranges A character vector of variables that will be
#' normalized. Note that this is ignored until the values are
#' determined by [prep.recipe()]. Setting this value will
#' be ineffective.
#' @return An updated version of `recipe` with the new step
#' added to the sequence of existing steps (if any). For the
#' `tidy` method, a tibble with columns `terms` (the
#' selectors or variables selected), `min`, and `max`.
#' @keywords datagen
#' @concept preprocessing
#' @concept normalization_methods
#' @export
#' @details When a new data point is outside of the ranges seen in
#' the training set, the new values are truncated at `min` or
#' `max`.
#' @examples
#' library(modeldata)
#' data(biomass)
#'
#' biomass_tr <- biomass[biomass$dataset == "Training",]
#' biomass_te <- biomass[biomass$dataset == "Testing",]
#'
#' rec <- recipe(HHV ~ carbon + hydrogen + oxygen + nitrogen + sulfur,
#' data = biomass_tr)
#'
#' ranged_trans <- rec %>%
#' step_range(carbon, hydrogen)
#'
#' ranged_obj <- prep(ranged_trans, training = biomass_tr)
#'
#' transformed_te <- bake(ranged_obj, biomass_te)
#'
#' biomass_te[1:10, names(transformed_te)]
#' transformed_te
#'
#' tidy(ranged_trans, number = 1)
#' tidy(ranged_obj, number = 1)
step_range <-
function(recipe,
...,
role = NA,
trained = FALSE,
min = 0,
max = 1,
ranges = NULL,
skip = FALSE,
id = rand_id("range")) {
add_step(
recipe,
step_range_new(
terms = ellipse_check(...),
role = role,
trained = trained,
min = min,
max = max,
ranges = ranges,
skip = skip,
id = id
)
)
}
step_range_new <-
function(terms, role, trained, min, max, ranges, skip, id) {
step(
subclass = "range",
terms = terms,
role = role,
trained = trained,
min = min,
max = max,
ranges = ranges,
skip = skip,
id = id
)
}
#' @export
prep.step_range <- function(x, training, info = NULL, ...) {
col_names <- eval_select_recipes(x$terms, training, info)
check_type(training[, col_names])
mins <-
vapply(training[, col_names], min, c(min = 0), na.rm = TRUE)
maxs <-
vapply(training[, col_names], max, c(max = 0), na.rm = TRUE)
step_range_new(
terms = x$terms,
role = x$role,
trained = TRUE,
min = x$min,
max = x$max,
ranges = rbind(mins, maxs),
skip = x$skip,
id = x$id
)
}
#' @export
bake.step_range <- function(object, new_data, ...) {
tmp <- as.matrix(new_data[, colnames(object$ranges)])
tmp <- sweep(tmp, 2, object$ranges[1, ], "-")
tmp <- tmp * (object$max - object$min)
tmp <- sweep(tmp, 2, object$ranges[2, ] - object$ranges[1, ], "/")
tmp <- tmp + object$min
tmp[tmp < object$min] <- object$min
tmp[tmp > object$max] <- object$max
tmp <- tibble::as_tibble(tmp)
new_data[, colnames(object$ranges)] <- tmp
as_tibble(new_data)
}
print.step_range <-
function(x, width = max(20, options()$width - 30), ...) {
cat("Range scaling to [", x$min, ",", x$max, "] for ", sep = "")
printer(colnames(x$ranges), x$terms, x$trained, width = width)
invisible(x)
}
#' @rdname step_range
#' @param x A `step_range` object.
#' @export
tidy.step_range <- function(x, ...) {
if (is_trained(x)) {
res <- tibble(terms = colnames(x$ranges),
min = x$ranges["mins",],
max = x$ranges["maxs",])
} else {
term_names <- sel2char(x$terms)
res <- tibble(terms = term_names,
min = na_dbl,
max = na_dbl)
}
res$id <- x$id
res
}
|