% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/vfold.R
\name{vfold_cv}
\alias{vfold_cv}
\title{V-Fold Cross-Validation}
\usage{
vfold_cv(data, v = 10, repeats = 1, strata = NULL, breaks = 4, pool = 0.1, ...)
}
\arguments{
\item{data}{A data frame.}

\item{v}{The number of partitions of the data set.}

\item{repeats}{The number of times to repeat the V-fold partitioning.}

\item{strata}{A variable in \code{data} (single character or name) used to conduct
stratified sampling. When not \code{NULL}, each resample is created within the
stratification variable. Numeric \code{strata} are binned into quartiles.}

\item{breaks}{A single number giving the number of bins desired to stratify a
numeric stratification variable.}

\item{pool}{A proportion of data used to determine if a particular group is
too small and should be pooled into another group. We do not recommend
decreasing this argument below its default of 0.1 because of the dangers
of stratifying groups that are too small.}

\item{...}{These dots are for future extensions and must be empty.}
}
\value{
A tibble with classes \code{vfold_cv}, \code{rset}, \code{tbl_df}, \code{tbl}, and
\code{data.frame}. The results include a column for the data split objects and
one or more identification variables. For a single repeat, there will be
one column called \code{id} that has a character string with the fold identifier.
With more than one repeat, \code{id} identifies the repeat and an additional
column called \code{id2} contains the fold information (within repeat).
}
\description{
V-fold cross-validation (also known as k-fold cross-validation) randomly
splits the data into V groups of roughly equal size (called "folds"). A
resample uses V-1 of the folds as the analysis set, while the
assessment set contains the remaining fold. In basic V-fold cross-validation
(i.e. no repeats), the number of resamples is equal to V.
}
\details{
With more than one repeat, the basic V-fold cross-validation is
conducted each time. For example, if three repeats are used with \code{v = 10},
there are a total of 30 splits: three groups of 10 that are generated
separately.

With a \code{strata} argument, the random sampling is conducted
\emph{within the stratification variable}. This can help ensure that the
resamples have roughly the same class proportions as the original data set. For
a categorical variable, sampling is conducted separately within each class.
For a numeric stratification variable, \code{strata} is binned into quartiles,
which are then used to stratify. Strata below 10\% of the total are
pooled together; see \code{\link[=make_strata]{make_strata()}} for more details.
}
\examples{
\dontshow{if (rlang::is_installed("modeldata")) (if (getRversion() >= "3.4") withAutoprint else force)(\{ # examplesIf}
vfold_cv(mtcars, v = 10)
vfold_cv(mtcars, v = 10, repeats = 2)
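
# A minimal sketch of the structure described in the documentation above.
# The object names (folds_rep, first_split) are illustrative only.
# Three repeats of 10 folds should give 30 splits, labelled by `id`
# (the repeat) and `id2` (the fold within that repeat).
set.seed(13)
folds_rep <- vfold_cv(mtcars, v = 10, repeats = 3)
nrow(folds_rep)
unique(folds_rep$id)

# For any one split, the analysis set holds V-1 of the folds and the
# assessment set holds the remaining fold, so the two sizes add up to
# the number of rows in the original data.
first_split <- folds_rep$splits[[1]]
nrow(analysis(first_split))
nrow(assessment(first_split))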

library(purrr)
data(wa_churn, package = "modeldata")

set.seed(13)
folds1 <- vfold_cv(wa_churn, v = 5)
map_dbl(
  folds1$splits,
  function(x) {
    dat <- as.data.frame(x)$churn
    mean(dat == "Yes")
  }
)

set.seed(13)
folds2 <- vfold_cv(wa_churn, strata = churn, v = 5)
map_dbl(
  folds2$splits,
  function(x) {
    dat <- as.data.frame(x)$churn
    mean(dat == "Yes")
  }
)

set.seed(13)
folds3 <- vfold_cv(wa_churn, strata = tenure, breaks = 6, v = 5)
map_dbl(
  folds3$splits,
  function(x) {
    dat <- as.data.frame(x)$churn
    mean(dat == "Yes")
  }
)
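
# A rough look at the binning behind the numeric stratification above,
# assuming make_strata() is called directly with the same `breaks` value.
# It returns a factor of bins; the table shows how many rows fall in each.
# The name tenure_bins is illustrative only.
tenure_bins <- make_strata(wa_churn$tenure, breaks = 6)
table(tenure_bins)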
\dontshow{\}) # examplesIf}
}