% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/createDataPartition.R
\name{createDataPartition}
\alias{createDataPartition}
\alias{createFolds}
\alias{createMultiFolds}
\alias{createResample}
\alias{createTimeSlices}
\title{Data Splitting functions}
\usage{
createDataPartition(y, times = 1, p = 0.5, list = TRUE, groups = min(5,
  length(y)))

createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)

createTimeSlices(y, initialWindow, horizon = 1, fixedWindow = TRUE,
  skip = 0)
}
\arguments{
\item{y}{a vector of outcomes. For \code{createTimeSlices}, these should be
in chronological order.}

\item{times}{the number of partitions to create}

\item{p}{the proportion of data (a value between 0 and 1) that goes to
training}

\item{list}{a logical: should the results be returned as a list
(\code{TRUE}) or as a matrix (\code{FALSE}) with the number of rows equal to
\code{floor(p * length(y))} and \code{times} columns?}

\item{groups}{for numeric \code{y}, the number of breaks in the quantiles
(see below)}

\item{k}{an integer for the number of folds.}

\item{returnTrain}{a logical. When \code{TRUE}, the values returned are the
sample positions corresponding to the data used during training. This
argument only works in conjunction with \code{list = TRUE}.}

\item{initialWindow}{The initial number of consecutive values in each
training set sample}

\item{horizon}{The number of consecutive values in each test set sample}

\item{fixedWindow}{A logical: if \code{FALSE}, the training set always starts
at the first sample.}

\item{skip}{An integer specifying how many (if any) resamples to skip to
thin the total amount.}
}
\value{
A list or matrix of row position integers corresponding to the
training data
}
\description{
A series of test/training partitions are created using
\code{createDataPartition} while \code{createResample} creates one or more
bootstrap samples. \code{createFolds} splits the data into \code{k} groups
while \code{createTimeSlices} creates cross-validation sample information to
be used with time series data.
}
\details{
For bootstrap samples, simple random sampling is used.

For other data splitting, the random sampling is done within the levels of
\code{y} when \code{y} is a factor in an attempt to balance the class
distributions within the splits.

For numeric \code{y}, the sample is split into groups based on percentiles
and sampling is done within these subgroups. For
\code{createDataPartition}, the number of percentiles is set via the
\code{groups} argument. For \code{createFolds} and \code{createMultiFolds},
the number of groups is set dynamically based on the sample size and
\code{k}. For smaller sample sizes, these two functions may not do
stratified splitting and, at most, will split the data into quartiles.

Also, for \code{createDataPartition}, classes with very small sizes (<= 3)
may not show up in both the training and test data.

For multiple k-fold cross-validation, completely independent folds are
created.  The names of the list objects will denote the fold membership
using the pattern "Foldi.Repj" meaning the ith section (of k) of the jth
cross-validation set (of \code{times}). Note that this function calls
\code{createFolds} with \code{list = TRUE} and \code{returnTrain = TRUE}.

Hyndman and Athanasopoulos (2013) discuss rolling forecasting origin
techniques that move the training and test sets in time.
\code{createTimeSlices} can create the indices for this type of splitting.
}
\examples{

data(oil)
createDataPartition(oilType, 2)

x <- rgamma(50, 3, .5)
inA <- createDataPartition(x, list = FALSE)

plot(density(x[inA]))
rug(x[inA])

points(density(x[-inA]), type = "l", col = 4)
rug(x[-inA], col = 4)
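
# A hedged sketch of the stratification for numeric outcomes. The binning
# below is an assumption that mirrors the description in Details (quantile
# breaks controlled by groups); it is not taken from the package source.
# y_num, in_train and bins are illustrative names.
library(caret)
set.seed(1)
y_num <- rnorm(100)
in_train <- createDataPartition(y_num, p = 0.75, list = FALSE, groups = 5)

# Five quantile breaks define four bins
bins <- cut(y_num, unique(quantile(y_num, probs = seq(0, 1, length.out = 5))),
            include.lowest = TRUE)

# Share of each bin assigned to training; each should be close to p
table(bins[in_train[, 1]]) / table(bins)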

createResample(oilType, 2)
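
# Sketch: each bootstrap sample holds length(y) row positions drawn with
# replacement, so positions that were never drawn can act as a held-out set.
# y_cls, boot and held_out are illustrative names, not part of caret.
library(caret)
set.seed(2)
y_cls <- factor(rep(c("a", "b", "c"), each = 10))
boot <- createResample(y_cls, times = 3)

# Every resample is as long as y
lengths(boot)

# Positions left out of each resample
held_out <- lapply(boot, function(idx) setdiff(seq_along(y_cls), idx))
lengths(held_out)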

createFolds(oilType, 10)
createFolds(oilType, 5, FALSE)
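
# Sketch of the returnTrain argument: with the same seed, the default call
# returns the held-out positions of each fold and returnTrain = TRUE returns
# the complementary training positions. y_two, test_folds and train_folds
# are illustrative names.
library(caret)
y_two <- factor(rep(c("a", "b"), each = 20))

set.seed(3)
test_folds <- createFolds(y_two, k = 5)
set.seed(3)
train_folds <- createFolds(y_two, k = 5, returnTrain = TRUE)

# The held-out folds together cover every row position exactly once
all(sort(unlist(test_folds, use.names = FALSE)) == seq_along(y_two))

# Training and held-out positions of the same fold do not overlap
mapply(function(te, tr) length(intersect(te, tr)), test_folds, train_folds)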

createFolds(rnorm(21))
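
# Sketch of repeated k-fold splitting with createMultiFolds, which Details
# describes as returning training positions named "Foldi.Repj". y_rep and
# multi are illustrative names.
library(caret)
set.seed(4)
y_rep <- factor(rep(c("a", "b"), each = 20))
multi <- createMultiFolds(y_rep, k = 5, times = 2)

# One list element per fold and repeat
names(multi)

# Each element holds the training positions, so it omits one fold
lengths(multi)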

createTimeSlices(1:9, 5, 1, fixedWindow = FALSE)
createTimeSlices(1:9, 5, 1, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = FALSE)

createTimeSlices(1:15, 5, 3)
createTimeSlices(1:15, 5, 3, skip = 2)
createTimeSlices(1:15, 5, 3, skip = 3)
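
# Sketch of a rolling forecasting origin evaluation built on the slices.
# The naive "repeat the last training value" forecast and the names series,
# slices and abs_err are assumptions made for illustration only.
library(caret)
series <- c(5, 7, 6, 8, 9, 11, 10, 12, 13, 15)
slices <- createTimeSlices(series, initialWindow = 4, horizon = 2,
                           fixedWindow = TRUE)

# First training window and the test window that follows it
slices$train[[1]]
slices$test[[1]]

# Mean absolute error of the naive forecast for each slice
abs_err <- mapply(function(tr, te) {
  last_value <- series[tr[length(tr)]]
  mean(abs(series[te] - last_value))
}, slices$train, slices$test)
abs_err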

}
\author{
Max Kuhn, \code{createTimeSlices} by Tony Cooper
}
\references{
\url{http://topepo.github.io/caret/splitting.html}

Hyndman and Athanasopoulos (2013), Forecasting: principles and practice.
\url{https://www.otexts.org/fpp}
}
\keyword{utilities}