% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/learning_curve.R
\name{learning_curve_dat}
\alias{learning_curve_dat}
\title{Create Data to Plot a Learning Curve}
\usage{
learning_curve_dat(
  dat,
  outcome = NULL,
  proportion = (1:10)/10,
  test_prop = 0,
  verbose = TRUE,
  ...
)
}
\arguments{
\item{dat}{the training data}
\item{outcome}{a character string identifying the outcome column name}
\item{proportion}{the incremental proportions of the training set that are
used to fit the model}
\item{test_prop}{an optional proportion of the data to be used to measure
performance.}
\item{verbose}{a logical to print logs to the screen as models are fit}
\item{\dots}{options to pass to \code{\link{train}} to specify the model.
These should not include \code{x}, \code{y}, \code{formula}, or \code{data}.
If \code{trainControl} is used here, do not use \code{method = "none"}.}
}
\value{
a data frame with columns for each performance metric calculated by
\code{\link{train}} as well as columns: \item{Training_Size }{the number of
data points used in the current model fit} \item{Data }{which data were used
to calculate performance. Values are "Resampling", "Training", and
(optionally) "Testing"} In the results, each data set size will have one row
for the apparent error rate, one row for the test set results (if used), and
as many rows as resamples (e.g., 10 rows if 10-fold CV is used).
}
\description{
For a given model, this function fits several versions on different sizes of
the total training set and returns the results.
}
\details{
This function creates a data set that can be used to plot how well the model
performs over different-sized versions of the training set. For each data
set size, the performance metrics are determined and saved. If
\code{test_prop == 0}, the apparent measure of performance (i.e.
re-predicting the training set) and the resampled estimate of performance
are available. Otherwise, the test set results are also added.
If the model being fit has tuning parameters, the results are based on the
optimal settings determined by \code{\link{train}}.
}
\examples{
\dontrun{
set.seed(1412)
class_dat <- twoClassSim(1000)
ctrl <- trainControl(classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(29510)
lda_data <-
  learning_curve_dat(dat = class_dat,
                     outcome = "Class",
                     test_prop = 1/4,
                     ## `train` arguments:
                     method = "lda",
                     metric = "ROC",
                     trControl = ctrl)

ggplot(lda_data, aes(x = Training_Size, y = ROC, color = Data)) +
  geom_smooth(method = loess, span = .8) +
  theme_bw()
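
## A hypothetical variant (not part of the original example): with the
## default `test_prop = 0`, only the apparent and resampled performance
## estimates are computed, and `proportion` controls which fractions of
## the training set are used. The object name `lda_resamp_data` is
## illustrative.
set.seed(29510)
lda_resamp_data <-
  learning_curve_dat(dat = class_dat,
                     outcome = "Class",
                     proportion = (1:5)/5,
                     ## `train` arguments:
                     method = "lda",
                     metric = "ROC",
                     trControl = ctrl)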
}
}
\seealso{
\code{\link{train}}
}
\author{
Max Kuhn
}
\keyword{models}