1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
|
% \VignetteIndexEntry{ A Short Introduction to the caret Package}
% \VignetteDepends{caret}
% \VignettePackage{caret}
\documentclass[12pt]{article}
\usepackage{colortbl}
\usepackage{amsmath}
\usepackage[pdftex]{graphicx}
\usepackage{color}
\usepackage{xspace}
\usepackage{fancyvrb}
\usepackage{fancyhdr}
\usepackage[
colorlinks=true,
linkcolor=blue,
citecolor=blue,
urlcolor=blue]
{hyperref}
\usepackage{Sweave}
\SweaveOpts{keep.source=TRUE}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% define new colors for use
\definecolor{darkgreen}{rgb}{0,0.6,0}
\definecolor{darkred}{rgb}{0.6,0.0,0}
\definecolor{lightbrown}{rgb}{1,0.9,0.8}
\definecolor{brown}{rgb}{0.6,0.3,0.3}
\definecolor{darkblue}{rgb}{0,0,0.8}
\definecolor{darkmagenta}{rgb}{0.5,0,0.5}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\bld}[1]{\mbox{\boldmath $#1$}}
\newcommand{\shell}[1]{\mbox{$#1$}}
\renewcommand{\vec}[1]{\mbox{\bf {#1}}}
\newcommand{\codeheading}[1]{\mbox{\color{darkblue}\texttt{#1}}}
\newcommand{\code}[1]{\mbox{\footnotesize\color{darkblue}\texttt{#1}}}
\newcommand{\pkg}[1]{{\fontseries{b}\selectfont #1}}
\renewcommand{\pkg}[1]{{\textsf{#1}}}
\newcommand{\ReallySmallSpacing}{\renewcommand{\baselinestretch}{.6}\Large\normalsize}
\newcommand{\SmallSpacing}{\renewcommand{\baselinestretch}{1.1}\Large\normalsize}
\providecommand{\SetAlgoLined}{\SetLine}
\newcommand{\halfs}{\frac{1}{2}}
\setlength{\oddsidemargin}{-.25 truein}
\setlength{\evensidemargin}{0truein}
\setlength{\topmargin}{-0.2truein}
\setlength{\textwidth}{7 truein}
\setlength{\textheight}{8.5 truein}
\setlength{\parindent}{0truein}
\setlength{\parskip}{0.10truein}
\DefineVerbatimEnvironment{Sinput}{Verbatim}{fontshape=sl,formatcom=\color{darkblue}}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}
\fvset{fontsize=\footnotesize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\pagestyle{fancy}
\lhead{}
\chead{The {\tt caret} Package}
\rhead{}
\lfoot{}
\cfoot{}
\rfoot{\thepage}
\renewcommand{\headrulewidth}{1pt}
\renewcommand{\footrulewidth}{1pt}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\title{A Short Introduction to the \pkg{caret} Package}
\author{Max Kuhn \\ max.kuhn@pfizer.com}
\begin{document}
\SweaveOpts{concordance=TRUE}
\maketitle
\thispagestyle{empty}
\vspace{.2in}
\renewcommand{\baselinestretch}{1}
<<loadLibs, results = hide, echo = FALSE>>=
library(MASS)
library(caret)
library(mlbench)
data(Sonar)
library(pls)
options(useFancyQuotes = FALSE)
getInfo <- function(what = "Suggests")
{
text <- packageDescription("caret")[what][[1]]
text <- gsub("\n", ", ", text, fixed = TRUE)
text <- gsub(">=", "$\\\\ge$", text, fixed = TRUE)
eachPkg <- strsplit(text, ", ", fixed = TRUE)[[1]]
eachPkg <- gsub(",", "", eachPkg, fixed = TRUE)
#out <- paste("\\\\pkg{", eachPkg[order(tolower(eachPkg))], "}", sep = "")
#paste(out, collapse = ", ")
length(eachPkg)
}
@
The \pkg{caret} package (short for
{\bf{\color{blue}{c}}}lassification {\bf{\color{blue}{a}}}nd
{\bf{\color{blue}{re}}}gression {\bf{\color{blue}{t}}}raining)
contains functions to streamline the model training process for
complex regression and classification problems. The package utilizes a
number of R packages but tries not to load them all at package
start-up\footnote{By adding formal package dependencies, the package
startup time can be greatly decreased}. The package ``suggests''
field includes \Sexpr{getInfo("Suggests")} packages. \pkg{caret} loads
packages as needed and assumes that they are installed. Install
\pkg{caret} using
<<install, eval = FALSE>>=
install.packages("caret", dependencies = c("Depends", "Suggests"))
@
to ensure that all the needed packages are installed.
The {\bf main help pages} for the package are at:
\begin{center}
\url{http://caret.r-forge.r-project.org/}
\end{center}
Here, there are extended examples and a large amount of information
that previously found in the package vignettes.
\pkg{caret} has several functions that attempt to streamline the model building and evaluation process, as well as feature selection and other techniques.
One of the primary tools in the package is the \code{train} function which can be used to
\begin{itemize}
\item evaluate, using resampling, the effect of model tuning parameters on performance
\item choose the ``optimal'' model across these parameters
\item estimate model performance from a training set
\end{itemize}
More formally:
\begin{center}
\includegraphics[clip, width = .9\textwidth]{train_algo}
\label{A:tune}
\end{center}
There are options for customizing almost every step of this process (e.g. resampling technique, choosing the optimal parameters etc). To demonstrate this function, the Sonar data from the \pkg{mlbench} package will be used.
The Sonar data consist of \Sexpr{nrow(Sonar)} data points collected on \Sexpr{ncol(Sonar)-1} predictors. The goal is to predict the two classes (\texttt{M} for metal cylinder or \texttt{R} for rock).
First, we split the data into two groups: a training set and a test set. To do this, the \code{createDataPartition} function is used:
<<SonarSplit>>=
library(caret)
library(mlbench)
data(Sonar)
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class,
## the outcome data are needed
p = .75,
## The percentage of data in the
## training set
list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Sonar
## that belong in the training set.
str(inTrain)
@
By default, \code{createDataPartition} does a stratified random split of the data. To partition the data:
<<SonarDatasets>>=
training <- Sonar[ inTrain,]
testing <- Sonar[-inTrain,]
nrow(training)
nrow(testing)
@
To tune a model using Algorithm \ref{A:tune}, the \code{train} function can be used. More details on this function can be found at:
\begin{center}
\url{http://caret.r-forge.r-project.org/training.html}
\end{center}
Here, a partial least squares discriminant analysis (PLSDA) model will be tuned over the number of PLS components that should be retained. The most basic syntax to do this is:
<<plsTune1, eval = FALSE>>=
plsFit <- train(Class ~ .,
data = training,
method = "pls",
## Center and scale the predictors for the training
## set and all future samples.
preProc = c("center", "scale"))
@
However, we would probably like to customize it in a few ways:
\begin{itemize}
\item expand the set of PLS models that the function evaluates. By default, the function will tune over three values of each tuning parameter.
\item the type of resampling used. The simple bootstrap is used by default. We will have the function use three repeats of 10--fold cross--validation.
\item the methods for measuring performance. If unspecified, overall accuracy and the Kappa statistic are computed. For regression models, root mean squared error and $R^2$ are computed. Here, the function will be altered to estimate the area under the ROC curve, the sensitivity and specificity
\end{itemize}
To change the candidate values of the tuning parameter, either of the \code{tuneLength} or \code{tuneGrid} arguments can be used. The \code{train} function can generate a candidate set of parameter values and the \code{tuneLength} argument controls how many are evaluated. In the case of PLS, the function uses a sequence of integers from 1 to \code{tuneLength}. If we want to evaluate all integers between 1 and 15, setting \code{tuneLength = 15} would achieve this. The \code{tuneGrid} argument is used when specific values are desired. A data frame is used where each row is a tuning parameter setting and each column is a tuning parameter. An example is used below to illustrate this.
The syntax for the model would then be:
\begin{Verbatim}[fontshape=sl,formatcom=\color{darkblue},fontsize=\footnotesize,commandchars=\\\{\}]
> plsFit <- train(Class ~ .,
+ data = training,
+ method = "pls",
+ \textcolor{red}{tuneLength = 15},
+ preProc = c("center", "scale"))
\end{Verbatim}
To modify the resampling method, a \code{trainControl} function is used. The option \code{method} controls the type of resampling and defaults to \code{"boot"}. Another method, \code{"repeatedcv"}, is used to specify repeated $K$--fold cross--validation (and the argument \code{repeats} controls the number of repetitions). $K$ is controlled by the \code{number} argument and defaults to 10. The new syntax is then:
\begin{Verbatim}[fontshape=sl,formatcom=\color{darkblue},fontsize=\footnotesize,commandchars=\\\{\}]
> \textcolor{red}{ctrl <- trainControl(method = "repeatedcv",}
+ \textcolor{red}{ repeats = 3)}
> plsFit <- train(Class ~ .,
+ data = training,
+ method = "pls",
+ tuneLength = 15,
+ \textcolor{red}{trControl = ctrl},
+ preProc = c("center", "scale"))
\end{Verbatim}
Finally, to choose different measures of performance, additional arguments are given to \code{trainControl}. The \code{summaryFunction} argument is used to pas in a function that takes the observed and predicted values and estimate some measure of performance. Two such functions are already included in the package: \code{defaultSummary} and \code{twoClassSummary}. The latter will compute measures specific to two--class problems, such as the area under the ROC curve, the sensitivity and specificity. Since the ROC curve is based on the predicted class probabilities (which are not computed automatically), another option is required. The \code{classProbs = TRUE} option is used to include these calculations.
Lastly, the function will pick the tuning parameters associated with the best results. Since we are using custom performance measures, the criterion that should be optimized must also be specified. In the call to \code{train}, we can use \code{metric = "ROC"} to do this.
The final model fit would then be:
\begin{Verbatim}[fontshape=sl,formatcom=\color{darkblue},fontsize=\footnotesize,commandchars=\\\{\}]
> set.seed(123)
> ctrl <- trainControl(method = "repeatedcv",
+ repeats = 3,
+ \textcolor{red}{ classProbs = TRUE},
+ \textcolor{red}{ summaryFunction = twoClassSummary})
> plsFit <- train(Class ~ .,
+ data = training,
+ method = "pls",
+ tuneLength = 15,
+ trControl = ctrl,
+ \textcolor{red}{metric = "ROC"},
+ preProc = c("center", "scale"))
\end{Verbatim}
<<plsFit, echo = FALSE>>=
ctrl <- trainControl(method = "repeatedcv",
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(123)
plsFit <- train(Class ~ .,
data = training,
method = "pls",
tuneLength = 15,
trControl = ctrl,
metric = "ROC",
preProc = c("center", "scale"))
@
<<plsPrint>>=
plsFit
@
In this output the grid of results are the average resampled estimates of performance. The note at the bottom tells the user that \Sexpr{plsFit$bestTune$.ncomp} PLS components were found to be optimal. Based on this value, a final PLS model is fit to the whole data set using this specification and this is the model that is used to predict future samples.
The package has several functions for visualizing the results. One method for doing this is the \code{plot} function for \code{train} objects. The command \code{plot(plsFit)} produced the results seen in Figure \ref{F:pls} and shows the relationship between the resampled performance values and the number of PLS components.
\setkeys{Gin}{width=.65\textwidth}
\begin{figure}
\begin{center}
<<baPlot, echo = FALSE, results = hide, fig = TRUE, width = 6, height = 4.25>>=
trellis.par.set(caretTheme())
print(plot(plsFit))
@
\caption{
\code{plot(plsFit)} shows
the relationship between the number of
PLS components and the resampled
estimate of the area under the ROC curve. }
\label{F:pls}
\end{center}
\end{figure}
To predict new samples, \code{predict.train} can be used. For classification models, the default behavior is to calculated the predicted class. Using the option \code{type = "prob"} can be used to compute class probabilities from the model. For example:
<<plsPred>>=
plsClasses <- predict(plsFit, newdata = testing)
str(plsClasses)
plsProbs <- predict(plsFit, newdata = testing, type = "prob")
head(plsProbs)
@
\pkg{caret} contains a function to compute the confusion matrix and associated statistics for the model fit:
<<plsCM>>=
confusionMatrix(data = plsClasses, testing$Class)
@
To fit an another model to the data, \code{train} can be invoked with minimal changes. Lists of models available can be found at:
\begin{center}
\url{http://caret.r-forge.r-project.org/modelList.html}
\url{http://caret.r-forge.r-project.org/bytag.html}
\end{center}
For example, to fit a regularized discriminant model to these data, the following syntax can be used:
<<rdaFit>>=
## To illustrate, a custom grid is used
rdaGrid = data.frame(gamma = (0:4)/4, lambda = 3/4)
set.seed(123)
rdaFit <- train(Class ~ .,
data = training,
method = "rda",
tuneGrid = rdaGrid,
trControl = ctrl,
metric = "ROC")
rdaFit
rdaClasses <- predict(rdaFit, newdata = testing)
confusionMatrix(rdaClasses, testing$Class)
@
How do these models compare in terms of their resampling results? The \code{resamples} function can be used to collect, summarize and contrast the resampling results. Since the random number seeds were initialized to the same value prior to calling \code{train}, the same folds were used for each model. To assemble them:
<<rs>>=
resamps <- resamples(list(pls = plsFit, rda = rdaFit))
summary(resamps)
@
There are several functions to visualize these results. For example, a Bland--Altman type plot can be created using \code{xyplot(resamps, what = "BlandAltman")} (see Figure \ref{F:BA}). The results look similar. Since, for each resample, there are paired results a paired $t$--test can be used to assess whether there is a difference in the average resampled area under the ROC curve. The \code{diff.resamples} function can be used to compute this:
<<diffs>>=
diffs <- diff(resamps)
summary(diffs)
@
Based on this analysis, the difference between the models is \Sexpr{round(diffs$statistics$ROC[[1]]$estimate, 3)} ROC units (the RDA model is slightly higher) and the two--sided $p$--value for this difference is \Sexpr{format.pval(diffs$statistics$ROC[[1]]$p.value)}.
\setkeys{Gin}{width=.65\textwidth}
\begin{figure}
\begin{center}
<<plsPlot, echo = FALSE, results = hide, fig = TRUE, width = 6, height = 4.25>>=
plotTheme <- caretTheme()
plotTheme$plot.symbol$col <- rgb(.2, .2, .2, .5)
plotTheme$plot.symbol$pch <- 16
trellis.par.set(plotTheme)
print(xyplot(resamps, what = "BlandAltman"))
@
\caption{A Bland--Altman plot of the resampled ROC values produced using
\code{xyplot(resamps, what = "BlandAltman")}. }
\label{F:BA}
\end{center}
\end{figure}
\end{document}
|