File: safs.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/safs.R
\name{safs}
\alias{safs}
\alias{safs.default}
\alias{safs.recipe}
\title{Simulated annealing feature selection}
\usage{
safs(x, ...)

\method{safs}{default}(x, y, iters = 10, differences = TRUE,
  safsControl = safsControl(), ...)

\method{safs}{recipe}(x, data, iters = 10, differences = TRUE,
  safsControl = safsControl(), ...)
}
\arguments{
\item{x}{An object where samples are in rows and features are in columns.
This could be a simple matrix, data frame or other type (e.g. sparse
matrix). For the recipes method, \code{x} is a recipe object.  See Details below.}

\item{\dots}{arguments passed to the classification or regression routine
specified by \code{safsControl$functions$fit}.}

\item{y}{a numeric or factor vector containing the outcome for each sample.}

\item{iters}{number of search iterations}

\item{differences}{a logical: should the difference in fitness values with
and without each predictor be calculated?}

\item{safsControl}{a list of values that define how this function acts. See
\code{\link{safsControl}} and the URLs in the References section below.}

\item{data}{for the recipe method, a data frame containing the variables
used in the recipe given in \code{x}.}
}
\value{
an object of class \code{safs}
}
\description{
Supervised feature selection using simulated annealing

\code{\link{safs}} conducts a supervised binary search of the predictor
space using simulated annealing (SA). See Kirkpatrick et al (1983) for more
information on this search algorithm.
}
\details{
This function conducts the search of the feature space repeatedly within
resampling iterations. First, the training data are split by whatever
resampling method was specified in the control function. For example, if
10-fold cross-validation is selected, the entire simulated annealing search
is conducted 10 separate times. For the first fold, nine tenths of the data
are used in the search while the remaining tenth is used to estimate the
external performance since these data points were not used in the search.
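
For example, a minimal sketch of this external resampling setup (the
\code{rfSA} helper functions and the fold count here are illustrative
choices, as in the Examples section below):

\preformatted{
## Each of the 10 folds runs a complete SA search; the held-out
## tenth scores the subsets found in that fold
ctrl <- safsControl(functions = rfSA,
                    method = "cv",
                    number = 10)
}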

During the search, a measure of fitness (i.e., the SA energy value) is
needed to guide the search. This is the internal measure of performance.
During the
search, the data that are available are the instances selected by the
top-level resampling (e.g. the nine tenths mentioned above). A common
approach is to conduct another resampling procedure. Another option is to
use a holdout set of samples to determine the internal estimate of
performance (see the holdout argument of the control function). While this
is faster, it is more likely to cause overfitting of the features and should
only be used when a large amount of training data are available. Yet another
idea is to use a penalized metric (such as the AIC statistic) but this may
not exist for some metrics (e.g. the area under the ROC curve).
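
As a sketch, the holdout approach mentioned above is requested through the
control function (the 25\% proportion here is an arbitrary illustration):

\preformatted{
## Hold back a random quarter of the searched data to compute the
## internal (SA energy) fitness values instead of inner resampling
ctrl <- safsControl(functions = rfSA,
                    method = "cv",
                    number = 10,
                    holdout = 0.25)
}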

The internal estimates of performance will eventually overfit the subsets to
the data. However, since the external estimate is not used by the search, it
is able to make better assessments of overfitting. After resampling, this
function determines the optimal number of iterations for the SA.

Finally, the entire data set is used in the last execution of the simulated
annealing algorithm search and the final model is built on the predictor
subset that is associated with the optimal number of iterations determined
by resampling (although the \code{\link{update.safs}} function can be used
to manually set the number of iterations).
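
As a sketch of that manual override (the iteration number used here is
arbitrary and \code{rf_search} is the fitted object from the Examples
section):

\preformatted{
## Rebuild the final model from the subset found at iteration 5
## rather than the iteration chosen by resampling
new_fit <- update(rf_search, iter = 5,
                  x = train_data[, -ncol(train_data)],
                  y = train_data$Class)
}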

This is an example of the output produced when \code{safsControl(verbose =
TRUE)} is used:

\preformatted{
Fold03 1 0.401 (11)
Fold03 2 0.401->0.410 (11+1, 91.7\%) *
Fold03 3 0.410->0.396 (12+1, 92.3\%) 0.969 A
Fold03 4 0.410->0.370 (12+2, 85.7\%) 0.881
Fold03 5 0.410->0.399 (12+2, 85.7\%) 0.954 A
Fold03 6 0.410->0.399 (12+1, 78.6\%) 0.940 A
Fold03 7 0.410->0.428 (12+2, 73.3\%) *
}

The text "Fold03" indicates that this search is for the third
cross-validation fold. The initial subset of 11 predictors had a fitness
value of 0.401. The next iteration added a single feature to the existing
best subset of 11 (as indicated by "11+1") that increased the fitness value
to 0.410. This new solution, which has a Jaccard similarity value of 91.7\%
to the current best solution, is automatically accepted. The third iteration
adds another feature to the current set of 12 but does not improve the
fitness. The acceptance probability for this difference is shown to be
96.9\% and the "A" indicates that this new sub-optimal subset is accepted.
The fourth iteration does not show an increase and is not accepted. Note
that the Jaccard similarity value of 85.7\% is the similarity to the current
best solution (from iteration 2) and the "12+2" indicates that there are two
additional features added to the current best subset, which contains 12 predictors.

The search algorithm can be parallelized in several places: \enumerate{
\item each externally resampled SA can be run independently (controlled by
the \code{allowParallel} option of \code{\link{safsControl}}) \item if inner
resampling is used, these can be run in parallel (controls depend on the
function used. See, for example, \code{\link[caret]{trainControl}}) \item
any parallelization of the individual model fits. This is also specific to
the modeling function.  }

It is probably best to pick one of these areas for parallelization; the
first is likely to produce the largest decrease in run-time since it is the
least likely to incur repeated re-starting of the worker processes. Keep in
mind that if multiple levels of parallelization occur, the number of
workers and the amount of memory required can grow multiplicatively.
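
As a sketch of the first option (the \pkg{doParallel} backend and worker
count are illustrative; any registered \pkg{foreach} backend can be used):

\preformatted{
## With safsControl(allowParallel = TRUE) (the default), each external
## resample of the SA search runs on a separate worker
library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)

## ... call safs() as usual ...

stopCluster(cl)
}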
}
\examples{

\dontrun{

set.seed(1)
train_data <- twoClassSim(100, noiseVars = 10)
test_data  <- twoClassSim(10,  noiseVars = 10)

## A short example
ctrl <- safsControl(functions = rfSA,
                    method = "cv",
                    number = 3)

rf_search <- safs(x = train_data[, -ncol(train_data)],
                  y = train_data$Class,
                  iters = 3,
                  safsControl = ctrl)

rf_search
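
## A hedged sketch: predictions for the held-out simulated samples
## via predict.safs
predict(rf_search, test_data)

## A hedged sketch of the recipe interface (requires the recipes package)
library(recipes)
rec <- recipe(Class ~ ., data = train_data)
rec_search <- safs(rec, data = train_data, iters = 3, safsControl = ctrl)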
}

}
\references{
\url{http://topepo.github.io/caret/feature-selection-using-genetic-algorithms.html}

\url{http://topepo.github.io/caret/feature-selection-using-simulated-annealing.html}

Kuhn and Johnson (2013), Applied Predictive Modeling, Springer

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by
simulated annealing. Science, 220(4598), 671-680.
}
\seealso{
\code{\link{safsControl}}, \code{\link{predict.safs}}
}
\author{
Max Kuhn
}
\keyword{models}