File: gafs.default.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/gafs.R
\name{gafs.default}
\alias{gafs.default}
\alias{gafs}
\alias{gafs.recipe}
\title{Genetic algorithm feature selection}
\usage{
\method{gafs}{default}(
  x,
  y,
  iters = 10,
  popSize = 50,
  pcrossover = 0.8,
  pmutation = 0.1,
  elite = 0,
  suggestions = NULL,
  differences = TRUE,
  gafsControl = gafsControl(),
  ...
)

\method{gafs}{recipe}(
  x,
  data,
  iters = 10,
  popSize = 50,
  pcrossover = 0.8,
  pmutation = 0.1,
  elite = 0,
  suggestions = NULL,
  differences = TRUE,
  gafsControl = gafsControl(),
  ...
)
}
\arguments{
\item{x}{An object where samples are in rows and features are in columns.
This could be a simple matrix, data frame or other type (e.g. sparse
matrix). For the recipe method, \code{x} is a recipe object. See Details below.}

\item{y}{a numeric or factor vector containing the outcome for each sample}

\item{iters}{number of search iterations}

\item{popSize}{number of subsets evaluated at each iteration}

\item{pcrossover}{the crossover probability}

\item{pmutation}{the mutation probability}

\item{elite}{the number of best subsets to survive at each generation}

\item{suggestions}{a binary matrix of subset strings to be included in the
initial population. If provided, the number of columns must match the number
of columns in \code{x}}

\item{differences}{a logical: should the difference in fitness values with
and without each predictor be calculated?}

\item{gafsControl}{a list of values that define how this function acts. See
\code{\link{gafsControl}}.}

\item{...}{additional arguments to be passed to other methods}

\item{data}{Data frame from which the variables specified in the
\code{recipe} are to be taken.}
}
\value{
an object of class \code{gafs}
}
\description{
Supervised feature selection using genetic algorithms
}
\details{
\code{\link{gafs}} conducts a supervised binary search of the predictor
space using a genetic algorithm. See Mitchell (1996) and Scrucca (2013) for
more details on genetic algorithms.

This function conducts the search of the feature space repeatedly within
resampling iterations. First, the training data are split by whatever
resampling method was specified in the control function. For example, if
10-fold cross-validation is selected, the entire genetic algorithm is
conducted 10 separate times. For the first fold, nine tenths of the data
are used in the search while the remaining tenth is used to estimate the
external performance, since these data points were not used in the search.
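
For example, a minimal sketch of requesting external 10-fold
cross-validation, using the built-in \code{rfGA} helper functions:

\preformatted{
## Each of the 10 external folds will host a complete GA run
ctrl <- gafsControl(functions = rfGA,
                    method = "cv",
                    number = 10)
}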

During the genetic algorithm, a measure of fitness is needed to guide the
search. This is the internal measure of performance. During the search, the
data that are available are the instances selected by the top-level
resampling (e.g. the nine tenths mentioned above). A common approach is to
conduct another resampling procedure. Another option is to use a holdout set
of samples to determine the internal estimate of performance (see the
holdout argument of the control function). While this is faster, it is more
likely to overfit the feature set and should only be used when a large
amount of training data are available. Yet another idea is to use a
penalized metric (such as the AIC statistic), but such a penalty may not
exist for some metrics (e.g. the area under the ROC curve).
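
As an illustration, a sketch of holding back a random 20\% of the
resampled data for the internal estimate, via the \code{holdout} argument
of \code{\link{gafsControl}}:

\preformatted{
## Within each external resample, hold back 20 percent of the
## available samples to compute the internal fitness value
ctrl <- gafsControl(functions = rfGA,
                    method = "cv",
                    number = 10,
                    holdout = 0.2)
}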

The internal estimates of performance will eventually overfit the subsets to
the data. However, since the external estimate is not used by the search, it
is able to make better assessments of overfitting. After resampling, this
function determines the optimal number of generations for the GA.

Finally, the entire data set is used in the last execution of the genetic
algorithm search and the final model is built on the predictor subset that
is associated with the optimal number of generations determined by
resampling (although the update function can be used to manually set the
number of generations).
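
For example, a sketch of overriding the resampling-optimal number of
generations with \code{\link{update.gafs}}, assuming a fitted object such
as \code{rf_search} from the examples below:

\preformatted{
## Refit the final model using the best subset found at
## generation 5 rather than the resampled optimum
rf_search2 <- update(rf_search,
                     iter = 5,
                     x = train_data[, -ncol(train_data)],
                     y = train_data$Class)
}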

This is an example of the output produced when \code{gafsControl(verbose =
TRUE)} is used:

\preformatted{
Fold2 1 0.715 (13)
Fold2 2 0.715->0.737 (13->17, 30.4\%) *
Fold2 3 0.737->0.732 (17->14, 24.0\%)
Fold2 4 0.737->0.769 (17->23, 25.0\%) *
}

For the second resample (e.g. fold 2), the best subset across all
individuals tested in the first generation contained 13 predictors and was
associated with a fitness value of 0.715. The second generation produced a
better subset containing 17 predictors with an associated fitness value of
0.737 (an improvement is symbolized by the \code{*}). The percentage listed
is the Jaccard similarity between the previous best individual (with 13
predictors) and the new best. The third generation did not produce a better
fitness value but the fourth generation did.

The search algorithm can be parallelized in several places:
\enumerate{
\item each externally resampled GA can be run independently (controlled by
the \code{allowParallel} option of \code{\link{gafsControl}})
\item within a GA, the fitness calculations at a particular generation can
be run in parallel over the current set of individuals (see the
\code{genParallel} option in \code{\link{gafsControl}})
\item if inner resampling is used, these can be run in parallel (the
controls depend on the function used; see, for example,
\code{\link[caret]{trainControl}})
\item any parallelization of the individual model fits, which is also
specific to the modeling function.
}

It is probably best to pick one of these areas for parallelization. The
first is likely to produce the largest decrease in run-time since it is the
least likely to incur multiple re-starts of the worker processes (see the
sketch below). Keep in mind that if multiple levels of parallelization
occur, the number of workers and the amount of memory required can grow
multiplicatively.
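
For example, a minimal sketch of parallelizing only the external resamples
with the \code{doParallel} backend:

\preformatted{
library(doParallel)
cl <- makePSOCKcluster(5)
registerDoParallel(cl)

## Run each external GA on its own worker; keep the
## within-generation fitness calculations sequential
ctrl <- gafsControl(functions = rfGA,
                    method = "cv",
                    number = 10,
                    allowParallel = TRUE,
                    genParallel = FALSE)

## ... fit with gafs() here ...

stopCluster(cl)
}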
}
\examples{

\dontrun{
set.seed(1)
train_data <- twoClassSim(100, noiseVars = 10)
test_data  <- twoClassSim(10,  noiseVars = 10)

## A short example
ctrl <- gafsControl(functions = rfGA,
                    method = "cv",
                    number = 3)

rf_search <- gafs(x = train_data[, -ncol(train_data)],
                  y = train_data$Class,
                  iters = 3,
                  gafsControl = ctrl)

rf_search
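
## Use the (otherwise unused) test set to get predictions from the
## final model; pass the predictors only
predict(rf_search, test_data[, -ncol(test_data)])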
}

}
\references{
Kuhn M and Johnson K (2013), Applied Predictive Modeling,
Springer, Chapter 19 \url{http://appliedpredictivemodeling.com}

Scrucca L (2013). GA: A Package for Genetic Algorithms in R. Journal of
Statistical Software, 53(4), 1-37. \url{https://www.jstatsoft.org/article/view/v053i04}

Mitchell M (1996), An Introduction to Genetic Algorithms, MIT Press.

\url{https://en.wikipedia.org/wiki/Jaccard_index}
}
\seealso{
\code{\link{gafsControl}}, \code{\link{predict.gafs}},
\code{\link{caretGA}}, \code{\link{rfGA}}, \code{\link{treebagGA}}
}
\author{
Max Kuhn, Luca Scrucca (for GA internals)
}
\keyword{models}