File: cv.nfeaturesLDA.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/cv.nfeaturesLDA.R
\name{cv.nfeaturesLDA}
\alias{cv.nfeaturesLDA}
\title{Cross-validation to find the optimum number of features (variables) in LDA}
\usage{
cv.nfeaturesLDA(
  data = matrix(rnorm(600), 60),
  cl = gl(3, 20),
  k = 5,
  cex.rg = c(0.5, 3),
  col.av = c("blue", "red"),
  ...
)
}
\arguments{
\item{data}{a data matrix containing the predictors in columns}

\item{cl}{a factor indicating the classification of the rows of \code{data}}

\item{k}{the number of folds}

\item{cex.rg}{the range of magnification to be used for the points in the
plot}

\item{col.av}{the two colors used to denote, respectively, the rate of correct
predictions in the i-th fold and the average rate over all k folds}

\item{...}{arguments passed to \code{\link[graphics]{points}} to draw the
points which denote the correct rate}
}
\value{
A list containing \item{accuracy }{a matrix in which the element in
  the i-th row and j-th column is the rate of correct predictions from
  LDA, i.e. the rate obtained by building an LDA model with j variables
  and predicting on the data in the i-th fold (the test set) }
  \item{optimum }{the optimum number of features based on the
  cross-validation}
}
\description{
This function provides an illustration of the process of finding the
optimum number of variables using k-fold cross-validation in a linear
discriminant analysis (LDA).
}
\details{
For a classification problem, we usually wish to use as few variables as
possible because of the difficulties brought by high dimensionality.

The selection procedure is as follows:

\itemize{
  \item Split the whole data randomly into \eqn{k} folds:
    \itemize{
      \item For the number of features \eqn{g = 1, 2, \cdots, g_{max}}{g = 1, 2,
..., gmax}, choose \eqn{g} features that have the largest discriminatory
power (measured by the F-statistic in ANOVA):
        \itemize{
          \item For the fold \eqn{i} (\eqn{i = 1, 2, \cdots, k}{i = 1, 2, ..., k}):
           \itemize{
             \item
Train an LDA model without the \eqn{i}-th fold data, and predict with the
\eqn{i}-th fold for a proportion of correct predictions
\eqn{p_{gi}}{p[gi]};
           }
         }
      \item Average the \eqn{k} proportions to get the correct rate \eqn{p_g}{p[g]};
    }
  \item Determine the optimum number of features as the one with the largest \eqn{p_g}{p[g]}.
}

Note that \eqn{g_{max}}{gmax} is set by \code{ani.options('nmax')} (i.e. the
maximum number of features we want to choose).
}
\references{
Examples at \url{https://yihui.org/animation/example/cv-nfeatureslda/}

  Maindonald J, Braun J (2007). \emph{Data Analysis and Graphics
  Using R - An Example-Based Approach}. Cambridge University Press, 2nd
  edition, p. 400.
}
\seealso{
\code{\link{kfcv}}, \code{\link{cv.ani}}, \code{\link[MASS]{lda}}
}
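The selection procedure described in the Details section can be sketched
directly in base R plus \pkg{MASS}. The following is a minimal illustration,
not the package implementation: for brevity it ranks the predictors by their
ANOVA F statistics once on the full data, whereas a stricter procedure would
re-rank within each training fold.

```r
library(MASS)  # for lda()

set.seed(1)
data <- matrix(rnorm(600), 60)   # 60 observations, 10 predictors
cl <- gl(3, 20)                  # 3 classes, 20 rows each
k <- 5                           # number of folds
gmax <- 5                        # maximum number of features to try

## assign rows to k folds at random
fold <- sample(rep(seq_len(k), length.out = nrow(data)))

## rank predictors by their ANOVA F statistic (computed on the full
## data here for simplicity; see the caveat in the lead-in)
fstat <- apply(data, 2, function(x) anova(lm(x ~ cl))[["F value"]][1])
ord <- order(fstat, decreasing = TRUE)

## acc[i, g]: rate of correct predictions in fold i using g features
acc <- matrix(NA, k, gmax)
for (g in seq_len(gmax)) {
  vars <- ord[seq_len(g)]
  for (i in seq_len(k)) {
    test <- which(fold == i)
    fit <- lda(data[-test, vars, drop = FALSE], cl[-test])
    pred <- predict(fit, data[test, vars, drop = FALSE])$class
    acc[i, g] <- mean(pred == cl[test])  # p[gi]
  }
}

p_g <- colMeans(acc)        # average correct rate for each g
optimum <- which.max(p_g)   # number of features with the largest p[g]
```

The function itself can be run with its defaults, e.g.
\code{res <- cv.nfeaturesLDA()}, after which \code{res$accuracy} and
\code{res$optimum} correspond to \code{acc} and \code{optimum} above.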
\author{
Yihui Xie <\url{https://yihui.org/}>
}