File: findCorrelation.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/findCorrelation.R
\name{findCorrelation}
\alias{findCorrelation}
\title{Determine highly correlated variables}
\usage{
findCorrelation(
  x,
  cutoff = 0.9,
  verbose = FALSE,
  names = FALSE,
  exact = ncol(x) < 100
)
}
\arguments{
\item{x}{A correlation matrix}

\item{cutoff}{A numeric value for the pair-wise absolute correlation cutoff}

\item{verbose}{a logical; should the details of the search be printed?}

\item{names}{a logical; should the column names be returned (\code{TRUE}) or
the column index (\code{FALSE})?}

\item{exact}{a logical; should the average correlations be recomputed at
each step? See Details below.}
}
\value{
A vector of indices denoting the columns to remove (when
\code{names = FALSE}); otherwise, a vector of column names. If no
correlations meet the criteria, \code{integer(0)} is returned.
}
\description{
This function searches through a correlation matrix and returns a vector of
integers (or column names, when \code{names = TRUE}) corresponding to the
columns to remove to reduce pair-wise correlations.
}
\details{
The absolute values of pair-wise correlations are considered. If the
absolute correlation between two variables is above \code{cutoff}, the
function looks at the mean absolute correlation of each of the two variables
and removes the one with the larger mean absolute correlation.

Using \code{exact = TRUE} causes the function to re-evaluate the average
correlations at each step, while \code{exact = FALSE} uses all of the
correlations, regardless of whether the variables involved have already been
eliminated. The exact calculations will remove a smaller number of predictors
but can be much slower when the matrix has many columns; by default,
\code{exact} is \code{TRUE} only when \code{x} has fewer than 100 columns.
}
\examples{

R1 <- structure(c(1, 0.86, 0.56, 0.32, 0.85, 0.86, 1, 0.01, 0.74, 0.32,
                  0.56, 0.01, 1, 0.65, 0.91, 0.32, 0.74, 0.65, 1, 0.36,
                  0.85, 0.32, 0.91, 0.36, 1),
                .Dim = c(5L, 5L))
colnames(R1) <- rownames(R1) <- paste0("x", 1:ncol(R1))
R1

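# Column indices to remove, without and with re-computing the averages
findCorrelation(R1, cutoff = .6, exact = FALSE)
findCorrelation(R1, cutoff = .6, exact = TRUE)
# The same search, returning column names instead of indices
findCorrelation(R1, cutoff = .6, exact = TRUE, names = TRUE)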
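
## A minimal sketch of the quantity described in Details: the mean absolute
## correlation of each variable with the others. This is a base R
## illustration only, not the internal code of the package; the names
## absR1 and meanAbsCor are made up for this example, and excluding the
## diagonal is an assumption of the sketch.
absR1 <- abs(R1)
diag(absR1) <- NA                  # ignore the correlation of a variable with itself
meanAbsCor <- colMeans(absR1, na.rm = TRUE)
meanAbsCor

## Pairs of variables whose absolute correlation is above the cutoff of .6;
## within such a pair, the variable with the larger meanAbsCor is the
## candidate for removal.
which(absR1 > .6 & upper.tri(absR1), arr.ind = TRUE)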


R2 <- diag(rep(1, 5))
R2[2, 3] <- R2[3, 2] <- .7
R2[5, 3] <- R2[3, 5] <- -.7
R2[4, 1] <- R2[1, 4] <- -.67

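# Plot the correlation matrix; levelplot() comes from the lattice package,
# which is attached along with caret
corrDF <- expand.grid(row = 1:5, col = 1:5)
corrDF$correlation <- as.vector(R2)
levelplot(correlation ~ row + col, corrDF)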

findCorrelation(R2, cutoff = .65, verbose = TRUE)

findCorrelation(R2, cutoff = .99, verbose = TRUE)
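
## A hedged sketch of how the result might be used to filter a data set.
## The simulated data frame dat and the names toDrop and filtered are
## assumptions made for this example. Because integer(0) is returned when
## nothing meets the cutoff, the result is checked before negative indexing:
## dat[, -integer(0)] would select no columns at all.
set.seed(1)
dat <- data.frame(a = rnorm(50))
dat$b <- dat$a + rnorm(50, sd = 0.1)   # nearly a copy of column a
dat$c <- rnorm(50)                     # unrelated to a and b

toDrop <- findCorrelation(cor(dat), cutoff = .9)
filtered <- if (length(toDrop) > 0) dat[, -toDrop, drop = FALSE] else dat
names(filtered)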

}
\seealso{
\code{\link{findLinearCombos}}
}
\author{
Original R code by Dong Li, modified by Max Kuhn
}
\keyword{manip}