% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/dummyVar.R
\name{dummyVars}
\alias{dummyVars}
\alias{dummyVars.default}
\alias{predict.dummyVars}
\alias{contr.dummy}
\alias{contr.ltfr}
\alias{class2ind}
\alias{print.dummyVars}
\title{Create A Full Set of Dummy Variables}
\usage{
dummyVars(formula, ...)

\method{dummyVars}{default}(formula, data, sep = ".", levelsOnly = FALSE, fullRank = FALSE, ...)

\method{print}{dummyVars}(x, ...)

\method{predict}{dummyVars}(object, newdata, na.action = na.pass, ...)

contr.ltfr(n, contrasts = TRUE, sparse = FALSE)

class2ind(x, drop2nd = FALSE)
}
\arguments{
\item{formula}{An appropriate R model formula, see References}

\item{...}{additional arguments to be passed to other methods}

\item{data}{A data frame with the predictors of interest}

\item{sep}{An optional separator between factor variable names and their
levels. Use \code{sep = NULL} for no separator (i.e. normal behavior of
\code{\link[stats]{model.matrix}} as shown in the Details section)}

\item{levelsOnly}{A logical; \code{TRUE} means to completely remove the
variable names from the column names}

\item{fullRank}{A logical; should a full rank or less than full rank
parameterization be used? If \code{TRUE}, factors are encoded to be
consistent with \code{\link[stats]{model.matrix}} and the resulting columns
have no linear dependencies induced between them.}

\item{x}{A factor vector (for \code{class2ind}) or an object of class
\code{dummyVars} (for \code{print}).}

\item{object}{An object of class \code{dummyVars}}

\item{newdata}{A data frame with the required columns}

\item{na.action}{A function determining what should be done with missing
values in \code{newdata}. The default is to predict \code{NA}.}

\item{n}{A vector of levels for a factor, or the number of levels.}

\item{contrasts}{A logical indicating whether contrasts should be computed.}

\item{sparse}{A logical indicating if the result should be sparse.}

\item{drop2nd}{A logical: if the factor has two levels, should a single binary vector be returned?}
}
\value{
The output of \code{dummyVars} is a list of class 'dummyVars' with
elements
\item{call }{the function call}
\item{form }{the model formula}
\item{vars }{names of all the variables in the model}
\item{facVars }{names of all the factor variables in the model}
\item{lvls }{levels of any factor variables}
\item{sep }{\code{NULL} or a character separator}
\item{terms }{the \code{\link[stats]{terms.formula}} object}
\item{levelsOnly }{a logical}

The \code{predict} function produces a numeric matrix of dummy variables
(wrap it in \code{data.frame()} if a data frame is needed).

\code{class2ind} returns a matrix (or a vector if \code{drop2nd = TRUE}).

\code{contr.ltfr} generates a design matrix.
}
\description{
\code{dummyVars} creates a full set of dummy variables (i.e. a less than
full rank parameterization).
}
\details{
Most of the \code{\link[stats]{contrasts}} functions in R produce full rank
parameterizations of the predictor data. For example,
\code{\link[stats]{contr.treatment}} creates a reference cell in the data
and defines dummy variables for all factor levels except the one in the
reference cell. If a factor with 5 levels is used alone in a model formula,
\code{\link[stats]{contr.treatment}} creates a column for the intercept and
one column for each factor level except the first, i.e. four dummy columns.
For the seven-level \code{day} factor in the Examples section below, this
would produce:
\preformatted{ (Intercept) dayTue dayWed dayThu dayFri daySat daySun
           1      0      0      0      0      0      0
           1      0      0      0      0      0      0
           1      0      0      0      0      0      0
           1      0      1      0      0      0      0
           1      0      1      0      0      0      0
           1      0      0      0      1      0      0
           1      0      0      0      0      1      0
           1      0      0      0      0      1      0
           1      0      0      0      1      0      0}

In some situations, there may be a need for dummy variables for all the
levels of the factor. For the same example:
\preformatted{ dayMon dayTue dayWed dayThu dayFri daySat daySun
      1      0      0      0      0      0      0
      1      0      0      0      0      0      0
      1      0      0      0      0      0      0
      0      0      1      0      0      0      0
      0      0      1      0      0      0      0
      0      0      0      0      1      0      0
      0      0      0      0      0      1      0
      0      0      0      0      0      1      0
      0      0      0      0      1      0      0}

Given a formula and initial data set, the class \code{dummyVars} gathers all
the information needed to produce a full set of dummy variables for any data
set. It uses \code{contr.ltfr} as the base function to do this.

\code{class2ind} is most useful for converting a factor outcome vector to a
matrix (or vector) of dummy variables.
}
\examples{
when <- data.frame(time = c("afternoon", "night", "afternoon",
                            "morning", "morning", "morning",
                            "morning", "afternoon", "afternoon"),
                   day = c("Mon", "Mon", "Mon",
                           "Wed", "Wed", "Fri",
                           "Sat", "Sat", "Fri"),
                   stringsAsFactors = TRUE)

levels(when$time) <- list(morning="morning",
                          afternoon="afternoon",
                          night="night")
levels(when$day) <- list(Mon="Mon", Tue="Tue", Wed="Wed", Thu="Thu",
                         Fri="Fri", Sat="Sat", Sun="Sun")

## Default behavior:
model.matrix(~day, when)
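
## For comparison (a sketch using only base R): with a single factor,
## dropping the intercept also gives one column per level, matching the
## second table in Details. This does not carry over to formulas with
## several factors, where only the first factor keeps all of its levels.
## The name fullSet is only illustrative.
fullSet <- model.matrix(~ day - 1, when)
colnames(fullSet)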

mainEffects <- dummyVars(~ day + time, data = when)
mainEffects
predict(mainEffects, when[1:3,])

when2 <- when
when2[1, 1] <- NA
predict(mainEffects, when2[1:3,])
predict(mainEffects, when2[1:3,], na.action = na.omit)


interactionModel <- dummyVars(~ day + time + day:time,
                              data = when,
                              sep = ".")
predict(interactionModel, when[1:3,])
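
## A sketch of the sep = NULL case described under Arguments: with no
## separator, the column names should follow the model.matrix convention
## (variable name immediately followed by the level). The object name
## noSep is only illustrative.
noSep <- dummyVars(~ day + time, data = when, sep = NULL)
colnames(predict(noSep, when[1:3,]))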

noNames <- dummyVars(~ day + time + day:time,
                     data = when,
                     levelsOnly = TRUE)
predict(noNames, when)
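
## fullRank = TRUE is described under Arguments but not exercised above;
## a minimal sketch (the object name reducedRank is only illustrative).
## The first level of each factor is expected to be dropped, as
## model.matrix would do, so there should be fewer columns than with
## mainEffects.
reducedRank <- dummyVars(~ day + time, data = when, fullRank = TRUE)
ncol(predict(mainEffects, when))
ncol(predict(reducedRank, when))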

head(class2ind(iris$Species))

two_levels <- factor(rep(letters[1:2], each = 5))
class2ind(two_levels)
class2ind(two_levels, drop2nd = TRUE)
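
## contr.ltfr is the contrast function that dummyVars uses (see Details).
## A sketch comparing it with contr.treatment for three hypothetical
## levels: contr.treatment has no column for the first level, while
## contr.ltfr is expected to keep a column for every level.
lv <- c("a", "b", "c")
contr.treatment(lv)
contr.ltfr(lv)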
}
\references{
\url{https://cran.r-project.org/doc/manuals/R-intro.html#Formulae-for-statistical-models}
}
\seealso{
\code{\link[stats]{model.matrix}}, \code{\link[stats]{contrasts}},
\code{\link[stats]{formula}}
}
\author{
\code{contr.ltfr} is a small modification of
\code{\link[stats]{contr.treatment}} by Max Kuhn
}
\keyword{models}