File: keyTemplate.Rd

% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/variableKey.R
\name{keyTemplate}
\alias{keyTemplate}
\title{Create variable key template (in memory or in a file)}
\usage{
keyTemplate(
  dframe,
  long = FALSE,
  sort = FALSE,
  file = NULL,
  max.levels = 15,
  missings = NULL,
  missSymbol = ".",
  safeNumericToInteger = TRUE,
  trimws = "both",
  varlab = FALSE
)
}
\arguments{
\item{dframe}{A data frame}

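\item{long}{Default FALSE, which creates a wide key with one row
per variable. If TRUE, a long key is created, with one row per
value per variable. See Details.}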

\item{sort}{Default FALSE. Should the rows representing the
variables be sorted alphabetically? Otherwise, they appear in
the order in which they were included in the original dataset.}

\item{file}{Default NULL, meaning no file is produced. Choose a
file name ending in either "csv" (for comma separated
values), "xlsx" (compatible with Microsoft Excel), or "rds"
(R serialization data). The file name extension is used to select
among the three storage formats. XLSX output requires the openxlsx
package.}

\item{max.levels}{Upper limit on the number of distinct values
for discrete (integer, character, and Date) variables.
Default = 15. If the observed number exceeds max.levels, we
conclude the author should not assign new values in the key,
and only the missing value will be included in the key as a
"placeholder". This does not affect variables declared as
factor or ordered variables, for which all levels are included
in all cases.}

\item{missings}{Values in existing data which should be treated as
missing in the new key. Character string in a format acceptable
to the \code{assignMissing} function. Can be a string with
several missing indicators, such as "1;2;3;(8,10);[22,24];> 99;< 2".}

\item{missSymbol}{Default is the period, ".". A character string
used to represent missing values in the key that is created.
Relevant (mostly) for the key's \code{value_new} column.
Because R's symbol \code{NA} can be mistaken for
the character string \code{"NA"}, we use a different
(hopefully unmistakable) symbol in the key.}

\item{safeNumericToInteger}{Default TRUE: Should we treat values
which appear to be integers as integers? If a column is
numeric, it might be safe to treat it as an integer.  In many
csv data sets, the values coded c(1, 2, 3) are really
integers, not floats c(1.0, 2.0, 3.0). See \code{safeInteger}.}

\item{trimws}{Default is "both". The user can change this to "left"
or "right", or set it to NULL to avoid any trimming.}

\item{varlab}{A key can have a companion data structure for
variable labels. Default is FALSE, but the value may also be
TRUE or a named vector of variable labels, such as
\code{c("x1" = "happiness", "x2" = "wealth")}. The labels
become an attribute of the key object. See Details for
information on storage of varlabs in saved key files.}
}
\value{
A key in the form of a data frame. It is also saved on
    disk if the file argument is supplied. The key may have an
    attribute "varlab" that holds the variable labels.
}
\description{
A variable key is a human readable document that describes the
variables in a data set. A key can be revised and re-imported by R
to recode data. This might also be referred to as a
"programmable codebook."  This function inspects a data frame,
takes notice of its variable names, their classes, and legal
values, and then it creates a table summarizing that
information. The aim is to create a document that principal
investigators and research assistants can use to keep a project
well organized.  Please see the vignette in this package.
}
\details{
The variable key can be created in two formats, wide and long.
The original style of the variable key, wide, has one row per
variable. It uses a compact notation for current values and
required recodes. The wide format is probably easier for
experts to read, but perhaps more difficult to edit. The long
style variable key has one row per value per variable.  Thus, in a
larger project, the long key can have many rows. However, in a
larger project, the long style key is easier to edit with a
spreadsheet program.

After a key is created, it should be re-imported into R with the
\code{kutils::keyImport} function.  Then the key structure can
guide the importation and recoding of the data set.

Concerning the varlab attribute: run \code{attr(key, "varlab")} to
review existing labels, if any.

Storing the variable labels in files requires some care because
the \code{rds}, \code{xlsx}, and \code{csv} formats have different
capabilities.  The \code{rds} storage format saves all attributes without
difficulty. In contrast, because \code{csv} and \code{xlsx} do not save
attributes, the varlabs are stored as separate character
matrices. For \code{xlsx} files, the varlab object is saved as a second
sheet in the \code{xlsx} file, while for \code{csv} a second file suffixed
"-varlab.csv" is created.
}
\examples{
set.seed(234234)
N <- 200
mydf <- data.frame(x5 = rnorm(N),
                   x4 = rpois(N, lambda = 3),
                   x3 = ordered(sample(c("lo", "med", "hi"),
                                       size = N, replace = TRUE),
                                levels = c("med", "lo", "hi")),
                   x2 = letters[sample(c(1:4,6), N, replace = TRUE)],
                   x1 = factor(sample(c("cindy", "bobby", "marcia",
                                        "greg", "peter"), N,
                                      replace = TRUE)),
                   x7 = ordered(letters[sample(c(1:4,6), N, replace = TRUE)]),
                   x6 = sample(c(1:5), N, replace = TRUE),
                   stringsAsFactors = FALSE)
mydf$x4[sample(1:N, 10)] <- 999
mydf$x5[sample(1:N, 10)] <- -999
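
## A minimal base R sketch of the idea behind safeNumericToInteger.
## Assigning the missing code 999 to a count variable, as done for
## x4 above, converts an integer column to numeric even though every
## value is still a whole number. The helper looksLikeInteger is
## hypothetical, written only for this illustration; it is not part
## of kutils (see safeInteger for the check that kutils itself uses).
counts <- c(3L, 1L, 4L)
counts[2] <- 999
class(counts)
looksLikeInteger <- function(x, tol = sqrt(.Machine$double.eps)) {
    ## drop missing values, then test for whole numbers within tol
    x <- x[!is.na(x)]
    is.numeric(x) && all(abs(x - round(x)) < tol)
}
looksLikeInteger(counts)
looksLikeInteger(c(1.5, 2, 3))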

## Note: If we change this example data, we need to save a copy in
## "../inst/extdata" for packaging
dn <- tempdir()
write.csv(mydf, file = file.path(dn, "mydf.csv"), row.names = FALSE)
mydf.templ <- keyTemplate(mydf, file = file.path(dn, "mydf.templ.csv"),
                          varlab = TRUE)
mydf.templ_long <- keyTemplate(mydf, long = TRUE,
                            file = file.path(dn, "mydf.templlong.csv"),
                            varlab = TRUE)

mydf.templx <- keyTemplate(mydf, file = file.path(dn, "mydf.templ.xlsx"),
                            varlab = TRUE)
mydf.templ_longx <- keyTemplate(mydf, long = TRUE,
                             file = file.path(dn, "mydf.templ_long.xlsx"),
                             varlab = TRUE)
## Check the varlab attribute
attr(mydf.templ, "varlab")
mydf.tmpl2 <- keyTemplate(mydf,
                         varlab = c(x5 = "height", x4 = "age",
                         x3 = "intelligence", x1 = "Name"))
## Check the varlab attribute
attr(mydf.tmpl2, "varlab")

## Try with the national longitudinal study data
data(natlongsurv)
natlong.templ <- keyTemplate(natlongsurv,
                          file = file.path(dn, "natlongsurv.templ.csv"),
                          max.levels = 15, varlab = TRUE, sort = TRUE)

natlong.templlong <- keyTemplate(natlongsurv, long = TRUE,
                   file = file.path(dn, "natlongsurv.templ_long.csv"), sort = TRUE)
if(interactive()) View(natlong.templlong)
natlong.templlong2 <- keyTemplate(natlongsurv, long = TRUE,
                      missings = "<0", max.levels = 50, sort = TRUE,
                      varlab = TRUE)
if(interactive()) View(natlong.templlong2)

natlong.templwide2 <- keyTemplate(natlongsurv, long = FALSE,
                      missings = "<0", max.levels = 50, sort = TRUE)
if(interactive()) View(natlong.templwide2)

all.equal(wide2long(natlong.templwide2), natlong.templlong2)

head(keyTemplate(natlongsurv, file = file.path(dn, "natlongsurv.templ.xlsx"),
             max.levels = 15, varlab = TRUE, sort = TRUE), 10)
## Use a separate file name so the long key does not overwrite the wide key
head(keyTemplate(natlongsurv, file = file.path(dn, "natlongsurv.templ_long.xlsx"),
             long = TRUE, max.levels = 15, varlab = TRUE, sort = TRUE), 10)

list.files(dn)
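
## Why the key uses missSymbol rather than NA: a base R sketch.
## A true missing value and the string "NA" differ inside R, but
## they look the same once pasted into text, which is the confusion
## missSymbol is meant to avoid. The vector vals is made up for
## illustration only.
vals <- c("NA", NA, ".")
is.na(vals)
paste(vals)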

}
\author{
Paul Johnson <pauljohn@ku.edu>
}