%% LyX 2.3.7 created this file.  For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[english,american,noae]{scrartcl}
\usepackage{lmodern}
\renewcommand{\sfdefault}{lmss}
\renewcommand{\ttdefault}{lmtt}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{geometry}
\geometry{verbose,tmargin=1in,bmargin=1in,lmargin=1in,rmargin=1in}
\usepackage{float}
\usepackage{setspace}
\usepackage[authoryear]{natbib}

\makeatletter

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
%% Because html converters don't know tabularnewline
\providecommand{\tabularnewline}{\\}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
<<echo=F>>=
  if(exists(".orig.enc")) options(encoding = .orig.enc)
@
\providecommand*{\code}[1]{\texttt{#1}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
%\VignetteIndexEntry{variablekey}

\usepackage{booktabs}
\usepackage{Sweavel}
\usepackage{graphicx}
\usepackage{color}

\usepackage[samesize]{cancel}



\usepackage{ifthen}

\makeatletter

\renewenvironment{figure}[1][]{%

 \ifthenelse{\equal{#1}{}}{%

   \@float{figure}

 }{%

   \@float{figure}[#1]%

 }%

 \centering

}{%

 \end@float

}

\renewenvironment{table}[1][]{%

 \ifthenelse{\equal{#1}{}}{%

   \@float{table}

 }{%

   \@float{table}[#1]%

 }%

 \centering

%  \setlength{\@tempdima}{\abovecaptionskip}%

%  \setlength{\abovecaptionskip}{\belowcaptionskip}%

% \setlength{\belowcaptionskip}{\@tempdima}%

}{%

 \end@float

}


%\usepackage{listings}
% Make ordinary listings look as if they come from Sweave
\lstset{tabsize=2, breaklines=true, %,style=Rstyle}
                        fancyvrb=false,escapechar=`,language=R,%
                        %%basicstyle={\Rcolor\Sweavesize},%
                        backgroundcolor=\Rbackground,%
                        showstringspaces=false,%
                        keywordstyle=\Rcolor,%
                        commentstyle={\Rcommentcolor\ttfamily\itshape},%
                        literate={<<-}{{$\twoheadleftarrow$}}2{~}{{$\sim$}}1{<=}{{$\leq$}}2{>=}{{$\geq$}}2{^}{{$^{\scriptstyle\wedge}$}}1{==}{{=\,=}}1,%
                        alsoother={$},%
                        alsoletter={.<-},%
                        otherkeywords={!,!=,~,$,*,\&,\%/\%,\%*\%,\%\%,<-,<<-,/},%
                        escapeinside={(*}{*)}}%


% In document Latex options:
\fvset{listparameters={\setlength{\topsep}{0em}}}
\def\Sweavesize{\scriptsize} 
\def\Rcolor{\color{black}} 
\def\Rbackground{\color[gray]{0.95}}

% for sideways table
\usepackage{rotating}

\makeatother

\usepackage{babel}
\begin{document}
\title{The Variable Key Data Management Framework}
\author{\selectlanguage{english}%
Paul E. Johnson<pauljohn@ku.edu>\\
Benjamin A. Kite<bakite@ku.edu> \\
Center for Research Methods and Data Analysis\\
University of Kansas}
\date{\selectlanguage{english}%
\singlespacing{}\noindent \textbf{\today}}
\maketitle
\selectlanguage{american}%
\begin{abstract}
This essay describes the ``variable key'' approach to importing
and recoding data. This method has been developed in the Center for
Research Methods and Data Analysis at the University of Kansas to
deal with the importation of large, complicated data sets. This approach
improves teamwork, keeps better records, and reduces slippage between
the intentions of principal investigators and the implementation by code
writers. The framework is implemented in the R \citep{RCore} package
\code{kutils}. 
\end{abstract}
<<echo=F>>=
if(!dir.exists("plots")) dir.create("plots")
@

\selectlanguage{english}%
% In document Latex options:
\fvset{listparameters={\setlength{\topsep}{0em}}}
\SweaveOpts{prefix.string=plots/t,split=F,ae=F,height=4,width=5.5}

<<Roptions, echo=F>>=
options(device = pdf)
options(width=100, prompt=" ", continue="  ")
options(useFancyQuotes = FALSE) 
options(SweaveHooks=list(fig=function() par(ps=10)))
pdf.options(onefile=F,family="Times",pointsize=10)
@
\selectlanguage{american}%

\section{Introduction}

The staff of the Center for Research Methods and Data Analysis has
been asked to help with data importation and recoding from time to
time. In one very large project, we were asked to combine, recode,
and integrate variables from 21 different files. The various files
used different variable names and had different, unique coding schemes.
A skeptic might have thought that the firm which created the data
sets intentionally obfuscated the records to prevent the comparison
of variables across a series of surveys.

In projects like that, the challenge of importing and fixing the data
seems overwhelming. The graduate research assistants are asked to
cobble together thousands of lines of ``recodes'' as they rename,
regroup, and otherwise harmonize the information. From a managerial
point of view, that is not the main problem. We expect to spend the
time of research assistants. While it may be tedious to read a codebook
and write recodes, social scientists have been doing that since the
1960s. It is not all that difficult. The truly difficult part is mustering
up the confidence in the resulting recoded data. How can a supervisor
check thousands of recode statements for accuracy? The very extensibility
of R itself–its openness to new functions and language elements–makes
proof-reading more difficult. We might shift some of the proof-reading
duty to the principal investigators, but they sometimes are not interested
in details. In the end, the responsibility for verifying the recodes
falls on the project supervisors. While most supervisors with whom
we are personally acquainted have nearly super-human reading skills
and almost-perfect comprehension, we have documented a case in which
one of them was unable to catch an error on line 827 within an R file
with 2119 lines. 

To reduce the risk of misunderstanding and error, we propose the \emph{variable
key procedure}. It is a systematic way to separate code writing from
the process of renaming variables and re-designating their values.
The characteristics of the data are summarized in a table, a simple-looking
structure that might be edited in a text editor or a spread sheet
program. This simple structure, which we call the variable key, can
be used by principal investigators and supervisors to designate the
desired results. Once the key is created, then the data set can be
imported and recoded by the application of the key's information.
This does not eliminate the need to proof-read the renaming and recoding
of the variables; it simply shifts that chore into a simpler, more
workable setting.

This essay proceeds in 3 parts. First, the general concepts behind
the variable key system are explored. Second, the four stages in the
variable key procedure are outlined and illustrated with examples.
Third, we offer some examples of ways to double-check the results.

\section{Enter the Variable Key}

The variable key process was first developed for a very large project
for which we were hired by a commercial consulting company. As it
happened, the project manager who hired us was an Excel user who did
not know about R. He was given several SPSS datasets. After going
through the usual R process of importing and recoding data from 6
files, the aggregate of which included more than 40,000 observations
on 150 variables, we arrived at a renamed set of columns. Unfortunately,
the research assistant who had done most of the work resigned in order
to pursue a career as a magician.\footnote{Or graduated; we are not sure which.} 

With the unavailability of our key asset, it was difficult to know
for sure what was in which column. There was nobody to quickly answer
questions like ``which column is the respondent's sexual identity?''
and ``if sex is V23418, did we change 1 to male or female?'' The
only way to find out was by hunting and pecking through a giant R file. 

In order to better communicate about that project, we developed a
table that looked like Table \ref{tab:A-Small-Variable-key}.

\begin{table}[H]
\caption{A Small Variable Key\label{tab:A-Small-Variable-key}}

\begin{tabular}{|c|c|c|c|}
\hline 
name\_old & name\_new & values\_old & values\_new\tabularnewline
\hline 
\hline 
V23419 & sex & 1|2|3 & ``male''|``female''|``neither''\tabularnewline
\hline 
V32422 & education & 1|2|3|4|5 & ``elem''<``hs''<``somecoll''<``ba''<``post''\tabularnewline
\hline 
V54532 & income & . & numeric\tabularnewline
\hline 
\end{tabular}
\end{table}

It was tedious to assemble that table, but it helped quite a bit in
our discussions. The vertical bars were used to indicate that the
original data had discrete values. When a variable has a natural ordering,
the new values were placed in order with the symbol (``<''). That
table grew quite large, since it had one row per variable, but it
was otherwise workable. It was popular with the client. 

In the middle of preparing that summary table of recoded values, we
realized that it was possible to write an R program to import the
key table and use its information to recode and rename the variables.
The recodes would \emph{just happen}. If we prepared the functions
properly, we had not just a table masquerading as a codebook, we had
a \emph{programmable codebook}. We wrote some functions that could
import variables (as named in column 1), apply the new values (from
columns 3 and 4), then apply the new name from column 2. The functions
to do that are somewhat difficult to prepare, but they are very appealing
from a supervisor's point of view. There will be less proof-reading
to do, at least in the R code itself. Once we can validate the functions,
then we never have to proof-read them again. These functions can be
applied, row by row, to create a new data frame. We can instead
concentrate our attention on the substance of the problem: the
specification of the new names and values in the table.

In the projects where we have employed this system, we adjusted the
key and the R functions to suit the particular demands of the project
and the client. That was unfortunate, because we had very little accumulation
of code from one project to another. However, we did accumulate experience;
there were concepts and vocabulary which allowed us to understand
the various challenges that might be faced. The effort to develop
a standardized framework for the variable key began in 2016 with the
creation of the \code{kutils} package for R.

The variable key process allows project supervisors to create a table
that instructs the research assistants in the importation, renaming,
and recoding of data. There is still a daunting problem, however,
because the supervisors must create that variable key table. In a
large data set, it might be arduous to simply type the old names of
the variables and their observed values. In 2015 one of the graduate
assistants in our lab was asked to type up a variable key and he couldn't
quite believe that was a good use of his time. After some discussion,
we realized that it was not necessary to type the variable key at
all. We would write a function to do so. If R can import the candidate
data set, then R can certainly output its column names and a roster
of observed values. This lightened the workload considerably. By tabulating
all of the observed variables and their values, the most tedious part
of the process was done mechanically. 

In the remainder of this essay, we discuss the process of creating
a variable key template, revising it, and putting it to use. 

\section{Four Simple Steps}

The variable key process has four steps. First, inspect an R data.frame
object and create a key template file. The key template summarizes
the existing state of the variables and creates ``placeholders''
where we might like to specify revisions. Second, edit the key template
file in a spreadsheet or other program that can work with comma-separated
values. Change the names and values, and designate other recodes (which
we will describe below). Third, import the revised key into R. Fourth,
apply the key to the data to generate a new, improved data frame.
Then run some diagnostic routines.

If all goes well, we should end up with a new data frame in which
\begin{enumerate}
\item The columns are renamed in accordance with the instructions of the
principal investigator (or supervisor).
\item The values of all variables have been recoded according to the instructions
of the principal investigator (or supervisor).
\end{enumerate}
Diagnostic tables are reported to clearly demonstrate the effect of
each coding change, mapping out the difference between the input and
the output variables. 

For purposes of illustration, we have created an example data frame
with various types of variables. This data frame, \code{mydf}, has
most of the challenges that we see in actual projects. It has integer
variables that need to be reorganized and turned into character or
factor variables. It has character variables that might become integers
or factors. 

<<eval=T,include=T>>=
set.seed(234234)
N <- 200
mydf <- data.frame(
    x5 = rnorm(N),                                       # floating point
    x4 = rpois(N, lambda = 3),                           # integer counts
    x3 = ordered(sample(c("lo", "med", "hi"), size = N, replace = TRUE),
            levels = c("med", "lo", "hi")),              # badly ordered factor
    x2 = letters[sample(c(1:4, 6), N, replace = TRUE)],  # character
    x1 = factor(sample(c("cindy", "jan", "marcia"), N,
            replace = TRUE)),                            # unordered factor
    x7 = ordered(letters[sample(c(1:4, 6), N, replace = TRUE)]),
    x6 = sample(c(1:5), N, replace = TRUE),              # integer
    stringsAsFactors = FALSE)
## Insert some values that should later be declared missing
mydf$x4[sample(1:N, 10)] <- 999
mydf$x5[sample(1:N, 10)] <- -999
@

\subsection{Step 1. Create a Key Template}

The function \code{keyTemplate} scans a data frame and generates
a new key template. The key has 8 pieces of information about each
variable. The rows of the key are named by the variables of the data
frame. The 8 columns in the key are \code{name\_old}, \code{name\_new},
\code{class\_old}, \code{class\_new}, \code{value\_old}, \code{value\_new},
\code{missings}, and \code{recodes}. \code{keyTemplate} fills
\code{name\_old}, \code{class\_old}, and \code{value\_old} with
values drawn from the input data, while the \code{new} columns
are copies of those old values. The last two columns, \code{missings}
and \code{recodes}, are left empty. 

There are two formats for the key template, \code{long} and \code{wide}
(determined by the parameter \code{long}). These names are drawn
from terminology in R's \code{reshape} function. The long format has one
row per value of each variable, while the wide format has all of the
information in one row. The two key formats are intended to be interchangeable
in functionality; they differ solely for convenience. Some users may
prefer to edit variable information in one style. The re-importation
of the key should deal gracefully with either type of variable key. 

A wide format key can be produced with a call to the \code{keyTemplate}
function like so:

<<eval=F>>=
key_wide <- keyTemplate(mydf, file = "key_wide.csv", max.levels = 5)
@

\noindent If the \code{long} argument is not specified, a wide key is the
default. One can ask for the long format:

<<eval=F>>=
key_long <- keyTemplate(mydf, long = TRUE, file = "key_long.csv", max.levels = 5)
@

\noindent The key object is a data.frame. 

Apart from the long argument, the \code{keyTemplate} function has
two especially noteworthy arguments, \code{file} and \code{max.levels}.
If the file argument is supplied, \code{keyTemplate} uses the suffix
to determine storage format. Legal suffixes are \code{.csv}, \code{.xlsx},
and \code{.rds} (for creating comma separated variables, Excel spreadsheets,
and R serialization data structures). 
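
For instance, the same template can be written in any of the three
storage formats simply by changing the suffix:

<<eval=F>>=
## The file suffix selects the storage format
keyTemplate(mydf, file = "key.csv")   # comma-separated values
keyTemplate(mydf, file = "key.xlsx")  # Excel spreadsheet
keyTemplate(mydf, file = "key.rds")   # R serialization
@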

The \code{max.levels} argument is also important. This is used in
the same sense that functions like \code{read.spss} in the foreign
package use that term. There is guessing involved in deciding if we
should enumerate a character or integer variable. We do want to enumerate
the ``Strongly Disagree'' to ``Strongly Agree'' values of a 7
point scale, but we do not want to enumerate the first names of all
study participants. If the number of discrete values exceeds \code{max.levels},
for which the default is 15, then the key will not enumerate them. 

Table \ref{tab:The-Wide-Key} demonstrates a wide key template as
it is produced by \code{keyTemplate}. We see now why it is called
a wide key; the recoding information is tightly packed into \code{value\_old}
and \code{value\_new}. The key includes more or less obvious columns
for the old and new variable names, their classes, and values of the
variables. Note that the values of \code{x5} and \code{x4} are not enumerated
because we set \code{max.levels} to 5. The \code{max.levels} parameter defaults
to 15, so an integer variable with fewer than 15 values would ordinarily have
each value displayed. For this example, the display of that variable
key was too wide for the page, so we reduced the number of values.
When the observed number of scores is above \code{max.levels}, the
key does not try to list the individual values (compare the treatment
of variables \code{x4} and \code{x6}). 

<<key0, include=F>>=
library(kutils)
library(xtable)
@

<<key10, echo=F>>=
key_wide <- keyTemplate(mydf, max.levels = 5)
@

A long key template is displayed in Table \ref{tab:Long-Key}. The
benefit of the long key is that the cells \code{value\_old} and \code{value\_new}
are easier to navigate.

<<key30,echo=F>>=
key_long <- keyTemplate(mydf, long = TRUE, max.levels = 5)
@

\begin{table}[h]
\caption{The Wide Key Template\label{tab:The-Wide-Key}}

\def\Sweavesize{\tiny}
<<key20,echo=FALSE,results=tex>>=
print(xtable(key_wide), include.rownames = FALSE, size = "small", floating = FALSE )
@
\end{table}

\begin{table}
\caption{The Long Key Template\label{tab:Long-Key}}

\def\Sweavesize{\tiny}
<<key40,echo=FALSE,results=tex>>=
print(xtable(key_long), include.rownames = FALSE, size = "small", floating = FALSE)
@
\end{table}

The value of \code{class\_old} in the key is the first element in
the return from the function class for a variable. There is one exception,
where we have tried to differentiate integer variables from numeric
variables. This is a confusing issue in the history of R, as discussed
in the R help page for the function \code{as.double}. In the \emph{Note
on names} section, that page explains an ``anomaly'' in the usage
of term \code{numeric}. The R function \code{as.numeric} creates
a double precision floating point value, not an integer. However,
the \code{is.numeric} function responds \code{TRUE} for both integers and floating
point values. For purposes of editing the key, it is useful to differentiate
integers from floating point numbers. \code{kutils} includes a function
named \code{safeInteger}. It checks the observed values of a variable
to find out if any floating point values are present. If the aggregate
deviations from integer values are minuscule, then a variable is classified
as an integer. As a result, the keyTemplate function's column \code{class\_old}
should be ``integer'' or ``numeric'', and by the latter we mean
a floating point number. 
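
The classification can be checked directly (a sketch; see the
\code{safeInteger} help page for the exact return convention):

<<eval=F>>=
## Whole numbers stored as doubles can be treated safely as integers;
## a genuine fractional value blocks the conversion.
safeInteger(c(1, 2, 3))    # safe: no fractional parts
safeInteger(c(1.1, 2, 3))  # not safe: 1.1 deviates from an integer
@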

In some of our early projects, the variable key was in the wide format.
Difficulty in editing that caused us to shift some projects to the
long key. The idea that we would glide back and forth between keys
created in the wide and long formats dawned on us only recently. To
ease the conversion back-and-forth between the formats, we developed
the functions named \code{wide2long} and \code{long2wide}. We believe
these functions work effectively, but we have experienced some troubles
related to the way spreadsheets store character strings. If the key
in long format has a column of values ``Yes'', ``No'', and ``'',
the wide representation should be ``Yes|No|'', but there is some
inclination to say we should have nested quotation marks, as in
\code{\textquotedbl Yes\textquotedbl{}|\textquotedbl No\textquotedbl{}|\textquotedbl\textquotedbl}.
That kind of string will not generally survive importation to and
export from a spreadsheet at the current time.
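
When the spreadsheet hazards are avoided, the conversion itself is a
one-line affair (a sketch using the templates created above):

<<eval=F>>=
## Convert the wide template to long format and back again
key_long2 <- wide2long(key_wide)
key_wide2 <- long2wide(key_long2)
@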

\subsection{Step 2. Edit the variable key}

If the file argument was specified in \code{keyTemplate}, the work
is laid out for us. One can edit a csv file in any text editor or
in a spreadsheet. An xlsx file can be edited by Libre Office or Microsoft
Office. 

It is not necessary to change all of the values in name\_new, class\_new,
and value\_new. In fact, one might change just a few elements and
the un-altered variables will remain as they were when the data is
re-imported. We suggest users start small by making a few edits in
the key. A principal investigator might change just a few variable
names or values. In a large project, there may be quite a bit of work
involved. 

The \code{name\_old} column must never be edited. Generally, \code{class\_old}
and \code{value\_old} will not be edited either (the only exception
might arise if \code{class\_new} is either ``\code{factor}'' or
``\code{ordered}''). The \code{name\_new} column should include
legal R variable names (do not begin name\_new with a numeral or include
mathematical symbols like ``+'' or ``-''). We use R's \code{make.names}
function to clean up errant entries, so incorrect names are not fatal.
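
For example:

<<eval=F>>=
## Illegal entries are repaired in the style of make.names
make.names(c("2wheels", "miles+per+hour"))
## [1] "X2wheels"       "miles.per.hour"
@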

The difficult user decisions will concern the contents of class\_new
and value\_new. The desired variable type, of course, influences the
values that are meaningful. To convert a character variable to integer,
for example, it should go without saying that the value\_new element
should include integer values, not text strings.

The conversion of information from one type of variable into another
may be more complicated than it seems. It is a bit more tricky to
convert a factor into a numeric variable that uses the factor's levels
as numeric values. 

After experimenting with a number of cases, we believe that if \code{class\_old}
and \code{class\_new} are elements of the \emph{safe class set}:
\code{character}, \code{logical}, \code{integer}, \code{numeric}
(same as \code{double}), \code{factor}, or \code{ordered}, then
the re-importation and recoding of data will be more-or-less automatic.
If \code{class\_new} differs from \code{class\_old}, and \code{class\_new}
is not an element in that 6 element set, then the user must supply
a recode function that creates a variable of the requested class.
Most commonly, we expect that will be used to incorporate \code{Date}
variables.

The enumerated values in the \code{value\_new} column should be specified
in the more or less obvious way. If \code{class\_new} is equal to
\code{character}, \code{factor}, or \code{ordered}, then the new
values can be arbitrary strings. 

The \code{missings} and \code{recodes} columns are empty in the
key template. The user will need to fill in those values if they are
to be used. When the key is later put to use, the order of processing
will be as follows. First, the values declared as missings will be
applied (convert observed value to R's NA symbol). Second, if there
is a recode function in the key, it is applied to a variable. Third,
if there was no recode function supplied, then the conversion of discrete
values by recalculation from \code{value\_old} into \code{value\_new}
will be applied. Note that the discrete values are applied only if
the recode cell is empty in the key.

The decision of whether to approach a given variable via value enumeration
or a recode function is, to an extent, simply a matter of taste. Some
chores might be handled either way. If a variable includes
floating point numbers (temperatures, dollar values, etc), then we
would not rely on new assignments in \code{value\_old} and \code{value\_new}.
Truly numeric variables of that sort almost certainly call for assignment
of missings and recodes by the last two cells in the variable key.
However, if a column includes integers or characters (1 = male, 2
= female), one might use the enumerated values (\code{value\_old}
and \code{value\_new}) or one could design a recode function to produce
the same result. It is important to remember that if a recode function
is applied, the enumerated value recoding is not. If one decides to
use a recode statement, then the elements in value\_old and value\_new
are ignored entirely; they can be deleted manually to simplify the
key. (That is to say, the \code{max.levels} parameter is just a way
of guessing how many unique levels is ``too many'' for an enumeration.
Users are free to delete values if recodes are used.)

Despite the possibility that a factor (or ordered) variable may have
many values, we believe that all of the levels of those variables
should usually be included in the key. If a variable is declared
as a factor, it means the researcher has assigned meaning to the various
observed values and we are reluctant to ignore them. 

There is a more important reason to enumerate all of the legal values
for factor variables. If a value is omitted from the key, that value
will be omitted from the dataset. 

Among our users, we find opinion is roughly balanced between the long
and the wide key formats. One might simply try both. If the number
of observed values is more than 5 or 10, editing the key in a program
like Microsoft Excel is less error-prone in the long format. This is
simply a matter of taste, however. The disadvantage of the long format
is that it is somewhat verbose, with repeated entries in the name and
class columns. If an editor makes an error in the assignment of a block,
then hard-to-find errors may result.

Because editing the key can be a rather involved process, we will
wait to discuss the details until section \ref{sec:Editing-the-key}.

\subsection{Step 3. keyImport}

Once any desired changes are entered in the variable key, the file
needs to be imported back into R. For that purpose, we supply the
\code{keyImport} function. As in \code{keyTemplate}, the file argument's
suffix is used to discern whether the input should be read as .csv,
.xlsx, or .rds. It is not necessary to specify that the key being
imported is in the long or wide format. \code{keyImport} includes
heuristics that have classified user-edited keys very accurately. 
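
For the key template saved in Step 1, the call is simply:

<<eval=F>>=
## Format (csv/xlsx/rds) and layout (wide/long) are detected automatically
mydf.keylist <- keyImport("key_wide.csv")
@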

The returned value is an R data frame that should be very similar
to the template, except that the new values of \code{name\_new},
\code{class\_new}, and \code{value\_new} will be visible. 

In order to test this function with the \code{kutils} package, we
include some variable keys. The usage of those keys is demonstrated
in the help page for \code{keyImport}. In addition to the mydf toy
data frame created above, we also include a subset of the US National
Longitudinal Survey in a data frame named \code{natlongsurv}.

\subsection{Step 4. Apply the imported key to the data}

The final step is to apply the key to the data frame (or some other
data frame that may have arrived in the interim). The syntax is simple:

<<eval=F>>=
mydf.cleaned <- keyApply(mydf, mydf.keylist)
@

Because the default value of the argument \code{diagnostic} is TRUE,
the output from \code{keyApply} is somewhat verbose. After we have
more feedback from test users, we will be able to quiet some of that
output. 

The diagnostic output will include information about mismatch between
the key and the data itself. If variables included in the
key are not present in the new data set, there will not be an
error, but a gentle warning will appear. Similarly, if the observed
values of an enumerated variable are not included in the variable
key, there will be a warning. 

The diagnostic will also create a cross tabulation of each new variable
against its older counterpart. This works very well with discrete
variables with 10 or so values, but for variables with more values
it is rather unmanageable. 

\section{Editing the variable key}

\label{sec:Editing-the-key}

The work of revising the variable key can be driven by the separation
of variables into two types. The variables with enumerated values–the
ones for which we intend to rename or re-assign values one-by-one–are
treated in a very different way than the others. The enumerated
value strategy works well with variables for which we simply need
to rename categories (e.g., ``cons'' to ``Conservative''). Variables
for which we do not do so (e.g., converting Fahrenheit to Celsius) are
treated differently.

As we will see in this section, the revision of variables of the enumerated
value type emphasizes the revision of the value\_old and value\_new
columns in the key. On the other hand, the other types will depend
on writing correctly formatted statements in the recode column of
the variable key.

\subsection{Enumerated variables}

All of the values observed for \code{logical}, \code{factor}, and
\code{ordered} variables will appear in the key template. Do not
delete them unless the exclusion of those values from the analysis
is intended. For character and integer variables with fewer than \code{max.levels}
discrete values, the observed scores will be included in \code{value\_old}.
If one wishes to convert a variable from being treated as an enumerated
to a numeric type, then one can delete all values from \code{value\_old}
and \code{value\_new}. 

The recoding of discrete variables is a fairly obvious chore. For
each old value, a new value must be specified. We first consider the
case of a variable that enters as a character variable but we might
like to recode it and also create factor and integer variants of it.
In the \code{mydf} variable key (Table \ref{tab:The-Wide-Key}),
we have variable \code{x2} which is coded \code{a} through \code{f}.
We demonstrate ways to spawn new character, factor, or integer variables
in Table \ref{tab:Change-Type-Example1}. As long as name\_old is
preserved, as many lines as desired can be used to create variables
of different types. Here we show the middle section of the revised
key in which we have spawned 3 new variants of x2, each with its own
name.

\begin{table}[H]
\caption{Change Class, Example 1\label{tab:Change-Type-Example1}}

\begin{tabular}{cccccc}
\hline 
name\_old & name\_new & class\_old & class\_new & value\_old & value\_new\tabularnewline
\hline 
x2 & x2.char & character & character & a|b|c|d|f & Excellent|Proficient|Good|Fair|Poor\tabularnewline
x2 & x2.fac & character & factor & a|b|c|d|f & Excellent|Proficient|Good|Fair|Poor\tabularnewline
x2 & x2.gpa & character & integer & a|b|c|d|f & 4|3|2|1|0\tabularnewline
\hline 
\end{tabular}
\end{table}

In line one of Table \ref{tab:Change-Type-Example1}, the class \code{character}
remains the same. That line will produce a new character variable
with embellished values. Line two demonstrates how to create an R
factor variable, \code{x2.fac}, and line three converts the character
to an integer variable. Remember that it is important to match the
value of \code{class\_new} with the content proposed for \code{value\_new}.
Do not include character values in a variable for which the new class
will be numeric or integer. 

Similarly, it is easy to see how an integer input can be converted
into either an integer, character, or factor variable by employing
any of the rows seen in Table \ref{tab:Change-Type-Example2}.

\begin{table}[H]
\caption{Change Class Example 2\label{tab:Change-Type-Example2}}

\begin{tabular}{cccccc}
\hline 
name\_old & name\_new & class\_old & class\_new & value\_old & value\_new\tabularnewline
\hline 
x6 & x6.i100 & integer & integer & 1|2|3|4|5 & 100|200|300|400|500\tabularnewline
x6 & x6.c & integer & character & 1|2|3|4|5 & Austin|Denver|Nashville|Provo|Miami\tabularnewline
x6 & x6.f & integer & factor & 1|2|3|4|5 & F|D|C|B|A\tabularnewline
\hline 
\end{tabular}
\end{table}

If a variable's \code{class\_old} is ordered, and we simply want
to relabel the existing levels, the work is also easy (see Table \ref{tab:Change-Type-Example3}).
The second row in Table \ref{tab:Change-Type-Example3} shows that
factor levels can be ``combined'' by assigning the same character
string to several ``<'' separated values. 

\begin{table}[H]
\caption{Change Class, Example 3\label{tab:Change-Type-Example3}}

\begin{tabular}{cccccc}
\hline 
name\_old & name\_new & class\_old & class\_new & value\_old & value\_new\tabularnewline
\hline 
x7 & x7.grades & ordered & ordered & f<d<c<b<a & F<D<C<B<A\tabularnewline
x7 & x7.passfail & ordered & ordered & f<d<c<b<a & Fail<Fail<Pass<Pass<Pass\tabularnewline
x7 & x7.gpa & ordered & integer & f<d<c<b<a & 0|1|2|3|4\tabularnewline
\hline 
\end{tabular}
\end{table}

Working with ordered variables, whether as input or output, becomes
more complicated if the existing data are not ordered in the way we
want. In the \code{mydf} example, the variable \code{mydf\$x3} was
coded as an ordered variable with levels (``med'', ``lo'', ``hi'').
That might have been one person's idea of a joke, so we need to rearrange
these as (``lo'', ``med'', ``hi''). If the original ordering
of the values is not consistent with the desired ordering of the new
ordered factor, then we need notation that allows researchers to achieve
two purposes. First, the values must be re-leveled. Second, the notation
must allow for the possibility that the new values are relabeled as well.
We'd rather not proliferate new columns in the variable key or create
some confusing new notation.

Reordering variable levels requires us to do something that seems
dangerous. We need to edit the \code{value\_old} to correct the ordering
of the levels \emph{as they are currently labeled}. This is the only
time where we suggest that users edit the \code{value\_old} column.
In \code{value\_new}, supply new labels in the correct order
to parallel the newly edited \code{value\_old} column (see Table
\ref{tab:Reorder-Values}).

\begin{table}[H]
\caption{Reorder Values, Example 1\label{tab:Reorder-Values}}

\begin{tabular}{cccccc}
\hline 
name\_old & name\_new & class\_old & class\_new & value\_old & value\_new\tabularnewline
\hline 
x3 & x3.lo2hi & ordered & ordered & lo<med<hi & low<medium<high\tabularnewline
x3 & x3.passfail & ordered & ordered & lo<med<hi & low<pass<pass\tabularnewline
\hline 
\end{tabular}
\end{table}

The key importer will check that all ``duplicated levels'' are adjacent
to one another, so that the values above low are grouped together.

In the long key format, the equivalent information would be conveyed
by altering the ordering of the rows. For example, it is necessary
to re-order the rows to indicate that lo is lower than med, and then
for the new values we put in the desired names (see Table \ref{tab:Reorder-Values-Example2}).

\begin{table}[H]
\caption{Reorder Values, Example 2\label{tab:Reorder-Values-Example2}}

\begin{tabular}{llllll}
\toprule
name\_old & name\_new & class\_old & class\_new & value\_old & value\_new\\
\midrule
x3 & x3.lo2hi & ordered & ordered & lo & low\\
x3 & x3.lo2hi & ordered & ordered & med & medium\\
x3 & x3.lo2hi & ordered & ordered & hi & high\\
x3 & x3.passfail & ordered & ordered & lo & low\\
x3 & x3.passfail & ordered & ordered & med & pass\\
x3 & x3.passfail & ordered & ordered & hi & pass\\
\bottomrule
\end{tabular}
\end{table}

What should be done about missing values? Even within our small group, there
is some disagreement about this. The SAS tradition would have us enter
a period, ``.'', in value\_new for a level that is to be treated
as missing. Others are tempted to use special purpose character strings
like ``NA'' or ``N/A''. As long as neither ``NA'' nor ``N/A''
are legal values, that seems safe. Another school of thought argues
that if a user wants nothing, the clearest, safest thing
to do in the key is to enter nothing at all. 

As a result of this diversity, the \code{keyImport} function includes an
argument named \code{na.strings} (again, an argument drawn from R core functions).
The default setting has an ``all of the above'' flavor, treating
any of the values that any of us think might be missing as NA when
the key is later put to use. If one is fastidious in editing the key
and represents all value\_new that should be missing as a period (hooray
for SAS!), then the value of na.strings can be tightened up in the
obvious way. 

It is also possible to delete a level from value\_old and value\_new
in order to indicate that it should be treated as missing, but we
do not recommend this approach. This destroys an element of ``book
keeping'' because future users of the data, and the key, might like
to know that some levels were obliterated.

The \code{missings} cell in the key can also be used with these enumerated
variables. Usually, writing out the values to be omitted there is not as easy
as putting the desired missing symbol in the value\_new column, but
either method should work.

In a long format key, there will be many repeated rows for each variable.
The entries in the \code{missings} and \code{recodes} columns are
harvested from the cells corresponding to each combination of name\_old
and name\_new; it is not necessary to retype the \code{missings}
and \code{recodes} entries for each element.

The work of assigning missing values is carried out by a \code{kutils}
function named \code{assignMissing}. The help page for that function
has verbose commentary to explain the format of the cells that might
work well in the missings column.


\subsection{About non-enumerated variables}

In the process that creates the variable key template, we think differently
about the enumerated variables–ones that have meaningfully discrete
values that can be reassigned one-by-one–and numeric variables. Numeric
variables, whether we literally mean integers like age or the number
of pennies in a jar, or floating point numbers (weight, temperature,
volume, etc) are different. If we guessed right with \code{max.levels}
when the key was created, those non-discrete variables are included
as just one row, whether the key is wide or long. For these variables,
the main elements of interest are the columns
labeled \code{missings} and \code{recodes}. 

The instructions for specifying missing values are detailed in the
help page of the \code{assignMissing} function in \code{kutils}.
If the class\_old is \code{numeric} or \code{integer}, there are
only three types of statements allowed in the \code{missings} column.
Legal values must must begin with the characters ``<'', ``>'',
or ``c''. These are illustrated in Table \ref{tab:Recode-Examples}.
The symbols ``<='' and ``>='' are accepted in the obvious way. 

\begin{table}[H]
\caption{Missings Examples\label{tab:Recode-Examples}}

\begin{tabular}{|c|c|c|}
\hline 
missings & interpretation: NA will be assigned to  & example\tabularnewline
\hline 
> t & values greater than t & > 99\tabularnewline
>= t & values greater than or equal to t  & >=99\tabularnewline
<t & values less than t  & <0\tabularnewline
<=t & values less than or equal to t  & <=0\tabularnewline
c(a,b) & values equal to or greater than a and less than or equal to b  & c(-999,-1)\tabularnewline
\hline 
\end{tabular}
\end{table}
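
The same syntax can be tried directly on a vector (a sketch, assuming
the \code{missings} argument accepts the cell text; the
\code{assignMissing} help page documents the exact format):

<<eval=F>>=
## Values greater than 99 become NA, matching the "> 99" row above
x <- c(1, 5, 999, -1)
assignMissing(x, missings = "> 99")
@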

The key specification for \code{recodes} is discussed in the help
page of the \code{assignRecode} function in \code{kutils}. The \code{recodes}
column takes R code and applies it to the desired variable. For example,
if one wanted to transform a variable by taking its square root, this
could be done by providing ``sqrt(x)'' in the \code{recodes} column.
Here ``x'' is simply a placeholder where the name of the variable
indicated in the \code{name\_new} column will be inserted when the
variable key is applied. 
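
The substitution can be previewed directly (a sketch, assuming
\code{assignRecode}'s \code{recode} argument accepts the cell text;
see its help page):

<<eval=F>>=
## "x" in the recode string stands for the variable being recoded
x <- c(1, 4, 9)
assignRecode(x, recode = "sqrt(x)")
@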

\begin{table}[H]
\caption{Recoding Integer and Numeric Variables}

{\footnotesize

\begin{tabular}{lllllllll}
  \toprule
name\_old & name\_new & class\_old & class\_new & value\_old & value\_new & missings & recodes \\
  \midrule
x5 & x5 & numeric & numeric &  &  & <0 & log(x) \\
x4 & x4 & numeric & numeric &  &  & c(-999) & abs(x) \\
x6 & x6 & integer & integer & 1$|$2$|$3$|$4$|$5 & 1$|$2$|$3$|$4$|$5 & c(-9) & \\
   \bottomrule
\end{tabular}
}
\end{table}


\subsection{Class conversions}

The problem of recoding a variable, but leaving its class the same,
is mostly solved. 

The conversion of variables from one class to another requires special
care. R's built in functions for coercion of variables from one type
to another succeed meaningfully in many cases, but not all. The coercion
of an integer to a floating point number produces understandable effects.
The coercion of a character to an integer is not always understandable,
and the conversion of factors to integers or numeric is, well, almost
always a source of concern.

We are intending to support, at minimum, the 6 safe classes, which
can be used in any combination of \code{value\_old} and \code{value\_new}.
We need to be sure that conversion from each one into the other types
is handled accurately. 

In the \code{keyApply} function, we have implemented special purpose
code to handle a class change from factor to numeric, for example,
in a way that preserves the levels as the new values. This is needed
when the input data has values like 10, 12, 14, and ``unknown''.
If an R function like \code{read.table} is used to import a column with those
values, the default behavior will convert that to a factor variable
with values ``10'', ``12'', ``14'', and ``unknown''. What the
user probably wants–and what we do now–is to open the possibility that
``unknown'' should be treated as NA and the values that appear to
be numbers (but are actually text strings) are converted into integer
format.

\subsection{Unanticipated \protect\code{class\_new} values}

By far, the most troublesome variable type is date and time information.
This is difficult for a number of reasons. In order from less troublesome
to more troublesome:
\begin{enumerate}
\item No two people seem to represent dates with the same style. Virtuous
people (the authors) write 2016-12-01, while others write ambiguous
strings like 12/01/2016, 01/12/2016 or 01-12-2016.
\item Computer programs absorb dates and convert them to internal numeric
schemes. However, programs differ in the way they record and export
date information. Most software stores dates as integers, for example.
The value is the number of days since time began. 

Unfortunately, it appears that the day on which time began differs
among programs. Microsoft Excel records dates as integers, and (shockingly!)
the origin date differs between Excel files created in Windows and
Macintosh. Mac designers were perplexed by the problem of leap years,
which caused them to use the origin 1904-01-01, while in MS Windows
Excel dates begin at 1900-01-01. The program SPSS uses integers for
dates as well, but the origin for SPSS is 1582-10-14 (we guess most
readers already knew that one; we apologize for being surprised). 

On the other hand, dates on Unix/Linux systems tend to follow the
POSIX standard, so date information is interpreted as the number of
days since 1970-01-01. That is a local favorite because it reminds
us of Nixon and Vietnam.
\item Dates are, to a certain extent, arbitrary without time zone information.
It's Tuesday, December 8 in Lawrence, Kansas, but it is Wednesday,
December 9 in Moscow, Russia. If time zones are not specified, then
all date information is, well, approximate. Furthermore, when scholars
try to put time zone information onto existing data that was collected
without it, they sometimes accidentally create impossible dates and
times (February 29, 1999, for example). When R functions like \code{as.Date}
encounter impossible dates, they return NA.
\item Even when programs seem to have absorbed dates and saved the information,
even in casual testing we have found that the saved information is
not retrieved in a consistent way. Some spreadsheet programs, on some
computers, return date values as integers, while the same program
with a different user returns a character string. 
\end{enumerate}
All of this complaining is just our way of saying, ``we left the
recode box open, why not use it to recode your date information?''
If the variable key template says that a column is ``integer,''
but you expected a date, then all is likely to be fine if we set the
origin. One needs to find out what the origin should be and tell R
about it. The right recode statement is likely to be \code{as.Date(x, origin = \textquotedbl 1970-01-01\textquotedbl )}.
On the other hand, if the date information is in text format, with
values like ``2016-11-19'', then it is necessary to learn the shorthand
notation for date symbols. The right recode will be something like
\code{as.Date(x, format = \textquotedbl\%Y-\%m-\%d\textquotedbl )}.
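
Both recode statements can be tried at the console before they are
entered in the key:

<<eval=F>>=
## An integer day count with the POSIX origin, and the same date as text
as.Date(17124, origin = "1970-01-01")        # "2016-11-19"
as.Date("2016-11-19", format = "%Y-%m-%d")
@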

If one sets \code{class\_new} = \code{Date}, the \code{keyApply}
function will succeed if the return value from the recode function
is an R Date variable. Otherwise, an error will be raised.

\subsection{Where more work should be done}

For the most part, our work now lies in development of better diagnostics
and error-checking routines. 

First, we need better filtering of user-created key entries. User
typographical errors occur in key preparation and more comprehensive
filtering should be done. At the current time, the \code{keyImport}
function does not include many subroutines for the validation of the
key. We'd like to filter the values requested in \code{class\_new},
for example, but it is difficult to see how that can be done. We want
to leave open the possibility that users may specify values of \code{class\_new}
that are unfamiliar to us. Another problem may arise if the elements
in \code{value\_old} and \code{value\_new} are not aligned
one-to-one. There will be an error when the key is put to use by \code{keyApply},
but we should have a stop message in \code{keyImport}.

Second, we need to better understand the problems that result when
users enter variable keys in programs like MS Excel or Libre Office.
Between programs, or versions of programs, or programs on various
platforms, we notice disappointing inconsistencies. If keys are kept
in ``flat text'' csv files, there is less danger, but there can
still be some trouble because spread sheets change the way they export
to csv from time to time. During the testing process, we have found
that version updates of Excel cause unexpected changes in the way character
vectors and dates are stored. Until now, we have relied on the very
popular package \code{openxlsx} (\citealp{Walker2015}). This does
well most of the time, but we still find some inexplicable variations
in imported Excel files. Sometimes, empty cells in the variable key
are imported as empty strings ``'', while at other times they are
imported as R NA symbols. While it is probably safe to treat an entirely
empty value as a missing value in most projects, we'd rather not do
so. To deal with this problem, we need to develop a standard framework
for testing the quality of imported variable keys. 

Third, we need more comprehensive error-checking for the accuracy
of the imported data and compliance with the variable key. The variable
key system works well with the data sets for which it was created,
but we need more formal criteria for verifying that claim. 

\section{Discussion}

When a project has a small budget, we invite the principal investigator
to economize on the expenses by filling out the variable key's \code{name\_new},
\code{class\_new}, and \code{value\_new} columns. There are several
benefits in inviting the clients (or PIs) to be directly involved
in filling in the variable key. Most importantly, they are allowed
to name the variables in any way that is meaningful to them. When
statistical results are obtained, it is never necessary for them to
ask, ``what did you mean by this variable \code{occupation}?''
There are other benefits, however. By making the principal investigator
aware of the values that are actually observed, and by offering the
opportunity to specify how they ought to be recoded, a substantial
element of administrative slippage is ameliorated. The variable key
will specify exactly how categories are to be re-mapped, so there is
much less danger of an accident buried in thousands of lines of recodes.

It often happens that the raw data to be imported are provided by
one of the national data centers. The variables are given exciting,
meaningful column names like V34342a. It appears to be almost certain
that research assistants will conclude that these names are not meaningful,
so they create names that are more meaningful to them, such as \emph{gender},
\emph{sex}, \emph{male}, \emph{female}, or whatnot. The research assistants
disappear into a haze of code and come out talking about the effect
of income, gender, and education on educational achievement, and the
principal investigator has to say, ``which of those variables is
income, again?'' and ``what's the coding on education?'' A very
exciting conversation then follows as one of the research assistants
realizes that V34342b is the one that should have been used for gender,
while V34342a indicates if the respondent ever visited Eastern Europe.

The variable key is intended to create neutral territory between principal
investigators and project supervisors on one side and the research
assistants on the other. The numbers are more likely to come out correctly
if the names, values, and classes can be specified in one document
that is accessible in numerous formats on which many eyes can be laid.
The distance between the re-design of imported values and its implementation
is minimized, and the production cycle of the projects is accelerated. As we are
frequently reminded, there are many bright people who don't know
R; it appears they are all fluent in spreadsheet.

\bibliographystyle{chicago}
\bibliography{kutils}

\end{document}