---
title: Importing data from SPSS and Stata
output: rmarkdown::html_vignette
vignette: >
% \VignetteIndexEntry{Importing data from SPSS and Stata}
% \VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r,echo=FALSE,message=FALSE}
knitr::opts_chunk$set(comment=NA,eval=FALSE)
```
# Motivation
The \"foreign\" package for *R* already provides facilities to import
data from other statistical software packages such as SPSS or Stata, but
they are limited by the way survey data are generally represented in
*R*. Since variables in an *R* data frame are typically numeric
vectors or factors, any direct translation of SPSS or Stata data sets
into data frames leads to a loss of information, such as variable
labels, value labels, or user-specified missing
values. (Value labels can be preserved by translating them into factor
levels, but this means losing information about the original codes. It
also leads to undesired missing values if variables in the original
data sets are only partially labelled.) For this reason, the \"memisc\"
package provides functions that import SPSS or Stata data sets
into objects of the class `"data.set"`, which the package defines.
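The information loss described above can be reproduced with base *R* alone. The following sketch uses made-up codes and labels (they are not taken from any of the data sets discussed here), in which the code `3` carries no value label:

``` r
# Hypothetical survey codes: 9 stands for "refused", 3 has no value label
codes <- c(1, 2, 2, 9, 3)

# Translate the value labels into factor levels, as a direct import would
f <- factor(codes,
            levels = c(1, 2, 9),
            labels = c("yes", "no", "refused"))

# The unlabelled code 3 has turned into a missing value
is.na(f)

# The original codes are gone: "refused" is now level 3, no longer code 9
as.integer(f)
```

Both problems stem from the fact that a factor stores only level positions, not the codes that the labels were attached to.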
# The role of \"importer\" objects
Importing data using the facilities provided by the \"memisc\" package
consists of two steps. In the first step, a description of the data in
the file is collected in an object of class \"importer\". In the second
step, data are imported into \"data.set\" objects with the help of these
\"importer\" objects. These \"importer\" objects contain only meta-data,
e.g. about variable labels, value labels, and user-defined missing
values. This makes it possible to get an overview of the structure of
the file without loading the complete data, which is especially
advantageous if the data set is large. For example, with the help of an
\"importer\" object it is possible to see what the labels of the
variables are, so that one can select those variables from the data file
that are actually needed. If `imprtr` is an importer object, the data
set object in *R* memory can then be created by calls like
`subset(imprtr,...)`, `imprtr[...]` or `as.data.set(imprtr)`. Some
examples are given in the following.
Note that these examples require data not included in the package
(you need to register with [GESIS](https://www.gesis.org) to download the data).
The vignette code cannot be run without this additional data.
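For readers without access to the GESIS data, the two steps can be tried on a small file. The sketch below assumes that the installed version of \"memisc\" ships a zipped SPSS \"portable\" file of the 1948 American National Election Study under `anes/NES1948.ZIP`, as used in the examples of the package's help pages. Both the file location and the variable names `v480018` and `v480029` are assumptions and should be checked against `names(nes1948)`.

``` r
library(memisc)

# Assumption: the package ships this archive; 'zipfile' is "" if it does not
zipfile <- system.file("anes/NES1948.ZIP", package = "memisc")
nes1948.por <- unzip(zipfile, "NES1948.POR", exdir = tempfile())

# Step 1: create the importer object; only meta-data are collected here
nes1948 <- spss.portable.file(nes1948.por)
head(names(nes1948))

# Step 2: load only the variables that are needed into a "data.set" object
# (v480018 and v480029 are assumed to exist in the file)
vote48 <- subset(nes1948, select = c(v480018, v480029))
description(vote48)
```

Using `as.data.set(nes1948)` instead of `subset()` in the second step would load all variables in the file.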
# Importing data from SPSS files
## Importing from an SPSS \"system\" file
In order to import data from an SPSS \"system\" file, the binary
format in which SPSS data are now usually saved and often distributed,
one first needs to make the file that contains the data known to *R*, as
in the following example:
``` r
library(memisc)
ZA5702 <- spss.system.file("Data/ZA5702_v2-0-0.sav")
ZA5702
```
```
SPSS system file 'Data/ZA5702_v2-0-0.sav'
    with 979 variables and 3911 observations
```
Once the \"system\" file is declared using the function
`spss.system.file()`, metadata become available, such as the number of
cases and variables (as just seen) and the names and labels of the
variables (as seen below):
``` r
description(ZA5702)
```
```
 study    'Studiennummer'
 version  'GESIS Archiv Version'
 year     'Erhebungsjahr'
 field    'Erhebungszeitraum'
 glescomp 'GLES-Komponente'
 survey   'Erhebung/Welle'
 lfdn     'Laufende Nummer (Kumulation)'
 vlfdn    'Laufende Nummer (Vorwahl)'
 nlfdn    'Laufende Nummer (Nachwahl)'
 datum    'Datum der Befragung (Monat/Tag/Jahr)'
```
(Only an extract of the full output is shown here, since the data set
contains as many as 979 variables.)
An \"importer\" object, such as `ZA5702` in this example, also
makes it possible to obtain a full codebook with
``` r
codebook(ZA5702)
```
but we refrain from showing such a codebook here, because the output
would be far too long. As the inspection of the data in the file
shows, most variable names have a standardised, yet non-mnemonic
structure. Variables referring to questions asked in the pre-election
wave of the GLES 2013 study have names starting with \"`v`\", those
referring to questions asked in the post-election wave have names
starting with \"`n`\", while those referring to questions asked in both
waves have names starting with \"`nv`\". For a specific analysis, such
variable names are not very useful. For this reason we want to rename
them. We could do this after loading the data, but it is more convenient
to do the data import and the renaming in one step, as in the example
below:
``` r
gles2013work <- subset(ZA5702,
                       select=c(
                           wave                  = survey,
                           intent.turnout        = v10,
                           turnout               = n10,
                           voteint.candidate     = v11aa,
                           voteint.list          = v11ba,
                           postal.vote.candidate = v12aa,
                           postal.vote.list      = v12ba,
                           vote.candidate        = n11aa,
                           vote.list             = n11ba,
                           bula                  = bl
                       ))
```
The variable names to the left of the equality sign are the variable
names as they will appear in the data set after import, while the
variable names to the right of the equality sign are the variable names
as they exist in the data file.
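Judging from the names used in the call above, `v10` holds the turnout intention from the pre-election wave and `n10` the reported turnout from the post-election wave. The following base *R* sketch shows how such prefixes can be used to sort variable names by wave. The vector `vars` merely repeats the names from the call above; with an importer object one would start from `names(ZA5702)` instead.

``` r
# Names as they exist in the data file (copied from the call above)
vars <- c("survey", "v10", "n10", "v11aa", "v11ba",
          "v12aa", "v12ba", "n11aa", "n11ba", "bl")

# Names that consist of a "v" or an "n" followed by a digit
pre.election  <- grep("^v[0-9]", vars, value = TRUE)
post.election <- grep("^n[0-9]", vars, value = TRUE)

pre.election
post.election
```

Requiring a digit after the prefix keeps administrative variables such as `survey` or `bl` out of both groups.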
As a demonstration of what information can be extracted from the data
file, we create a codebook for one of the items in the data set:
``` r
codebook(gles2013work$turnout)
```
```
================================================================================
gles2013work$turnout 'Wahlbeteiligung'
--------------------------------------------------------------------------------
Storage mode: double
Measurement: interval
Missing values: -Inf - -1
Values and labels                      N  Valid  Total
-99 M 'keine Angabe'                   3           0.1
-97 M 'trifft nicht zu'               20           0.5
-94 M 'nicht in Auswahlgesamtheit'  2003          51.2
  1   'ja, habe gewaehlt'           1596   84.7   40.8
  2   'nein, habe nicht gewaehlt'    289   15.3    7.4
Min: 1.000
Max: 2.000
Mean: 1.153
Std.Dev.: 0.360
```
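The two percentage columns in the codebook output above differ only in their denominator, which can be verified by hand. The sketch below retypes the counts from that output: \"Valid\" percentages are relative to the non-missing cases, \"Total\" percentages are relative to all cases.

``` r
# Counts copied from the codebook output above, named by their codes
counts <- c("-99" = 3, "-97" = 20, "-94" = 2003, "1" = 1596, "2" = 289)
missing.codes <- c("-99", "-97", "-94")   # the codes marked with "M"

valid.counts <- counts[!(names(counts) %in% missing.codes)]

# Percentages of valid cases (the "Valid" column)
round(100 * valid.counts / sum(valid.counts), 1)

# Percentages of all cases (the "Total" column);
# sum(counts) equals the 3911 observations reported for the file
round(100 * counts / sum(counts), 1)
```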
## Importing from an SPSS \"portable\" file
Data from SPSS \"portable\" files are imported in essentially the same
way as data from SPSS \"system\" files: The first step again is to make
the data set known to *R*:
``` r
ZA3861 <- spss.portable.file("Data/ZA3861.por",iconv=FALSE)
ZA3861
```
```
SPSS portable file 'Data/ZA3861.por'
    with 331 variables and 3263 observations
```
Since this file contains German umlauts (in contrast to the previous
example), we need to convert the character encoding of the value labels
etc. from \"Latin-1\" (the original encoding of the data) into the native
encoding of the system (unless the computer natively uses
\"Latin-1\" encoding and not, as most Mac and most Linux systems do, a
variant of UTF-8).
``` r
ZA3861 <- Iconv(ZA3861,from="latin1")
```
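What such a conversion does to a single label can be illustrated with the base *R* function `iconv()`. This is not the \"memisc\" function `Iconv()` used above, only a sketch of the same kind of re-encoding, and the example word is made up.

``` r
# "Stärke" encoded in Latin-1: the umlaut is the single byte 0xe4
latin1 <- rawToChar(as.raw(c(0x53, 0x74, 0xe4, 0x72, 0x6b, 0x65)))

# Re-encode to UTF-8, where the umlaut takes the two bytes 0xc3 0xa4
utf8 <- iconv(latin1, from = "latin1", to = "UTF-8")

# Compare the byte sequences before and after the conversion
charToRaw(latin1)
charToRaw(utf8)
```

Without the conversion, a UTF-8 system would meet the lone byte `0xe4`, which is not valid UTF-8, and the label would be displayed incorrectly.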
Importer objects created from \"portable\" files can be examined in the
same way as importer objects created from \"system\" files. For example,
we get a description of the variables in the data set (the variable
labels) and a codebook.
``` r
description(ZA3861)
```
```
 vvpnid   'Fallnummer'
 vsplitwo 'West-Ost-Kennung'
 vvornach 'Vor-/Nachwahl'
 vland    'Bundesland'
 v10      'Wirtschaftl. Lage allgemein'
 v20      'Wirtschaftl. Lage retrospektiv'
 v30      'Wirtschaftl. Lage prospektiv'
 v31      'Wichtigkeit Erst/Zweitstimme BTW (nicht 94)'
 v40      'Demokratiezufriedenheit'
 v50      'Staerke Politikinteresse'
```
To actually import the data and make them accessible for analysis, we can
(as above) use `as.data.set()` or `subset()`, as in this example:
``` r
work2002 <- subset(ZA3861,
                   select=c(
                       respid         = VVPNID,
                       split.wo       = VSPLITWO,
                       split.vor.nach = VVORNACH,
                       Bundesland     = VLAND,
                       Erststimme     = V69,
                       Zweitstimme    = V70,
                       Geschlecht     = VSEX,
                       GebMonat       = VMONAT,
                       GebJahr        = VJAHR,
                       Konfession     = VRELIG,
                       Kirchgang      = VKIRCHG,
                       Erwerbst       = VBERUFTG,
                       FrErwerbst     = VFRBERTG,
                       Beruf          = VBERUF,
                       Famstand       = VFAMSTDN,
                       Partner        = VPARTNER,
                       BildungP       = VPBILDGA,
                       BerufstP       = VPBERUFT,
                       FrBerufstP     = VPFBERTG,
                       BerufP         = VPBERUF,
                       ReprGewicht    = VGVWNW
                   )
            )
```
## Import from a fixed-width file accompanied by SPSS syntax
Data from more recent study components of the American National Election
Study come in fixed-width format, with some additional SPSS syntax
files that define columns, variable labels, value labels, and missing
values. \"memisc\" also provides an importer function for such data.
Naturally, this requires a little more information. In addition to the
actual data file, we also need a file with SPSS syntax specifying the
data columns. Optionally, syntax files that define variable labels, value
labels, and missing values can also be specified.
``` r
anes2008TS <- spss.fixed.file("Data/anes2008/anes2008TS_dat.txt",
                              columns.file="Data/anes2008/anes2008TS_col.sps",
                              varlab.file="Data/anes2008/anes2008TS_lab.sps",
                              codes.file="Data/anes2008/anes2008TS_cod.sps",
                              missval.file="Data/anes2008/anes2008TS_md.sps")
anes2008TS
```
```
SPSS fixed column file 'issues/anes2008/anes2008TS_dat.txt'
    with 1954 variables and 2322 observations
    with variable labels from file 'issues/anes2008/anes2008TS_lab.sps'
    with value labels from file 'issues/anes2008/anes2008TS_cod.sps'
    with missing value definitions from file 'issues/anes2008/anes2008TS_md.sps'
```
Further information about the data can now be obtained from the returned
importer object in the same way as from importer objects that describe
SPSS \"system\" or SPSS \"portable\" files. That is, we can use
`names()`, `description()`, and `codebook()`. To get the data into the
memory of *R* we can use (as above) the functions `as.data.set()` and
`subset()`.
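Why the column definitions are indispensable for fixed-width data is easy to see with a toy example. The sketch below uses neither \"memisc\" nor SPSS syntax. It relies on `read.fwf()` from base *R* and on three made-up records, with hypothetical variable names and widths playing the role of the columns file.

``` r
# Three made-up records: columns 1-3 hold an id, 4-5 the age, 6 a vote code
records <- c("001251", "002312", "003472")
datafile <- tempfile(fileext = ".txt")
writeLines(records, datafile)

# Only the widths tell where one variable ends and the next one begins
toy <- read.fwf(datafile,
                widths = c(3, 2, 1),
                col.names = c("id", "age", "vote"))
toy
```

Unlike this toy reader, the importer object discussed above also attaches variable labels, value labels, and missing-value definitions from the optional syntax files.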
# Importing data from a Stata file
Data from Stata files (up to Stata Version 12) can be imported in the
same way as data from SPSS files. The main differences are the function
used for the import and the fact that user-defined missing values do not
exist in Stata, as the following example shows:
``` r
library(memisc)
ZA5702.dta <- Stata.file("Data/ZA5702_v2-0-0.dta")
ZA5702.dta
```
```
Stata file 'Data/ZA5702_v2-0-0.dta'
    with 874 variables and 3911 observations
```
``` r
gles2013work.dta <- subset(ZA5702.dta,
                           select=c(
                               wave                  = survey,
                               intent.turnout        = v10,
                               turnout               = n10,
                               voteint.candidate     = v11aa,
                               voteint.list          = v11ba,
                               postal.vote.candidate = v12aa,
                               postal.vote.list      = v12ba,
                               vote.candidate        = n11aa,
                               vote.list             = n11ba,
                               bula                  = bl
                           ))
codebook(gles2013work.dta$turnout)
```
```
================================================================================
gles2013work.dta$turnout 'Wahlbeteiligung'
--------------------------------------------------------------------------------
Storage mode: integer
Measurement: nominal
Missing values: 100 - 127
Values and labels                           N  Percent
-99 'keine Angabe'                          3      0.1
-98 'weiss nicht'                           0      0.0
-97 'trifft nicht zu'                      20      0.5
-96 'Split'                                 0      0.0
-95 'nicht teilgenommen'                    0      0.0
-94 'nicht in Auswahlgesamtheit'         2003     51.2
-93 'Interview abgebrochen'                 0      0.0
-92 'Fehler in Daten'                       0      0.0
-86 'nicht wahlberechtigt'                  0      0.0
-85 'nicht waehlen'                         0      0.0
-84 'keine Erst-/Zweitstimme abgegeben'     0      0.0
-83 'ungueltig waehlen'                     0      0.0
-82 'keine andere Partei waehlen'           0      0.0
-81 'noch nicht entschieden'                0      0.0
-72 'nicht einzuschaetzen'                  0      0.0
-71 'nicht bekannt'                         0      0.0
  1 'ja, habe gewaehlt'                  1596     40.8
  2 'nein, habe nicht gewaehlt'           289      7.4
```
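In the codebook above, the negative codes appear as ordinary labelled values rather than as missing values. One way to restore that distinction after import is to declare the missing values by hand. The following is only a sketch under assumptions: it works on a small made-up item rather than on the imported variable, and it assumes that `as.item()`, `missing.values<-` and `is.missing()` behave as described in the \"memisc\" documentation on items.

``` r
library(memisc)

# A few made-up responses that use codes from the codebook above
turnout <- as.item(c(1, 2, -99, 1, -94, -97),
                   labels = c("keine Angabe"               = -99,
                              "trifft nicht zu"            = -97,
                              "nicht in Auswahlgesamtheit" = -94,
                              "ja, habe gewaehlt"          =   1,
                              "nein, habe nicht gewaehlt"  =   2))

# Declare the negative codes as user-defined missing values
missing.values(turnout) <- c(-99, -97, -94)

# Logical vector indicating which observations are now treated as missing
is.missing(turnout)

# The codebook should now mark the negative codes as missing
codebook(turnout)
```

The same assignment applied to `gles2013work.dta$turnout` would be the natural next step, but this has not been verified against the GESIS file.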