1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
|
@c PSPP - a program for statistical analysis.
@c Copyright (C) 2019 Free Software Foundation, Inc.
@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3
@c or any later version published by the Free Software Foundation;
@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
@c A copy of the license is included in the section entitled "GNU
@c Free Documentation License".
@c
@node SPSS/PC+ System File Format
@chapter SPSS/PC+ System File Format
SPSS/PC+, first released in 1984, was a simplified version of SPSS for
IBM PC and compatible computers. It used a data file format related
to the one described in the previous chapter, but simplified and
incompatible. The SPSS/PC+ software became obsolete in the 1990s, so
files in this format are rarely encountered today. Nevertheless, for
completeness, and because it is not very difficult, it seems
worthwhile to support at least reading these files. This chapter
documents this format, based on examination of a corpus of about 60
files from a variety of sources.
System files use four data types: 8-bit characters, 16-bit unsigned
integers, 32-bit unsigned integers, and 64-bit floating points, called
here @code{char}, @code{uint16}, @code{uint32}, and @code{flt64},
respectively. Data is not necessarily aligned on a word or
double-word boundary.
SPSS/PC+ ran only on IBM PC and compatible computers. Therefore,
values in these files are always in little-endian byte order.
Floating-point numbers are always in IEEE 754 format.
SPSS/PC+ system files represent the system-missing value as -1.66e308,
or @code{f5 1e 26 02 8a 8c ed ff} expressed as hexadecimal. (This is
an unusual choice: it is close to, but not equal to, the largest
negative 64-bit IEEE 754, which is about -1.8e308.)
Text in SPSS/PC+ system file is encoded in ASCII-based 8-bit MS DOS
codepages. The corpus used for investigating the format were all
ASCII-only.
An SPSS/PC+ system file begins with the following 256-byte directory:
@example
uint32 two;
uint32 zero;
struct @{
uint32 ofs;
uint32 len;
@} records[15];
char filename[128];
@end example
@table @code
@item uint32 two;
@itemx uint32 zero;
Always set to 2 and 0, respectively.
These fields could be used as a signature for the file format, but the
@code{product} field in record 0 seems more likely to be unique
(@pxref{Record 0 Main Header Record}).
@item struct @{ @dots{} @} records[15];
Each of the elements in this array identifies a record in the system
file. The @code{ofs} is a byte offset, from the beginning of the
file, that identifies the start of the record. @code{len} specifies
the length of the record, in bytes. Many records are optional or not
used. If a record is not present, @code{ofs} and @code{len} for that
record are both are zero.
@item char filename[128];
In most files in the corpus, this field is entirely filled with
spaces. In one file, it contains a file name, followed by a null
bytes, followed by spaces to fill the remainder of the field. The
meaning is unknown.
@end table
The following sections describe the contents of each record,
identified by the index into the @code{records} array.
@menu
* Record 0 Main Header Record::
* Record 1 Variables Record::
* Record 2 Labels Record::
* Record 3 Data Record::
* Records 4 and 5 Data Entry::
@end menu
@node Record 0 Main Header Record
@section Record 0: Main Header Record
All files in the corpus have this record at offset 0x100 with length
0xb0 (but readers should find this record, like the others, via the
@code{records} table in the directory). Its format is:
@example
uint16 one0;
char product[62];
flt64 sysmis;
uint32 zero0;
uint32 zero1;
uint16 one1;
uint16 compressed;
uint16 nominal_case_size;
uint16 n_cases0;
uint16 weight_index;
uint16 zero2;
uint16 n_cases1;
uint16 zero3;
char creation_date[8];
char creation_time[8];
char label[64];
@end example
@table @code
@item uint16 one0;
@itemx uint16 one1;
Always set to 1.
@item uint32 zero0;
@itemx uint32 zero1;
@itemx uint16 zero2;
@itemx uint16 zero3;
Always set to 0.
It seems likely that one of these variables is set to 1 if weighting
is enabled, but none of the files in the corpus is weighted.
@item char product[62];
Name of the program that created the file. Only the following unique
values have been observed, in each case padded on the right with
spaces:
@example
DESPSS/PC+ System File Written by Data Entry II
PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+
PCSPSS SYSTEM FILE. IBM PC DOS, SPSS/PC+ V3.0
PCSPSS SYSTEM FILE. IBM PC DOS, SPSS for Windows
@end example
Thus, it is reasonable to use the presence of the string @samp{SPSS}
at offset 0x104 as a simple test for an SPSS/PC+ data file.
@item flt64 sysmis;
The system-missing value, as described previously (@pxref{SPSS/PC+
System File Format}).
@item uint16 compressed;
Set to 0 if the data in the file is not compressed, 1 if the data is
compressed with simple bytecode compression.
@item uint16 nominal_case_size;
Number of data elements per case. This is the number of variables,
except that long string variables add extra data elements (one for
every 8 bytes after the first 8). String variables in SPSS/PC+ system
files are limited to 255 bytes.
@item uint16 n_cases0;
@itemx uint16 n_cases1;
The number of cases in the data record. Both values are the same.
Some files in the corpus contain data for the number of cases noted
here, followed by garbage that somewhat resembles data.
@item uint16 weight_index;
0, if the file is unweighted, otherwise a 1-based index into the data
record of the weighting variable, e.g.@: 4 for the first variable
after the 3 system-defined variables.
@item char creation_date[8];
The date that the file was created, in @samp{mm/dd/yy} format.
Single-digit days and months are not prefixed by zeros. The string is
padded with spaces on right or left or both, e.g. @samp{_2/4/93_},
@samp{10/5/87_}, and @samp{_1/11/88} (with @samp{_} standing in for a
space) are all actual examples from the corpus.
@item char creation_time[8];
The time that the file was created, in @samp{HH:MM:SS} format.
Single-digit hours are padded on a left with a space. Minutes and
seconds are always written as two digits.
@item char file_label[64];
File label declared by the user, if any (@pxref{FILE LABEL,,,pspp,
PSPP Users Guide}). Padded on the right with spaces.
@end table
@node Record 1 Variables Record
@section Record 1: Variables Record
The variables record most commonly starts at offset 0x1b0, but it can
be placed elsewhere. The record contains instances of the following
32-byte structure:
@example
uint32 value_label_start;
uint32 value_label_end;
uint32 var_label_ofs;
uint32 format;
char name[8];
union @{
flt64 f;
char s[8];
@} missing;
@end example
The number of instances is the @code{nominal_case_size} specified in
the main header record. There is one instance for each numeric
variable and each string variable with width 8 bytes or less. String
variables wider than 8 bytes have one instance for each 8 bytes,
rounding up. The first instance for a long string specifies the
variable's correct dictionary information. Subsequent instances for a
long string are generally filled with all-zero bytes, although the
@code{missing} field contains the numeric system-missing value, and
some writers also fill in @code{var_label_ofs}, @code{format}, and
@code{name}, sometimes filling the latter with the numeric
system-missing value rather than a text string. Regardless of the
values used, readers should ignore the contents of these additional
instances for long strings.
@table @code
@item uint32 value_label_start;
@itemx uint32 value_label_end;
For a variable with value labels, these specify offsets into the label
record of the start and end of this variable's value labels,
respectively. @xref{Record 2 Labels Record}, for more information.
For a variable without any value labels, these are both zero.
A long string variable may not have value labels.
@item uint32 var_label_ofs;
For a variable with a variable label, this specifies an offset into
the label record. @xref{Record 2 Labels Record}, for more
information.
For a variable without a variable label, this is zero.
@item uint32 format;
The variable's output format, in the same format used in system files.
@xref{System File Output Formats}, for details. SPSS/PC+ system files
only use format types 5 (F, for numeric variables) and 1 (A, for
string variables).
@item char name[8];
The variable's name, padded on the right with spaces.
@item union @{ @dots{} @} missing;
A user-missing value. For numeric variables, @code{missing.f} is the
variable's user-missing value. For string variables, @code{missing.s}
is a string missing value. A variable without a user-missing value is
indicated with @code{missing.f} set to the system-missing value, even
for string variables (!). A Long string variable may not have a
missing value.
@end table
In addition to the user-defined variables, every SPSS/PC+ system file
contains, as its first three variables, the following system-defined
variables, in the following order. The system-defined variables have
no variable label, value labels, or missing values.
@table @code
@item $CASENUM
A numeric variable with format F8.0. Most of the time this is a
sequence number, starting with 1 for the first case and counting up
for each subsequent case. Some files skip over values, which probably
reflects cases that were deleted.
@item $DATE
A string variable with format A8. Same format (including varying
padding) as the @code{creation_date} field in the main header record
(@pxref{Record 0 Main Header Record}). The actual date can differ
from @code{creation_date} and from record to record. This may reflect
when individual cases were added or updated.
@item $WEIGHT
A numeric variable with format F8.2. This represents the case's
weight; SPSS/PC+ files do not have a user-defined weighting variable.
If weighting has not been enabled, every case has value 1.0.
@end table
@node Record 2 Labels Record
@section Record 2: Labels Record
The labels record holds value labels and variable labels. Unlike the
other records, it is not meant to be read directly and sequentially.
Instead, this record must be interpreted one piece at a time, by
following pointers from the variables record.
The @code{value_label_start}, @code{value_label_end}, and
@code{var_label_ofs} fields in a variable record are all offsets
relative to the beginning of the labels record, with an additional
7-byte offset. That is, if the labels record starts at byte offset
@code{labels_ofs} and a variable has a given @code{var_label_ofs},
then the variable label begins at byte offset @math{@code{labels_ofs}
+ @code{var_label_ofs} + 7} in the file.
A variable label, starting at the offset indicated by
@code{var_label_ofs}, consists of a one-byte length followed by the
specified number of bytes of the variable label string, like this:
@example
uint8 length;
char s[length];
@end example
A set of value labels, extending from @code{value_label_start} to
@code{value_label_end} (exclusive), consists of a numeric or string
value followed by a string in the format just described. String
values are padded on the right with spaces to fill the 8-byte field,
like this:
@example
union @{
flt64 f;
char s[8];
@} value;
uint8 length;
char s[length];
@end example
The labels record begins with a pair of uint32 values. The first of
these is always 3. The second is between 8 and 16 less than the
number of bytes in the record. Neither value is important for
interpreting the file.
@node Record 3 Data Record
@section Record 3: Data Record
The format of the data record varies depending on the value of
@code{compressed} in the file header record:
@table @asis
@item 0: no compression
Data is arranged as a series of 8-byte elements, one per variable
instance variable in the variable record (@pxref{Record 1 Variables
Record}). Numeric values are given in @code{flt64} format; string
values are literal characters string, padded on the right with spaces
when necessary to fill out 8-byte units.
@item 1: bytecode compression
The first 8 bytes of the data record is divided into a series of
1-byte command codes. These codes have meanings as described below:
@table @asis
@item 0
The system-missing value.
@item 1
A numeric or string value that is not
compressible. The value is stored in the 8 bytes following the
current block of command bytes. If this value appears twice in a block
of command bytes, then it indicates the second group of 8 bytes following the
command bytes, and so on.
@item 2 through 255
A number with value @var{code} - 100, where @var{code} is the value of
the compression code. For example, code 105 indicates a numeric
variable of value 5.
@end table
The end of the 8-byte group of bytecodes is followed by any 8-byte
blocks of non-compressible values indicated by code 1. After that
follows another 8-byte group of bytecodes, then those bytecodes'
non-compressible values. The pattern repeats up to the number of
cases specified by the main header record have been seen.
The corpus does not contain any files with command codes 2 through 95,
so it is possible that some of these codes are used for special
purposes.
@end table
Cases of data often, but not always, fill the entire data record.
Readers should stop reading after the number of cases specified in the
main header record. Otherwise, readers may try to interpret garbage
following the data as additional cases.
@node Records 4 and 5 Data Entry
@section Records 4 and 5: Data Entry
Records 4 and 5 appear to be related to SPSS/PC+ Data Entry.
|