@c PSPP - a program for statistical analysis.
@c Copyright (C) 2017, 2020 Free Software Foundation, Inc.
@c Permission is granted to copy, distribute and/or modify this document
@c under the terms of the GNU Free Documentation License, Version 1.3
@c or any later version published by the Free Software Foundation;
@c with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
@c A copy of the license is included in the section entitled "GNU
@c Free Documentation License".
@c
@node Combining Data Files
@chapter Combining Data Files
This chapter describes commands that allow data from system files,
portable files, and open datasets to be combined to
form a new active dataset. These commands can combine data files in the
following ways:
@itemize
@item
@cmd{ADD FILES} interleaves or appends the cases from each input file.
It is used with input files that have variables in common, but
distinct sets of cases.
@item
@cmd{MATCH FILES} adds the data together in cases that match across
multiple input files. It is used with input files that have cases in
common, but different information about each case.
@item
@cmd{UPDATE} updates a master data file from data in a set of
transaction files. Each case in a transaction data file modifies a
matching case in the master data file, or it adds a new case if no
matching case can be found.
@end itemize
These commands share the majority of their syntax, which is described
in the following section, followed by one section for each command
that describes its specific syntax and semantics.
@menu
* Combining Files Common Syntax::
* ADD FILES:: Interleave cases from multiple files.
* MATCH FILES:: Merge cases from multiple files.
* UPDATE:: Update cases using transactional data.
@end menu
@node Combining Files Common Syntax
@section Common Syntax
@display
Per input file:
/FILE=@{*,'@var{file_name}'@}
[/RENAME=(@var{src_names}=@var{target_names})@dots{}]
[/IN=@var{var_name}]
[/SORT]
Once per command:
/BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@})]]@dots{}
[/DROP=@var{var_list}]
[/KEEP=@var{var_list}]
[/FIRST=@var{var_name}]
[/LAST=@var{var_name}]
[/MAP]
@end display
This section describes the syntactical features in common among the
@cmd{ADD FILES}, @cmd{MATCH FILES}, and @cmd{UPDATE} commands. The
following sections describe details specific to each command.
Each of these commands reads two or more input files and combines them.
The command's output becomes the new active dataset.
None of the commands actually change the input files.
Therefore, if you want the changes to become permanent, you must explicitly
save them using an appropriate procedure or transformation (@pxref{System and Portable File IO}).
The syntax of each command begins with a specification of the files to
be read as input. For each input file, specify @subcmd{FILE} with a system
file or portable file's name as a string, a dataset (@pxref{Datasets})
or file handle name (@pxref{File Handles}), or an asterisk (@samp{*})
to use the active dataset as input. Use of portable files on @subcmd{FILE} is a
@pspp{} extension.
At least two @subcmd{FILE} subcommands must be specified. If the active dataset
is used as an input source, then @cmd{TEMPORARY} must not be in
effect.
Each @subcmd{FILE} subcommand may be followed by any number of @subcmd{RENAME}
subcommands that specify a parenthesized group or groups of variable
names as they appear in the input file, followed by those variables'
new names, separated by an equals sign (@subcmd{=}),
@i{e.g.} @subcmd{/RENAME=(OLD1=NEW1)(OLD2=NEW2)}. To rename a single
variable, the parentheses may be omitted: @subcmd{/RENAME=@var{old}=@var{new}}.
Within a parenthesized group, variables are renamed simultaneously, so
that @subcmd{/RENAME=(@var{A} @var{B}=@var{B} @var{A})} exchanges the
names of variables @var{A} and @var{B}.
Otherwise, renaming occurs in left-to-right order.
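As an illustration of @subcmd{RENAME}, the following sketch keeps two
measurements apart that would otherwise be matched as one variable
because they share a name. The file names @file{wave1.sav} and
@file{wave2.sav} and the variable names are hypothetical, chosen only
for this example.

@example
* Hypothetical input files, created here so the example is complete.
DATA LIST LIST NOTABLE /id score.
BEGIN DATA.
1 10
2 20
END DATA.
SAVE OUTFILE='wave1.sav'.

DATA LIST LIST NOTABLE /id score.
BEGIN DATA.
1 15
2 25
END DATA.
SAVE OUTFILE='wave2.sav'.

* Rename score in each input so that both values survive the merge.
MATCH FILES
  /FILE='wave1.sav' /RENAME=(score=score1)
  /FILE='wave2.sav' /RENAME=(score=score2)
  /BY id.
LIST.
@end example

Because variables are matched by name after renaming, the output should
contain @code{id}, @code{score1}, and @code{score2}.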
Each @subcmd{FILE} subcommand may optionally be followed by a single @subcmd{IN}
subcommand, which creates a numeric variable with the specified name
and format F1.0. The @subcmd{IN} variable takes the value 1 in an output case if
the given input file contributed to that output case, and 0 otherwise.
The @subcmd{DROP}, @subcmd{KEEP}, and @subcmd{RENAME} subcommands have no effect on @subcmd{IN} variables.
If @subcmd{BY} is used (see below), the @subcmd{SORT} keyword must be specified after a
@subcmd{FILE} if that input file is not already sorted on the @subcmd{BY} variables.
When @subcmd{SORT} is specified, @pspp{} sorts the input file's data on the @subcmd{BY}
variables before combining it with the other input files. When @subcmd{SORT} is used, @subcmd{BY}
is required. @subcmd{SORT} is a @pspp{} extension.
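The following sketch combines @subcmd{IN} and @subcmd{SORT}. It assumes
two hypothetical files, @file{unsorted.sav} and @file{sorted.sav}, of
which only the second is already in @code{id} order. The variable names
@code{from_x} and @code{from_y} are likewise made up for the example.

@example
* Hypothetical input file whose cases are not sorted on id.
DATA LIST LIST NOTABLE /id x.
BEGIN DATA.
3 30
1 10
END DATA.
SAVE OUTFILE='unsorted.sav'.

* Hypothetical input file that is already sorted on id.
DATA LIST LIST NOTABLE /id y.
BEGIN DATA.
1 100
2 200
END DATA.
SAVE OUTFILE='sorted.sav'.

* SORT is given only for the file that needs it.
MATCH FILES
  /FILE='unsorted.sav' /IN=from_x /SORT
  /FILE='sorted.sav' /IN=from_y
  /BY id.
LIST.
@end example

Following the rules above, @code{from_x} should be 1 for the cases with
@code{id} 1 and 3, and @code{from_y} should be 1 for the cases with
@code{id} 1 and 2.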
@pspp{} merges the dictionaries of all of the input files to form the
dictionary of the new active dataset, like so:
@itemize @bullet
@item
The variables in the new active dataset are the union of all the input files'
variables, matched based on their name. When a single input file
contains a variable with a given name, the output file will contain
exactly that variable. When more than one input file contains a
variable with a given name, those variables must be all string or all numeric.
If they are string variables, then the result will have the width of the longest
variable with that name, with narrower values padded on the right with spaces
to fill the width.
Variables are matched after renaming with the @subcmd{RENAME} subcommand.
Thus, @subcmd{RENAME} can be used to resolve conflicts.
Only variables in the output file can conflict, so @subcmd{DROP} or
@subcmd{KEEP}, as described below, can also resolve a conflict.
@item
The variable label for each output variable is taken from the first
specified input file that has a variable label for that variable, and
similarly for value labels and missing values.
@item
The file label of the new active dataset (@pxref{FILE LABEL}) is that of the
first specified @subcmd{FILE} that has a file label.
@item
The documents in the new active dataset (@pxref{DOCUMENT}) are the
concatenation of all the input files' documents, in the order in which
the @subcmd{FILE} subcommands are specified.
@item
If all of the input files are weighted on the same variable, then the
new active dataset is weighted on that variable. Otherwise, the new
active dataset is not weighted.
@end itemize
The remaining subcommands apply to the output file as a whole, rather
than to individual input files. They must be specified at the end of
the command specification, following all of the @subcmd{FILE} and related
subcommands. The most important of these subcommands is @subcmd{BY}, which
specifies a set of one or more variables that may be used to find
corresponding cases in each of the input files. The variables
specified on @subcmd{BY} must be present in all of the input files.
Furthermore, if any of the input files are not sorted on the @subcmd{BY}
variables, then @subcmd{SORT} must be specified for those input files.
The variables listed on @subcmd{BY} may include (A) or (D) annotations to
specify ascending or descending sort order. @xref{SORT CASES}, for
more details on this notation. Adding (A) or (D) to the @subcmd{BY} subcommand
specification is a @pspp{} extension.
The @subcmd{DROP} subcommand can be used to specify a list of variables to
exclude from the output. By contrast, the @subcmd{KEEP} subcommand can be used
to specify variables to include in the output; all variables not
listed are dropped. @subcmd{DROP} and @subcmd{KEEP} are executed in left-to-right order
and may be repeated any number of times. @subcmd{DROP} and @subcmd{KEEP} do not affect
variables created by the @subcmd{IN}, @subcmd{FIRST}, and @subcmd{LAST} subcommands, which are
always included in the new active dataset, but they can be used to drop
@subcmd{BY} variables.
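A minimal sketch of @subcmd{DROP} and @subcmd{KEEP} applied in
left-to-right order follows. The files @file{part1.sav} and
@file{part2.sav} and their variables are hypothetical.

@example
* Hypothetical input files with identical variables.
DATA LIST LIST NOTABLE /id a b c.
BEGIN DATA.
1 10 20 30
END DATA.
SAVE OUTFILE='part1.sav'.

DATA LIST LIST NOTABLE /id a b c.
BEGIN DATA.
2 11 21 31
END DATA.
SAVE OUTFILE='part2.sav'.

* DROP removes c first, then KEEP retains only id and a.
ADD FILES
  /FILE='part1.sav'
  /FILE='part2.sav'
  /DROP=c
  /KEEP=id a.
LIST.
@end example

After @subcmd{DROP} removes @code{c}, @subcmd{KEEP} discards every
remaining variable that it does not list, so @code{b} is dropped as
well.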
The @subcmd{FIRST} and @subcmd{LAST} subcommands are optional. They may only be
specified on @cmd{MATCH FILES} and @cmd{ADD FILES}, and only when @subcmd{BY}
is used. @subcmd{FIRST} and @subcmd{LAST} each add a numeric variable to the new
active dataset, with the name given as the subcommand's argument and F1.0
print and write formats. The value of the @subcmd{FIRST} variable is 1 in the
first output case with a given set of values for the @subcmd{BY} variables, and
0 in other cases. Similarly, the @subcmd{LAST} variable is 1 in the last case
with a given set of @subcmd{BY} values, and 0 in other cases.
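To illustrate @subcmd{FIRST} and @subcmd{LAST}, the sketch below flags
the boundaries of each group of cases that share an @code{id}. The file
names and the variable names @code{is_first} and @code{is_last} are
assumptions made for the example.

@example
* Hypothetical input files, each already sorted on id.
DATA LIST LIST NOTABLE /id visit.
BEGIN DATA.
1 1
2 1
END DATA.
SAVE OUTFILE='visits1.sav'.

DATA LIST LIST NOTABLE /id visit.
BEGIN DATA.
1 2
1 3
END DATA.
SAVE OUTFILE='visits2.sav'.

* Mark the first and last output case for each value of id.
ADD FILES
  /FILE='visits1.sav'
  /FILE='visits2.sav'
  /BY id
  /FIRST=is_first
  /LAST=is_last.
LIST.
@end example

With this data, @code{id} 1 has three output cases, so only the first
of them should have @code{is_first} set and only the last should have
@code{is_last} set. The single case with @code{id} 2 is both first and
last in its group.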
When any of these commands creates an output case, variables that are
only in files that are not present for the current case are set to the
system-missing value for numeric variables or spaces for string
variables.
These commands may combine any number of files, limited only by the
machine's memory.
@node ADD FILES
@section ADD FILES
@vindex ADD FILES
@display
ADD FILES
Per input file:
/FILE=@{*,'@var{file_name}'@}
[/RENAME=(@var{src_names}=@var{target_names})@dots{}]
[/IN=@var{var_name}]
[/SORT]
Once per command:
[/BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@})]@dots{}]]
[/DROP=@var{var_list}]
[/KEEP=@var{var_list}]
[/FIRST=@var{var_name}]
[/LAST=@var{var_name}]
[/MAP]
@end display
@cmd{ADD FILES} adds cases from multiple input files. The output,
which replaces the active dataset, consists of all of the cases in all of
the input files.
@cmd{ADD FILES} shares the bulk of its syntax with other @pspp{} commands for
combining multiple data files. @xref{Combining Files Common Syntax},
above, for an explanation of this common syntax.
When @subcmd{BY} is not used, the output of @cmd{ADD FILES} consists of all the cases
from the first input file specified, followed by all the cases from
the second file specified, and so on. When @subcmd{BY} is used, the output is
additionally sorted on the @subcmd{BY} variables.
When @cmd{ADD FILES} creates an output case, variables that are not part of
the input file from which the case was drawn are set to the
system-missing value for numeric variables or spaces for string
variables.
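The following sketch appends two files without @subcmd{BY}. The files
@file{site_a.sav} and @file{site_b.sav} are hypothetical, and each
deliberately contains one variable that the other lacks.

@example
* Hypothetical input file with a height variable.
DATA LIST LIST NOTABLE /id height.
BEGIN DATA.
1 170
2 180
END DATA.
SAVE OUTFILE='site_a.sav'.

* Hypothetical input file with a weight variable instead.
DATA LIST LIST NOTABLE /id weight.
BEGIN DATA.
3 65
4 80
END DATA.
SAVE OUTFILE='site_b.sav'.

* Append the cases of the second file to those of the first.
ADD FILES
  /FILE='site_a.sav'
  /FILE='site_b.sav'.
LIST.
@end example

The output should hold the two cases from @file{site_a.sav} followed by
the two from @file{site_b.sav}, with @code{weight} system-missing in
the former and @code{height} system-missing in the latter.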
@node MATCH FILES
@section MATCH FILES
@vindex MATCH FILES
@display
MATCH FILES
Per input file:
/@{FILE,TABLE@}=@{*,'@var{file_name}'@}
[/RENAME=(@var{src_names}=@var{target_names})@dots{}]
[/IN=@var{var_name}]
[/SORT]
Once per command:
/BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@})]@dots{}]
[/DROP=@var{var_list}]
[/KEEP=@var{var_list}]
[/FIRST=@var{var_name}]
[/LAST=@var{var_name}]
[/MAP]
@end display
@cmd{MATCH FILES} merges sets of corresponding cases in multiple
input files into single cases in the output, combining their data.
@cmd{MATCH FILES} shares the bulk of its syntax with other @pspp{} commands for
combining multiple data files. @xref{Combining Files Common Syntax},
above, for an explanation of this common syntax.
How @cmd{MATCH FILES} matches up cases from the input files depends on
whether @subcmd{BY} is specified:
@itemize @bullet
@item
If @subcmd{BY} is not used, @cmd{MATCH FILES} combines the first case from each input
file to produce the first output case, then the second case from each
input file for the second output case, and so on. If some input files
have fewer cases than others, then the shorter files do not contribute
to cases output after their input has been exhausted.
@item
If @subcmd{BY} is used, @cmd{MATCH FILES} combines cases from each input file that
have identical values for the @subcmd{BY} variables.
When @subcmd{BY} is used, @subcmd{TABLE} subcommands may be used to introduce a @dfn{table
lookup file}. @subcmd{TABLE} has the same syntax as @subcmd{FILE}, and the @subcmd{RENAME}, @subcmd{IN}, and
@subcmd{SORT} subcommands may follow a @subcmd{TABLE} in the same way as @subcmd{FILE}.
Regardless of the number of @subcmd{TABLE}s, at least one @subcmd{FILE} must be specified.
Table lookup files are treated in the same way as other input files
for most purposes and, in particular, table lookup files must be
sorted on the @subcmd{BY} variables or the @subcmd{SORT} subcommand must be specified
for that @subcmd{TABLE}.
Cases in table lookup files are not consumed after they have been used
once. This means that data in table lookup files can correspond to
any number of cases in @subcmd{FILE} input files. Table lookup files are
analogous to lookup tables in traditional relational database systems.
If a table lookup file contains more than one case with a given set of
@subcmd{BY} variables, only the first case is used.
@end itemize
When @cmd{MATCH FILES} creates an output case, variables that are only in
files that are not present for the current case are set to the
system-missing value for numeric variables or spaces for string
variables.
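As a sketch of a table lookup, the example below attaches a department
name to each employee. The files @file{staff.sav} and @file{depts.sav}
and all variable names are hypothetical.

@example
* Hypothetical case file with one case per employee, sorted on dept.
DATA LIST LIST NOTABLE /dept (F2.0) employee (A8).
BEGIN DATA.
1 Ann
1 Bo
2 Cy
END DATA.
SAVE OUTFILE='staff.sav'.

* Hypothetical lookup file with one case per department.
DATA LIST LIST NOTABLE /dept (F2.0) deptname (A8).
BEGIN DATA.
1 Sales
2 Support
END DATA.
SAVE OUTFILE='depts.sav'.

* Each case in the lookup file may serve any number of staff cases.
MATCH FILES
  /FILE='staff.sav'
  /TABLE='depts.sav'
  /BY dept.
LIST.
@end example

Because cases in a table lookup file are not consumed, both employees
in department 1 should receive the same @code{deptname} value.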
@node UPDATE
@section UPDATE
@vindex UPDATE
@display
UPDATE
Per input file:
/FILE=@{*,'@var{file_name}'@}
[/RENAME=(@var{src_names}=@var{target_names})@dots{}]
[/IN=@var{var_name}]
[/SORT]
Once per command:
/BY @var{var_list}[(@{D|A@})] [@var{var_list}[(@{D|A@})]]@dots{}
[/DROP=@var{var_list}]
[/KEEP=@var{var_list}]
[/MAP]
@end display
@cmd{UPDATE} updates a @dfn{master file} by applying modifications
from one or more @dfn{transaction files}.
@cmd{UPDATE} shares the bulk of its syntax with other @pspp{} commands for
combining multiple data files. @xref{Combining Files Common Syntax},
above, for an explanation of this common syntax.
At least two @subcmd{FILE} subcommands must be specified. The first @subcmd{FILE}
subcommand names the master file, and the rest name transaction files.
Every input file must either be sorted on the variables named on the
@subcmd{BY} subcommand, or the @subcmd{SORT} subcommand must be used just after the @subcmd{FILE}
subcommand for that input file.
@cmd{UPDATE} uses the variables specified on the @subcmd{BY} subcommand, which is
required, to attempt to match each case in a transaction file with a
case in the master file:
@itemize @bullet
@item
When a match is found, then the values of the variables present in the
transaction file replace those variables' values in the new active
file. If there are matching cases in more than one transaction file,
@pspp{} applies the replacements from the first transaction file, then
from the second transaction file, and so on. Similarly, if a single
transaction file has cases with duplicate @subcmd{BY} values, then those are
applied in order to the master file.
When a variable in a transaction file has a missing value or when a string
variable's value is all blanks, that value is never used to update the
master file.
@item
If a case in the master file has no matching case in any transaction
file, then it is copied unchanged to the output.
@item
If a case in a transaction file has no matching case in the master
file, then it causes a new case to be added to the output, initialized
from the values in the transaction file.
@end itemize
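The three cases above can be sketched with one master file and one
transaction file. The file names @file{master.sav} and
@file{trans.sav} and the variables are hypothetical. A lone period in
the data stands for a system-missing value.

@example
* Hypothetical master file, sorted on id.
DATA LIST LIST NOTABLE /id price qty.
BEGIN DATA.
1 10 5
2 20 7
END DATA.
SAVE OUTFILE='master.sav'.

* Hypothetical transaction file, sorted on id.
DATA LIST LIST NOTABLE /id price qty.
BEGIN DATA.
2 25 .
3 30 1
END DATA.
SAVE OUTFILE='trans.sav'.

* The first FILE is the master file and the second holds transactions.
UPDATE
  /FILE='master.sav'
  /FILE='trans.sav'
  /BY id.
LIST.
@end example

By the rules above, the case with @code{id} 1 should be copied
unchanged, the case with @code{id} 2 should take the new @code{price}
but keep its original @code{qty} because the transaction value is
missing, and the case with @code{id} 3 should be added as a new case.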