File: tutorial.texi

package info (click to toggle)
pspp 0.8.4-1
  • links: PTS
  • area: main
  • in suites: jessie, jessie-kfreebsd
  • size: 35,692 kB
  • ctags: 20,600
  • sloc: ansic: 218,288; sh: 12,890; xml: 11,342; perl: 715; lisp: 597; makefile: 157
file content (869 lines) | stat: -rw-r--r-- 35,249 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
@alias prompt = sansserif

@include tut.texi

@node Using PSPP
@chapter Using @pspp{}

@pspp{} is a tool for the statistical analysis of sampled data.
You can use it to discover patterns in the data,
to explain differences in one subset of data in terms of another subset
and to find out
whether certain beliefs about the data are justified.
This chapter does not attempt to introduce the theory behind the 
statistical analysis,
but it shows how such analysis can be performed using @pspp{}.

For the purposes of this tutorial, it is assumed that you are using @pspp{} in its 
interactive mode from the command line.
However, the example commands can also be typed into a file and executed in 
a post-hoc mode by typing @samp{pspp @var{filename}} at a shell prompt,
where @var{filename} is the name of the file containing the commands.
Alternatively, from the graphical interface, you can select
@clicksequence{File @click{} New @click{} Syntax} to open a new syntax window
and use the @clicksequence{Run} menu when a syntax fragment is ready to be 
executed.
Whichever method you choose, the syntax is identical.

When using the interactive method, @pspp{} tells you that it's waiting for your
data with a string like @prompt{PSPP>} or @prompt{data>}.
In the examples of this chapter, whenever you see text like this, it
indicates the prompt displayed by @pspp{}, @emph{not} something that you 
should type.

Throughout this chapter reference is made to a number of sample data files.
So that you can try the examples for yourself,
you should have received these files along with your copy of @pspp{}.@c
@footnote{These files contain purely fictitious data.  They should not be used
for research purposes.}
@note{Normally these files are installed in the directory
@file{@value{example-dir}}.
If however your system administrator or operating system vendor has
chosen to install them in a different location, you will have to adjust
the examples accordingly.}


@menu
* Preparation of Data Files::   
* Data Screening and Transformation::  
* Hypothesis Testing::          
@end menu

@node Preparation of Data Files
@section Preparation of Data Files


Before analysis can commence,  the data must be loaded into @pspp{} and
arranged such that both @pspp{} and humans can understand what
the data represents.
There are two aspects of data:

@itemize @bullet
@item The variables --- these are the parameters of a quantity
 which has been measured or estimated in some way.
 For example height, weight and geographic location are all variables.
@item The observations (also called `cases') of the variables ---
 each observation represents an instance when the variables were measured
 or observed. 
@end itemize

@noindent
For example, a data set which has the variables @var{height}, @var{weight}, and
@var{name}, might have the observations:
@example
1881 89.2 Ahmed
1192 107.01 Frank
1230 67 Julie
@end example
@noindent
The following sections explain how to define a dataset.

@menu
* Defining Variables::          
* Listing the data::            
* Reading data from a text file::  
* Reading data from a pre-prepared PSPP file::  
* Saving data to a PSPP file.::  
* Reading data from other sources::  
@end menu

@node Defining Variables
@subsection Defining Variables
@cindex variables

Variables come in two basic types, @i{viz}: @dfn{numeric} and @dfn{string}.
Variables such as age, height and satisfaction are numeric,
whereas name is a string variable.
String variables are best reserved for commentary data to assist the 
human observer.
However they can also be used for nominal or categorical data.


@ref{data-list} defines two variables @var{forename} and @var{height},
and reads data into them by manual input. 

@float Example, data-list
@cartouche
@example
@prompt{PSPP>} data list list /forename (A12) height.
@prompt{PSPP>} begin data.
@prompt{data>} Ahmed 188
@prompt{data>} Bertram 167
@prompt{data>} Catherine 134.231
@prompt{data>} David 109.1
@prompt{data>} end data
@prompt{PSPP>}
@end example
@end cartouche
@caption{Manual entry of data using the @cmd{DATA LIST} command.
Two variables
@var{forename} and @var{height} are defined and subsequently filled
with  manually entered data.}
@end float

There are several things to note about this example.

@itemize @bullet
@item
The words @samp{data list list} are an example of the @cmd{DATA LIST}
command. @xref{DATA LIST}.
It tells @pspp{} to prepare for reading data.
The word @samp{list} intentionally appears twice.
The first occurrence is part of the @cmd{DATA LIST} call,
whilst the second
tells @pspp{} that the data is to be read as free format data with
one record per line.

@item
The @samp{/} character is important. It marks the start of the list of
variables which you wish to define.

@item
The text @samp{forename} is the name of the first variable,
and @samp{(A12)} says that the variable @var{forename} is a string
variable and that its maximum length is 12 bytes.
The second variable's name is specified by the text @samp{height}.
Since no format is given, this variable has the default format.
Normally the default format expects numeric data, which should be
entered in the locale of the operating system.
Thus, the example is correct for English locales and other
locales which use a period (@samp{.}) as the decimal separator.
However if you are using a system with a locale which uses the comma (@samp{,})
as the decimal separator, then you should in the subsequent lines substitute
@samp{.} with @samp{,}.  
Alternatively, you could explicitly tell @pspp{} that the @var{height} 
variable is to be read using a period as its decimal separator by appending the
text @samp{DOT8.3} after the word @samp{height}.
For more information on data formats, @pxref{Input and Output Formats}.


@item
Normally, @pspp{} displays the  prompt @prompt{PSPP>} whenever it's
expecting a command.
However, when it's expecting data, the prompt changes to @prompt{data>}
so that you know to enter data and not a command.

@item
At the end of every command there is a terminating @samp{.} which tells
@pspp{} that the end of a command has been encountered.
You should not enter @samp{.} when data is expected (@i{ie.} when 
the @prompt{data>} prompt is current) since it is appropriate only for
terminating commands.
@end itemize

@node Listing the data
@subsection Listing the data
@vindex LIST

Once the data has been entered,
you could type
@example
@prompt{PSPP>} list /format=numbered.
@end example
@noindent
to list the data.
The optional text @samp{/format=numbered} requests the case numbers to be 
shown along with the data.
It should show the following output:
@example
@group
Case#     forename   height
----- ------------ --------
    1 Ahmed          188.00 
    2 Bertram        167.00 
    3 Catherine      134.23 
    4 David          109.10 
@end group
@end example
@noindent
Note that the numeric variable @var{height} is displayed to 2 decimal 
places, because the format for that variable is @samp{F8.2}.
For a complete description of the @cmd{LIST} command, @pxref{LIST}.

@node Reading data from a text file
@subsection Reading data from a text file
@cindex reading data

The previous example showed how to define a set of variables and to 
manually enter the data for those variables.
Manual entering of data is tedious work, and often
a file containing the data will be have been previously 
prepared.
Let us assume that you have a file called @file{mydata.dat} containing the
ascii encoded data:
@example
Ahmed          188.00 
Bertram        167.00 
Catherine      134.23 
David          109.10 
@              .
@              .
@              .
Zachariah      113.02
@end example
@noindent
You can can tell the @cmd{DATA LIST} command to read the data directly from
this file instead of by manual entry, with a command like:
@example
@prompt{PSPP>} data list file='mydata.dat' list /forename (A12) height.
@end example
@noindent
Notice however, that it is still necessary to specify the names of the
variables and their formats, since this information is not contained
in the file.
It is also possible to specify the file's character encoding and other 
parameters.
For full details refer to @pxref{DATA LIST}.

@node Reading data from a pre-prepared PSPP file
@subsection Reading data from a pre-prepared @pspp{} file
@cindex system files
@vindex GET

When working with other @pspp{} users, or users of other software which
uses the @pspp{} data format, you may be given the data in
a pre-prepared @pspp{} file.
Such files contain not only the data, but the variable definitions,
along with their formats, labels and other meta-data.
Conventionally, these files (sometimes called ``system'' files) 
have the suffix @file{.sav}, but that is
not mandatory.
The following syntax loads a file called @file{my-file.sav}.
@example
@prompt{PSPP>} get file='my-file.sav'.
@end example
@noindent
You will encounter several instances of this in future examples.


@node Saving data to a PSPP file.
@subsection Saving data to a @pspp{} file.
@cindex saving
@vindex SAVE

If you want to save your data, along with the variable definitions so
that you or other @pspp{} users can use it later, you can do this with
the @cmd{SAVE} command.

The following syntax will save the existing data and variables to a
file called @file{my-new-file.sav}.
@example
@prompt{PSPP>} save outfile='my-new-file.sav'.
@end example
@noindent
If @file{my-new-file.sav} already exists, then it will be overwritten.
Otherwise it will be created.


@node Reading data from other sources
@subsection Reading data from other sources
@cindex comma separated values
@cindex spreadsheets
@cindex databases

Sometimes it's useful to be able to read data from comma 
separated text, from spreadsheets, databases or other sources.
In these instances you should
use the @cmd{GET DATA} command (@pxref{GET DATA}).

  
@node Data Screening and Transformation
@section Data Screening and Transformation

@cindex screening
@cindex transformation

Once data has been entered, it is often desirable, or even necessary,
to transform it in some way before performing analysis upon it.
At the very least, it's good practice to check for errors.

@menu
* Identifying incorrect data::  
* Dealing with suspicious data::  
* Inverting negatively coded variables::  
* Testing data consistency::    
* Testing for normality ::      
@end menu

@node Identifying incorrect data
@subsection Identifying incorrect data
@cindex erroneous data
@cindex errors, in data

Data from real sources is rarely error free.
@pspp{} has a number of procedures which can be used to help 
identify data which might be incorrect.

The @cmd{DESCRIPTIVES} command (@pxref{DESCRIPTIVES}) is used to generate
simple linear statistics for a dataset.  It is also useful for
identifying potential problems in the data.
The example file @file{physiology.sav} contains a number of physiological
measurements of a sample of healthy adults selected at random.
However, the data entry clerk made a number of mistakes when entering
the data.
@ref{descriptives} illustrates the use of @cmd{DESCRIPTIVES} to screen this
data and identify the erroneous values.

@float Example, descriptives
@cartouche
@example
@prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
@prompt{PSPP>} descriptives sex, weight, height.
@end example

Output:
@example
DESCRIPTIVES.  Valid cases = 40; cases with missing value(s) = 0.
+--------#--+-------+-------+-------+-------+
|Variable# N|  Mean |Std Dev|Minimum|Maximum|
#========#==#=======#=======#=======#=======#
|sex     #40|    .45|    .50|    .00|   1.00|
|height  #40|1677.12| 262.87| 179.00|1903.00|
|weight  #40|  72.12|  26.70| -55.60|  92.07|
+--------#--+-------+-------+-------+-------+
@end example
@end cartouche
@caption{Using the @cmd{DESCRIPTIVES} command to display simple 
summary information about the data.
In this case, the results show unexpectedly low values in the Minimum
column, suggesting incorrect data entry.}
@end float

In the output of @ref{descriptives},
the most interesting column is the minimum value.
The @var{weight} variable has a minimum value of less than zero,
which is clearly erroneous.
Similarly, the @var{height} variable's minimum value seems to be very low.
In fact, it is more than 5 standard deviations from the mean, and is a 
seemingly bizarre height for an adult person.
We can examine the data in more detail with the @cmd{EXAMINE}
command (@pxref{EXAMINE}):

In @ref{examine} you can see that the lowest value of @var{height} is
179 (which we suspect to be erroneous), but the second lowest is 1598
which
we know from the @cmd{DESCRIPTIVES} command 
is within 1 standard deviation from the mean.
Similarly the @var{weight} variable has a lowest value which is
negative but a plausible value for the second lowest value.
This suggests that the two extreme values are outliers and probably 
represent data entry errors. 

@float Example, examine
@cartouche
[@dots{} continue from @ref{descriptives}]
@example
@prompt{PSPP>} examine height, weight /statistics=extreme(3).    
@end example

Output:
@example
#===============================#===========#=======#
#                               #Case Number| Value #
#===============================#===========#=======#
#Height in millimetres Highest 1#         14|1903.00#
#                              2#         15|1884.00#
#                              3#         12|1801.65#
#                     ----------#-----------+-------#
#                       Lowest 1#         30| 179.00#
#                              2#         31|1598.00#
#                              3#         28|1601.00#
#                     ----------#-----------+-------#
#Weight in kilograms   Highest 1#         13|  92.07#
#                              2#          5|  92.07#
#                              3#         17|  91.74#
#                     ----------#-----------+-------#
#                       Lowest 1#         38| -55.60#
#                              2#         39|  54.48#
#                              3#         33|  55.45#
#===============================#===========#=======#
@end example
@end cartouche
@caption{Using the @cmd{EXAMINE} command to see the extremities of the data
for different variables.  Cases 30 and 38 seem to contain values
very much lower than the rest of the data.
They are possibly erroneous.}
@end float

@node Dealing with suspicious data
@subsection Dealing with suspicious data

@cindex SYSMIS
@cindex recoding data
If possible, suspect data should be checked and re-measured.
However, this may not always be feasible, in which case the researcher may
decide to disregard these values.
@pspp{} has a feature whereby data can assume the special value `SYSMIS', and
will be disregarded in future analysis. @xref{Missing Observations}.
You can set the two suspect values to the `SYSMIS' value using the @cmd{RECODE}
command.
@example
@pspp{}> recode height (179 = SYSMIS).
@pspp{}> recode weight (LOWEST THRU 0 = SYSMIS).
@end example
@noindent
The first command says that for any observation which has a
@var{height} value of 179, that value should be changed to the SYSMIS
value.
The second command says that any @var{weight} values of zero or less
should be changed to SYSMIS.
From now on, they will be ignored in analysis.
For detailed information about the @cmd{RECODE} command @pxref{RECODE}.

If you now re-run the @cmd{DESCRIPTIVES} or @cmd{EXAMINE} commands in
@ref{descriptives} and @ref{examine} you
will see a data summary with more plausible parameters.
You will also notice that the data summaries indicate the two missing values.

@node Inverting negatively coded variables
@subsection Inverting negatively coded variables

@cindex Likert scale
@cindex Inverting data
Data entry errors are not the only reason for wanting to recode data.
The sample file @file{hotel.sav} comprises data gathered from a 
customer satisfaction survey of clients at a particular hotel.
In @ref{reliability}, this file is loaded for analysis.
The line @code{display dictionary.} tells @pspp{} to display the
variables and associated data.
The output from this command has been omitted from the example for the sake of clarity, but
you will notice that each of the variables
@var{v1}, @var{v2} @dots{} @var{v5}  are measured on a 5 point Likert scale,
with 1 meaning ``Strongly disagree'' and 5 meaning ``Strongly agree''.
Whilst variables @var{v1}, @var{v2} and @var{v4} record responses
to a positively posed question, variables @var{v3} and @var{v5} are 
responses to negatively worded questions.
In order to perform meaningful analysis, we need to recode the variables so
that they all measure in the same direction.
We could use the @cmd{RECODE} command, with syntax such as:
@example
recode v3 (1 = 5) (2 = 4) (4 = 2) (5 = 1).
@end example
@noindent
However an easier and more elegant way uses the @cmd{COMPUTE}
command (@pxref{COMPUTE}).
Since the variables are Likert variables in the range (1 @dots{} 5),
subtracting their value  from 6 has the effect of inverting them:
@example
compute @var{var} = 6 - @var{var}.
@end example
@noindent
@ref{reliability} uses this technique to recode the variables 
@var{v3} and @var{v5}.
After applying  @cmd{COMPUTE} for both variables,
all subsequent commands will use the inverted values.


@node Testing data consistency
@subsection Testing data consistency

@cindex reliability
@cindex consistency

A sensible check to perform on survey data is the calculation of
reliability.
This gives the statistician some confidence that the questionnaires have been 
completed thoughtfully.
If you examine the labels of variables @var{v1},  @var{v3} and @var{v5},
you will notice that they ask very similar questions.
One would therefore expect the values of these variables (after recoding) 
to closely follow one another, and we can test that with the @cmd{RELIABILITY} 
command (@pxref{RELIABILITY}).
@ref{reliability} shows a @pspp{} session where the user (after recoding
negatively scaled variables) requests reliability statistics for
@var{v1}, @var{v3} and @var{v5}.

@float Example, reliability
@cartouche
@example
@prompt{PSPP>} get file='@value{example-dir}/hotel.sav'.
@prompt{PSPP>} display dictionary.
@prompt{PSPP>} * recode negatively worded questions.
@prompt{PSPP>} compute v3 = 6 - v3.
@prompt{PSPP>} compute v5 = 6 - v5.
@prompt{PSPP>} reliability v1, v3, v5.
@end example

Output (dictionary information omitted for clarity):
@example
1.1 RELIABILITY.  Case Processing Summary
#==============#==#======#
#              # N|   %  #
#==============#==#======#
#Cases Valid   #17|100.00#
#      Excluded# 0|   .00#
#      Total   #17|100.00#
#==============#==#======#

1.2 RELIABILITY.  Reliability Statistics
#================#==========#
#Cronbach's Alpha#N of Items#
#================#==========#
#             .86#         3#
#================#==========#
@end example
@end cartouche
@caption{Recoding negatively scaled variables, and testing for
reliability with the @cmd{RELIABILITY} command. The Cronbach Alpha
coefficient suggests a high degree of reliability among variables
@var{v1}, @var{v2} and @var{v5}.}
@end float

As a rule of thumb, many statisticians consider a value of Cronbach's Alpha of 
0.7 or higher to indicate reliable data.
Here, the value is 0.86 so the data and the recoding that we performed 
are vindicated.


@node Testing for normality 
@subsection Testing for normality
@cindex normality, testing 

Many statistical tests rely upon certain properties of the data.
One common property, upon which many linear tests depend, is that of
normality --- the data must have been drawn from a normal distribution.
It is necessary then to ensure normality before deciding upon the
test procedure to use.  One way to do this uses the @cmd{EXAMINE} command.

In @ref{normality}, a researcher was examining the failure rates
of equipment produced by an engineering company.
The file @file{repairs.sav} contains the mean time between
failures (@var{mtbf}) of some items of equipment subject to the study.
Before performing linear analysis on the data, 
the researcher wanted to ascertain that the data is normally distributed.

A normal distribution has a skewness and kurtosis of zero.
Looking at the skewness of @var{mtbf} in @ref{normality} it is clear
that the mtbf figures have a lot of positive skew and are therefore
not drawn from a normally distributed variable.
Positive skew can often be compensated for by applying a logarithmic 
transformation.
This is done with the @cmd{COMPUTE} command in the line
@example
compute mtbf_ln = ln (mtbf).
@end example
@noindent
Rather than redefining the existing variable, this use of @cmd{COMPUTE}
defines a new variable @var{mtbf_ln} which is
the natural logarithm of @var{mtbf}.
The final command in this example calls @cmd{EXAMINE} on this new variable,
and it can be seen from the results that both the skewness and
kurtosis for @var{mtbf_ln} are very close to zero.
This provides some confidence that the @var{mtbf_ln} variable is 
normally distributed and thus safe for linear analysis.
In the event that no suitable transformation can be found,
then it would be worth considering 
an appropriate non-parametric test instead of a linear one.
@xref{NPAR TESTS}, for information about non-parametric tests.

@float Example, normality
@cartouche
@example
@prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
@prompt{PSPP>} examine mtbf 
                /statistics=descriptives.
@prompt{PSPP>} compute mtbf_ln = ln (mtbf).
@prompt{PSPP>} examine mtbf_ln
                /statistics=descriptives.
@end example

Output:
@example
1.2 EXAMINE.  Descriptives
#====================================================#=========#==========#
#                                                    #Statistic|Std. Error#
#====================================================#=========#==========#
#mtbf    Mean                                        #   8.32  |   1.62   #
#        95% Confidence Interval for Mean Lower Bound#   4.85  |          #
#                                         Upper Bound#  11.79  |          #
#        5% Trimmed Mean                             #   7.69  |          #
#        Median                                      #   8.12  |          #
#        Variance                                    #  39.21  |          #
#        Std. Deviation                              #   6.26  |          #
#        Minimum                                     #   1.63  |          #
#        Maximum                                     #  26.47  |          #
#        Range                                       #  24.84  |          #
#        Interquartile Range                         #   5.83  |          #
#        Skewness                                    #   1.85  |    .58   #
#        Kurtosis                                    #   4.49  |   1.12   #
#====================================================#=========#==========#

2.2 EXAMINE.  Descriptives
#====================================================#=========#==========#
#                                                    #Statistic|Std. Error#
#====================================================#=========#==========#
#mtbf_ln Mean                                        #   1.88  |    .19   #
#        95% Confidence Interval for Mean Lower Bound#   1.47  |          #
#                                         Upper Bound#   2.29  |          #
#        5% Trimmed Mean                             #   1.88  |          #
#        Median                                      #   2.09  |          #
#        Variance                                    #   .54   |          #
#        Std. Deviation                              #   .74   |          #
#        Minimum                                     #   .49   |          #
#        Maximum                                     #   3.28  |          #
#        Range                                       #   2.79  |          #
#        Interquartile Range                         #   .92   |          #
#        Skewness                                    #   -.16  |    .58   #
#        Kurtosis                                    #   -.09  |   1.12   #
#====================================================#=========#==========#
@end example
@end cartouche
@caption{Testing for normality using the @cmd{EXAMINE} command and applying
a logarithmic transformation.
The @var{mtbf} variable has a large positive skew and is therefore
unsuitable for linear statistical analysis.
However the transformed variable (@var{mtbf_ln}) is close to normal and
would appear to be more suitable.}
@end float


@node Hypothesis Testing
@section Hypothesis Testing

@cindex Hypothesis testing
@cindex p-value
@cindex null hypothesis

One of the most fundamental purposes of statistical analysis
is hypothesis testing.
Researchers commonly need to test hypotheses about a set of data.
For example, she might want to test whether one set of data comes from
the same distribution as another,
or
whether the mean of a dataset significantly differs from a particular
value.
This section presents just some of the possible tests that @pspp{} offers.

The researcher starts by making a @dfn{null hypothesis}.
Often this is a hypothesis which he suspects to be false.
For example, if he suspects that @var{A} is greater than @var{B} he will 
state the null hypothesis as @math{ @var{A} = @var{B}}.@c
@footnote{This example assumes that it is already proven that @var{B} is
not greater than @var{A}.}

The @dfn{p-value} is a recurring concept in hypothesis testing.
It is the highest acceptable probability that the evidence implying a
null hypothesis is false, could have been obtained when the null
hypothesis is in fact true.
Note that this is not the same as ``the probability of making an
error'' nor is it the same as ``the probability of rejecting a
hypothesis when it is true''.



@menu
* Testing for differences of means::  
* Linear Regression::           
@end menu

@node Testing for differences of means
@subsection Testing for differences of means

@cindex T-test
@vindex T-TEST

A common statistical test involves hypotheses about means.
The @cmd{T-TEST} command is used to find out whether or not two separate 
subsets have the same mean.

@ref{t-test} uses the file @file{physiology.sav} previously
encountered.
A researcher suspected that the heights and core body 
temperature of persons might be different depending upon their sex.
To investigate this, he posed two null hypotheses: 
@itemize @bullet
@item The mean heights of males and females in the population are equal.
@item The mean body temperature of males and
      females in the population are equal.
@end itemize
@noindent
For the purposes of the investigation the researcher
decided to use a  p-value of 0.05.

In addition to the T-test, the @cmd{T-TEST} command also performs the 
Levene test for equal variances.
If the variances are equal, then a more powerful form of the T-test can be used.
However if it is unsafe to assume equal variances,
then an alternative calculation is necessary.
@pspp{} performs both calculations.

For the @var{height} variable, the output shows the significance of the 
Levene test to be 0.33 which means there is a 
33% probability that the  
Levene test produces this outcome when the variances are equal.
Had the significance been less than 0.05, then it would have been unsafe to assume that
the variances were equal.
However, because the value is higher than 0.05 the homogeneity of variances assumption
is safe and the ``Equal Variances'' row (the more powerful test) can be used.
Examining this row, the two tailed significance for the @var{height} t-test 
is less than 0.05, so it is safe to reject the null hypothesis and conclude
that the mean heights of males and females are unequal.

For the @var{temperature} variable, the significance of the Levene test 
is 0.58 so again, it is safe to use the row for equal variances.
The equal variances row indicates that the two tailed significance for
@var{temperature} is 0.20.  Since this is greater than 0.05 we must reject
the null hypothesis and conclude that there is insufficient evidence to 
suggest that the body temperature of male and female persons are different.

@float Example, t-test
@cartouche
@example
@prompt{PSPP>} get file='@value{example-dir}/physiology.sav'.
@prompt{PSPP>} recode height (179 = SYSMIS).
@prompt{PSPP>} t-test group=sex(0,1) /variables = height temperature.
@end example
Output:
@example
1.1 T-TEST.  Group Statistics
#==================#==#=======#==============#========#
#              sex | N|  Mean |Std. Deviation|SE. Mean#
#==================#==#=======#==============#========#
#height      Male  |22|1796.49|         49.71|   10.60#
#            Female|17|1610.77|         25.43|    6.17#
#temperature Male  |22|  36.68|          1.95|     .42#
#            Female|18|  37.43|          1.61|     .38#
#==================#==#=======#==============#========#
1.2 T-TEST.  Independent Samples Test
#===========================#=========#===============================   =#
#                           # Levene's| t-test for Equality of Means      #
#                           #----+----+------+-----+------+---------+-   -#
#                           #    |    |      |     |      |         |     #
#                           #    |    |      |     |Sig. 2|         |     #
#                           #  F |Sig.|   t  |  df |tailed|Mean Diff|     #
#===========================#====#====#======#=====#======#=========#=   =#
#height      Equal variances# .97| .33| 14.02|37.00|   .00|   185.72| ... #
#          Unequal variances#    |    | 15.15|32.71|   .00|   185.72| ... #
#temperature Equal variances# .31| .58| -1.31|38.00|   .20|     -.75| ... #
#          Unequal variances#    |    | -1.33|37.99|   .19|     -.75| ... #
#===========================#====#====#======#=====#======#=========#=   =#
@end example                                                          
@end cartouche
@caption{The @cmd{T-TEST} command tests for differences of means. 
Here, the @var{height} variable's two tailed significance is less than
0.05, so the null hypothesis can be rejected.
Thus, the evidence suggests there is a difference between the heights of
male and female persons. 
However the significance of the test for the @var{temperature}
variable is greater than 0.05 so the null hypothesis cannot be
rejected, and there is insufficient evidence to suggest a difference
in body temperature.}
@end float

@node Linear Regression
@subsection Linear Regression
@cindex linear regression
@vindex REGRESSION

Linear regression is a technique used to investigate if and how a variable 
is linearly related to others.
If a variable is found to be linearly related, then this can be used to 
predict future values of that variable.

In example @ref{regression}, the service department of the company wanted to
be able to predict the time to repair equipment, in order to improve
the accuracy of their quotations.
It was suggested that the time to repair might be related to the time
between failures and the duty cycle of the equipment.
The p-value of 0.1 was chosen for this investigation.
In order to investigate this hypothesis, the @cmd{REGRESSION} command
was used.
This command not only tests if the variables are related, but also
identifies the potential linear relationship. @xref{REGRESSION}.


@float Example, regression 
@cartouche
@example
@prompt{PSPP>} get file='@value{example-dir}/repairs.sav'.
@prompt{PSPP>} regression /variables = mtbf duty_cycle /dependent = mttr.
@prompt{PSPP>} regression /variables = mtbf /dependent = mttr.
@end example
Output:
@example
1.3(1) REGRESSION.  Coefficients
#=============================================#====#==========#====#=====#
#                                             #  B |Std. Error|Beta|  t  #
#========#====================================#====#==========#====#=====#
#        |(Constant)                          #9.81|      1.50| .00| 6.54#
#        |Mean time between failures (months) #3.10|       .10| .99|32.43#
#        |Ratio of working to non-working time#1.09|      1.78| .02|  .61#
#        |                                    #    |          |    |     #
#========#====================================#====#==========#====#=====#

1.3(2) REGRESSION.  Coefficients
#=============================================#============#
#                                             #Significance#
#========#====================================#============#
#        |(Constant)                          #         .10#
#        |Mean time between failures (months) #         .00#
#        |Ratio of working to non-working time#         .55#
#        |                                    #            #
#========#====================================#============#
2.3(1) REGRESSION.  Coefficients
#============================================#=====#==========#====#=====#
#                                            #  B  |Std. Error|Beta|  t  #
#========#===================================#=====#==========#====#=====#
#        |(Constant)                         #10.50|       .96| .00|10.96#
#        |Mean time between failures (months)# 3.11|       .09| .99|33.39#
#        |                                   #     |          |    |     #
#========#===================================#=====#==========#====#=====#

2.3(2) REGRESSION.  Coefficients
#============================================#============#
#                                            #Significance#
#========#===================================#============#
#        |(Constant)                         #         .06#
#        |Mean time between failures (months)#         .00#
#        |                                   #            #
#========#===================================#============#
@end example
@end cartouche
@caption{Linear regression analysis to find a predictor for
@var{mttr}.
The first attempt, including @var{duty_cycle}, produces some
unacceptable high significance values.
However the second attempt, which excludes @var{duty_cycle}, produces
significance values no higher than 0.06.
This suggests that @var{mtbf} alone may be a suitable predictor
for @var{mttr}.}
@end float

The coefficients in the first table suggest that the formula
@math{@var{mttr} = 9.81 + 3.1 \times @var{mtbf} + 1.09 \times @var{duty_cycle}}
can be used to predict the time to repair.
However, the significance value for the @var{duty_cycle} coefficient
is very high, which would make this an unsafe predictor.
For this reason, the test was repeated, but omitting the
@var{duty_cycle} variable.
This time, the significance of all coefficients no higher than 0.06,
suggesting that at the 0.06 level, the formula
@math{@var{mttr} = 10.5 + 3.11 \times @var{mtbf}} is a reliable
predictor of the time to repair.


@c  LocalWords:  PSPP dir itemize noindent var cindex dfn cartouche samp xref
@c  LocalWords:  pxref ie sav Std Dev kilograms SYSMIS sansserif pre pspp emph
@c  LocalWords:  Likert Cronbach's Cronbach mtbf npplot ln myfile cmd NPAR Sig
@c  LocalWords:  vindex Levene Levene's df Diff clicksequence mydata dat ascii
@c  LocalWords:  mttr outfile