File: autoclass.1
.\" -*-nroff-*-
.TH AUTOCLASS 1 "December 9, 2001"
.SH NAME
autoclass \- automatically discover classes in data
.SH SYNOPSIS
.ad l
.B autoclass "-search "
.I data_file header_file model_file s_param_file
.br
.B autoclass "-reports "
.I results_file search_file r_params_file
.\" .br
.\" .B autoclass "-predict "
.\" .I data_file
.br
.B autoclass "-predict "
.I results_file search_file results_file
.ad b
.br
.SH "DESCRIPTION"
\fBAutoClass\fP solves the problem of automatic discovery of classes in data
(sometimes called clustering, or unsupervised learning), as distinct
from the generation of class descriptions from labeled examples
(called supervised learning).  It aims to discover the "natural"
classes in the data.  \fBAutoClass\fP is applicable to observations of
things that can be described by a set of attributes, without referring
to other things.  The data values corresponding to each attribute are
limited to be either numbers or the elements of a fixed set of
symbols.  With numeric data, a measurement error must be provided.
.PP
\fBAutoClass\fP is looking for the best classification(s) of the data it can find.
A classification is composed of: 
.IP 1) 
A set of classes, each of which is described by a set of class 
parameters, which specify how the class is distributed along the 
various attributes.  For example, "height normally distributed with 
mean 4.67 ft and standard deviation .32 ft", 
.IP 2) 
A set of class weights, describing what percentage of cases are 
likely to be in each class. 
.IP 3) 
A probabilistic assignment of cases in the data to these classes.  
I.e. for each case, the relative probability that it is a member of
each class.  
.PP
As a strictly Bayesian system (accept no substitutes!), the quality
measure \fBAutoClass\fP uses is the total probability that, had you known
nothing about your data or its domain, you would have found this set of
data generated by this underlying model.  This includes the prior
probability that the "world" would have chosen this number of classes,
this set of relative class weights, and this set of parameters for each
class, and the likelihood that such a set of classes would have generated
this set of values for the attributes in the data cases.
.PP
These probabilities are typically very small, in the range of e^-30000,
and so are usually expressed in exponential notation.
.PP
When run with the \fB-search\fP command, \fBAutoClass\fP searches for
a classification.  The required arguments are the paths to the four
input files, which supply the data, the data format, the desired
classification model, and the search parameters, respectively.
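.PP
For example, a search over a hypothetical data set whose four input
files share the base name "iris" (illustrative names only) would be
invoked as:
.nf

autoclass -search iris.db2 iris.hd2 iris.model iris.s-params

.fi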
.PP
By default, \fBAutoClass\fP writes intermediate results in a binary file.
With the \fB-reports\fP command, \fBAutoClass\fP generates an ASCII
report.  The arguments are the full path names of the .results, .search, and .r-params files.
.PP
When run with the \fB-predict\fP command, \fBAutoClass\fP predicts the
class membership of a "test" data set based on classes found in a
"training" data set (see "PREDICTIONS" below).
.SH "INPUT FILES"
An AutoClass data set resides in two files.  There is a header file
(file type "hd2") that describes the specific data format and attribute
definitions.  The actual data values are in a data file (file type "db2").
We use two files to allow editing of data descriptions without having to
deal with the entire data set.  This makes it easy to experiment with 
different descriptions of the database without having to reproduce the data
set.  Internally, an AutoClass database structure is identified by its
header and data files, and the number of data loaded.
.PP
For more detailed information on the formats of these files, see
\fI/usr/share/doc/autoclass/preparation-c.text\fP.
.SS "DATA FILE"
The data file contains a sequence of data objects (datum or case)
terminated by the end of the file. The number of values for each data
object must be equal to the number of attributes defined in the header
file.  Data objects must be groups of tokens delimited by "new-line".
Attributes are typed as REAL, DISCRETE, or DUMMY.  Real attribute
values are numbers, either integer or floating point.  Discrete
attribute values can be strings, symbols, or integers.  A dummy
attribute value can be any of these types.  Dummy values are read in but
otherwise ignored -- they will be set to zeros in the internal
database.  Thus the actual values will not be available for use in
report output.  To have these attribute values available, use either
type REAL or type DISCRETE, and define their model type as IGNORE in
the .model file.  Missing values for any attribute type may be
represented either by "?" or by another token specified in the header
file.  All are translated to a special unique value after being read,
so this symbol is effectively reserved for unknown/missing values.
.PP
For example:
.nf
      white       38.991306 0.54248405  2 2 1
      red         25.254923 0.5010235   9 2 1
      yellow      32.407973 ?           8 2 1
      all_white   28.953982 0.5267696   0 1 1
.fi
.SS "HEADER FILE"
The header file specifies the data file format and the definitions of
the data attributes.  It consists of two parts -- the data set format
definitions and the attribute descriptors.  A ";" in column 1
identifies a comment.
.PP
A header file follows this general format:
.nf

    ;; num_db2_format_defs value (number of format def lines
    ;; that follow), range of n is 1 -> 5
    num_db2_format_defs n
    ;; number_of_attributes token and value required
    number_of_attributes <as required>
    ;; following are optional - default values are specified
    separator_char  ' '
    comment_char    ';'
    unknown_token   '?'
    separator_char  ','
    
    ;; attribute descriptors
    ;; <zero-based att#>  <att_type>  <att_sub_type>  <att_description>
    ;; <att_param_pairs>

.fi
Each attribute descriptor is a line of:
.nf

      Attribute index (zero based, beginning in column 1)
      Attribute type.  See below.
      Attribute subtype.  See below
      Attribute description: symbol (no embedded blanks) or
            string; <= 40 characters
      Specific property and value pairs.
            Currently available combinations:

         type           subtype         property type(s)
         ----           --------        ---------------
         dummy          none/nil        --
         discrete       nominal         range
         real           location        error
         real           scalar          zero_point rel_error

.fi
The ERROR property should represent your best estimate of the
average error expected in the measurement and recording of that real
attribute.  Lacking better information, the error can be taken as 1/2
the minimum possible difference between measured values.  It can be
argued that real values are often truncated, so that smaller errors
may be justified, particularly for generated data.  But AutoClass only
sees the recorded values.  So it needs the error in the recorded
values, rather than the actual measurement error.  Setting this error
much smaller than the minimum expressible difference implies the
possibility of values that cannot be expressed in the data.  Worse, it
implies that two identical values must represent measurements that
were much closer than they might actually have been.  This leads to
over-fitting of the classification.
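For example, if a weight attribute is recorded to the nearest 0.1 kg,
an error of 0.05 would be a reasonable starting estimate.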

The REL_ERROR property is used for SCALAR reals, where the error is
proportional to the measured value.  The absolute ERROR property is
not supported for SCALAR reals.

AutoClass uses the error as a lower bound on the width of the
normal distribution.  So small error estimates tend to give narrower
peaks and to increase both the number of classes and the
classification probability.  Broad error estimates tend to limit the
number of classes.

The scalar ZERO_POINT property is the smallest value that the
measurement process could have produced.  This is often 0.0, or less
by some error range.  Similarly, the bounded real's min and max
properties are exclusive bounds on the attribute's generating process.
For a calculated percentage these would be 0-e and 100+e, where e is
an error value.  The discrete attribute's range is the number of
possible values the attribute can take on.  This range must include
unknown as a value when such values occur.

Header File Example:
.nf

!#; AutoClass C header file -- extension .hd2
!#; the following chars in column 1 make the line a comment:
!#; '!', '#', ';', ' ', and '\\n' (empty line)

;#! num_db2_format_defs <num of def lines -- min 1, max 4>
num_db2_format_defs 2
;; required
number_of_attributes 7
;; optional - default values are specified
;; separator_char  ' '
;; comment_char    ';'
;; unknown_token   '?'
separator_char     ','

;; <zero-based att#>  <att_type>  <att_sub_type>  <att_description>
;; <att_param_pairs>
0 dummy nil       "True class, range = 1 - 3"
1 real location "X location, m. in range of 25.0 - 40.0" error .25
2 real location "Y location, m. in range of 0.5 - 0.7" error .05
3 real scalar   "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0
rel_error .001
4 discrete nominal  "Truth value, range = 1 - 2" range 2
5 discrete nominal  "Color of foobar, 10 values" range 10
6 discrete nominal  Spectral_color_group range 6
.fi
.SS "MODEL FILE"
A classification of a data set is made with respect to a model which
specifies the form of the probability distribution function for classes in that
data set.  Normally the model structure is defined in a model file (file
type "model"), containing one or more models.  Internally, a model is defined
relative to a particular database.  Thus it is identified by the corresponding
database, the model's model file and its sequential position in the
file.
.PP
Each model is specified by one or more model group definition lines.
Each model group line associates attribute indices with a
model term type.
.PP
Here is an example model file:
.nf

# AutoClass C model file -- extension .model
model_index 0 7
ignore 0
single_normal_cn 3
single_normal_cn 17 18 21
multi_normal_cn 1 2
multi_normal_cn 8 9 10
multi_normal_cn 11 12 13
single_multinomial default
.fi
.PP
Here, the first line is a comment.  The following characters in column
1 make the line a comment: `!', `#', ` ', `;', and `\\n' (empty line).
.PP
The tokens "model_index \fIn m\fP" must appear on the first non-comment
line, and precede the model term definition lines. \fIn\fP is the
zero-based model index, typically 0 where there is only one model --
the majority of search situations.  \fIm\fP is the number of model term
definition lines that follow.

The last seven lines are model group lines.  Each model group line
consists of:
.br
.ad l
.nh
.HP 4
A model term type (one of 
.BR single_multinomial ,
.BR single_normal_cm , 
.BR single_normal_cn ,
.BR multi_normal_cn ", or"
.BR ignore ).
.HP 4
A list of attribute indices (the attribute set list), or the symbol
\fBdefault\fP.  Attribute indices are zero-based.  Single model terms
may have one or more attribute indices on each line, while multi model
terms require two or more attribute indices per line.  An attribute
index must not appear more than once in a model list.
.ad b
.hy
.PP
Notes:
.IP 1)
At least one model definition is required (model_index token).
.IP 2)
There may be multiple entries in a model for any model term type.
.IP 3)
Model term types currently consist of:
.RS
.TP
.B single_multinomial
models discrete attributes as multinomials, with missing values.
.TP
.B single_normal_cn
models real valued attributes as normals; no missing values.
.TP
.B single_normal_cm
models real valued attributes with missing values.
.TP
.B multi_normal_cn
is a covariant normal model without missing values.
.TP
.B ignore
allows the model to ignore one or more attributes.
\fBignore\fP is not a valid default model term type.
.PP
See the documentation in models-c.text for further information about
specific model terms.
.RE
.IP 4)
Data modeled by \fBsingle_normal_cn\fP, \fBsingle_normal_cm\fP, and
\fBmulti_normal_cn\fP whose subtype is \fBscalar\fP (the value
distribution is away from 0.0, and is thus not a "normal"
distribution) will be log transformed and modeled with the log-normal
model.  For data whose subtype is \fBlocation\fP (the value
distribution is around 0.0), no transform is done and the normal
model is used.
.SH SEARCHING
AutoClass, when invoked in "search" mode, will check the validity
of the set of data, header, model, and search parameter files.  Errors
will stop the search from starting, and warnings will ask the user
whether to continue.  A history of the error and warning messages is
saved, by default, in the log file.
.PP
Once you have succeeded in describing your data with a header file
and model file that pass the AUTOCLASS -SEARCH <...> input checks,
you will have entered the search domain where \fBAutoClass\fP classifies
your data.  (At last!)
.PP
The main function to use in finding a good classification of your data
is AUTOCLASS -SEARCH, and using it will take most of the computation
time.  Searches are invoked with:
.nf

autoclass -search <.db2 file path> <.hd2 file path> 
	<.model file path> <.s-params file path>

.fi
All files must be specified as fully qualified relative or absolute
pathnames.  File name extensions (file types) for all files are forced
to canonical values required by the AutoClass program:
.nf

        data file   ("ascii")   db2
        data file   ("binary")  db2-bin
        header file             hd2
        model file              model
        search params file      s-params
.fi
.PP
The sample-run (\fI/usr/share/doc/autoclass/examples/\fP) that comes
with \fBAutoClass\fP shows some sample searches, and browsing these is
probably the fastest way to get familiar with how to do searches.  The
test data sets located under \fI/usr/share/doc/autoclass/examples/\fP
will show you some other header (.hd2), model (.model), and search
params (.s-params) file setups.  The remainder of this section
describes how to do searches in somewhat more detail.
.PP
The \fBbold faced\fP tokens below are generally search params file
parameters.  For more information on the s-params file, see \fBSEARCH
PARAMETERS\fP below, or
\fI/usr/share/doc/autoclass/search-c.text.gz\fP.
.SS "WHAT RESULTS ARE"
\fBAutoClass\fP is looking for the best classification(s) of the data
it can find.  A classification is composed of:
.IP 1) 
a set of classes, each of which is described by a set of class 
parameters, which specify how the class is distributed along the 
various attributes.  For example, "height normally distributed with 
mean 4.67 ft and standard deviation .32 ft", 
.IP 2)
a set of class weights, describing what percentage of cases are 
likely to be in each class. 
.IP 3)
a probabilistic assignment of cases in the data to these classes.  
I.e. for each case, the relative probability that it is a member of
each class.  
.PP
As a strictly Bayesian system (accept no substitutes!), the quality
measure \fBAutoClass\fP uses is the total probability that, had you known
nothing about your data or its domain, you would have found this set of
data generated by this underlying model.  This includes the prior
probability that the "world" would have chosen this number of classes,
this set of relative class weights, and this set of parameters for each
class, and the likelihood that such a set of classes would have generated
this set of values for the attributes in the data cases.
.PP
These probabilities are typically very small, in the range of e^-30000,
and so are usually expressed in exponential notation.
.SS "WHAT RESULTS MEAN"
It is important to remember that all of these probabilities are GIVEN
that the real model is in the model family that \fBAutoClass\fP has
restricted its attention to.  If \fBAutoClass\fP is looking for
Gaussian classes and the real classes are Poisson, then the fact that
\fBAutoClass\fP found 5 Gaussian classes may not say much about how
many Poisson classes there really are.
.PP
The relative probability between different classifications found can
be very large, like e^1000, so the very best classification found is
usually overwhelmingly more probable than the rest (and overwhelmingly
less probable than any better classifications as yet undiscovered).
If \fBAutoClass\fP should manage to find two classifications that are
within about exp(5-10) of each other (i.e. within 100 to 10,000 times
more probable) then you should consider them to be about equally
probable, as our computation is usually not more accurate than this
(and sometimes much less).
.SS "HOW IT WORKS"
\fBAutoClass\fP repeatedly creates a random classification and then
tries to massage this into a high probability classification through
local changes, until it converges to some "local maximum".  It then
remembers what it found and starts over again, continuing until you
tell it to stop.  Each effort is called a "try", and the computed
probability is intended to cover the whole volume in parameter space
around this maximum, rather than just the peak.
.PP
The standard approach to massaging is to 
.IP 1)
Compute the probabilistic class memberships of cases using the class
parameters and the implied relative likelihoods.
.IP 2)
Using the new class members, compute class statistics (like mean)
and revise the class parameters.
.PP
and repeat till they stop changing.  There are three available
convergence algorithms: "converge_search_3" (the default),
"converge_search_4" and "converge".  Their specification is controlled
by search params file parameter \fBtry_fn_type\fP.
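.PP
For example, to select one of the alternate convergence algorithms, a
single line in the .s-params file suffices (a sketch):
.nf

try_fn_type = "converge_search_4"

.fi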
.SS "WHEN TO STOP"
You can tell AUTOCLASS -SEARCH to stop by: 1) giving a
\fBmax_duration\fP (in seconds) argument at the beginning; 2) giving a
\fBmax_n_tries\fP (an integer) argument at the beginning; or 3) by
typing a "q" and <return> after you have seen enough tries.  The
\fBmax_duration\fP and \fBmax_n_tries\fP arguments are useful if you
desire to run AUTOCLASS -SEARCH in batch mode.  If you are restarting
AUTOCLASS -SEARCH from a previous search, the value of
\fBmax_n_tries\fP you provide, for instance 3, will tell the program
to compute 3 more tries in addition to however many it has already
done.  The same incremental behavior is exhibited by
\fBmax_duration\fP.
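.PP
For example, a batch-mode .s-params fragment that halts after at most
100 tries or two hours of searching (a sketch; the limits are
illustrative):
.nf

interactive_p = false
max_n_tries = 100
max_duration = 7200

.fi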
.PP
Deciding when to stop is a judgment call and it's up to you.  Since the
search includes a random component, there's always the chance that if
you let it keep going it will find something better.  So you need to
trade off how much better it might be with how long it might take to
find it.  The search status reports that are printed when a new best
classification is found are intended to provide you information to help
you make this tradeoff.
.PP
One clear sign that you should probably stop is if most of the
classifications found are duplicates of previous ones (flagged by
"dup" as they are found).  This should only happen for very small sets
of data or when fixing a very small number of classes, like two.
.PP
Our experience is that for moderately large to extremely large data
sets (~200 to ~10,000 data), it is necessary to run \fBAutoClass\fP
for at least 50 trials.
.SS "WHAT GETS RETURNED"
Just before returning, AUTOCLASS -SEARCH will give short descriptions
of the best classifications found.  How many will be described can be
controlled with \fBn_final_summary\fP.
.PP
By default AUTOCLASS -SEARCH will write out a number of files, both at
the end and periodically during the search (in case your system
crashes before it finishes).  These files will all have the same name
(taken from the search params pathname [<name>.s-params]), and differ
only in their file extensions.  If your search runs are very long and
there is a possibility that your machine may crash, you can have
intermediate "results" files written out.  These can be used to
restart your search run with minimum loss of search effort.  See the
documentation file \fI/usr/share/doc/autoclass/checkpoint-c.text\fP.
.PP
A ".log" file will hold a listing of most of what was printed to the
screen during the run, unless you set \fBlog_file_p\fP to false to say
you want no such foolishness.  Unless \fBresults_file_p\fP is false, a
binary ".results-bin" file (the default) or an ASCII ".results" text
file, will hold the best classifications that were returned, and
unless \fBsearch_file_p\fP is false, a ".search" file will hold the
record of the search tries. \fBsave_compact_p\fP controls whether the
"results" files are saved as binary or ASCII text.
.PP
If the C global variable "G_safe_file_writing_p" is defined as TRUE in
"autoclass-c/prog/globals.c", the names of "results" files (those that
contain the saved classifications) are modified internally to account 
for redundant file writing.  If the search params file name is
"my_saved_clsfs" you will see the following "results" file names
(ignoring directories and pathnames for this example)
.sp
.nf
  \fBsave_compact_p\fP = true --
  "my_saved_clsfs.results-bin"	- completely written file
  "my_saved_clsfs.results-tmp-bin" - partially written file, renamed
				  when complete

  \fBsave_compact_p\fP = false --
  "my_saved_clsfs.results"	- completely written file
  "my_saved_clsfs.results-tmp"  - partially written file, renamed
				  when complete
.fi
.sp
If check pointing is being done, these additional names will appear
.sp
.nf
  \fBsave_compact_p\fP = true --
  "my_saved_clsfs.chkpt-bin"	- completely written checkpoint file
  "my_saved_clsfs.chkpt-tmp-bin" - partially written checkpoint file, 
				     renamed when complete
  \fBsave_compact_p\fP = false --
  "my_saved_clsfs.chkpt"	- completely written checkpoint file
  "my_saved_clsfs.chkpt-tmp"    - partially written checkpoint file, 
				     renamed when complete
.fi
.sp
.SS "HOW TO GET STARTED"
The way to invoke AUTOCLASS -SEARCH is:
.nf

autoclass -search <.db2 file path> <.hd2 file path> 
	<.model file path> <.s-params file path>

.fi
To restart a previous search, specify that \fBforce_new_search_p\fP
has the value false in the search params file, since its default is
true.  Specifying false tells AUTOCLASS -SEARCH to try to find a
previous compatible search (<...>.results[-bin] & <...>.search) to
continue from, and will restart using it if found.  To force a new
search instead of restarting an old one, give the parameter
\fBforce_new_search_p\fP the value of true, or use the default.  If
there is an existing search (<...>.results[-bin] & <...>.search), the
user will be asked to confirm before proceeding, since the new search
will discard the existing search files.
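.PP
For example, to continue a previously saved search, the .s-params file
for the new run needs only (a sketch):
.nf

force_new_search_p = false

.fi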
.PP
If a previous search is continued, the message "RESTARTING SEARCH" will
be given instead of the usual "BEGINNING SEARCH".  It is generally
better to continue a previous search than to start a new one, unless
you are trying a significantly different search method, in which case
statistics from the previous search may mislead the current one.
.SS "STATUS REPORTS"
A running commentary on the search will be printed to the screen and
to the log file (unless \fBlog_file_p\fP is false).  Note that the
".log" file will contain a listing of all default search params
values, and the values of all params that are overridden.
.PP
After each try a very short report (only a few characters long) is
given.  After each new best classification, a longer report is given,
but no more often than \fBmin_report_period\fP (default is 30
seconds).
.SS "SEARCH VARIATIONS"
AUTOCLASS -SEARCH by default uses a certain standard search method or
"try function" (\fBtry_fn_type\fP = "converge_search_3").  Two others
are also available: "converge_search_4" and "converge".  They are
provided in case your problem is one that may happen to benefit from
them.  In general the default method will result in finding better
classifications at the expense of a longer search time.  The default
was chosen so as to be robust, giving even performance across many
problems.  The alternatives to the default may do better on some
problems, but may do substantially worse on others.
.PP
"converge_search_3" uses an absolute stopping criterion
(\fBrel_delta_range\fP, default value of 0.0025) which tests, for each
class, the change between successive convergence cycles of the log
approximate-marginal-likelihood of the class statistics with respect
to the class hypothesis (class->log_a_w_s_h_j), divided by the class
weight (class->w_j).  Increasing this value loosens
the convergence and reduces the number of cycles.  Decreasing this
value tightens the convergence and increases the number of
cycles. \fBn_average\fP (default value of 3) specifies how many successive
cycles must meet the stopping criterion before the trial terminates.
.PP
"converge_search_4" uses an absolute stopping criterion
(\fBcs4_delta_range\fP, default value of 0.0025) which tests, for each
class, the variation of the slope of the log
approximate-marginal-likelihood of the class statistics with respect
to the class hypothesis (class->log_a_w_s_h_j), divided by the class
weight (class->w_j), over \fBsigma_beta_n_values\fP (default value
6) convergence cycles.  Increasing the value of \fBcs4_delta_range\fP
loosens the convergence and reduces the number of cycles.  Decreasing
this value tightens the convergence and increases the number of
cycles.  Computationally, this try function is more expensive than
"converge_search_3", but may prove useful if the computational "noise"
is significant compared to the variations in the computed values.  Key
calculations are done in double precision floating point, and for the
largest data base we have tested so far (5,420 cases of 93
attributes), computational noise has not been a problem, although the
value of \fBmax_cycles\fP needed to be increased to 400.
.PP
"converge" uses one of two absolute stopping criteria which test the
variation of the classification (clsf) log_marginal (clsf->log_a_x_h)
delta between successive convergence cycles.  The larger of
\fBhalt_range\fP (default value 0.5) and (\fBhalt_factor\fP *
\fBcurrent_clsf_log_marginal\fP) is used (default value of
\fBhalt_factor\fP is 0.0001).  Increasing these values loosens the
convergence and reduces the number of cycles.  Decreasing these values
tightens the convergence and increases the number of cycles.
\fBn_average\fP (default value of 3) specifies how many cycles must
meet the stopping criteria before the trial terminates.  This is a
very approximate stopping criterion, but will give you some feel for
the kind of classifications to expect.  It would be useful for
"exploratory" searches of a data base.
.PP
The purpose of \fBreconverge_type\fP = "chkpt" is to complete an
interrupted classification by continuing from its last checkpoint.
The purpose of \fBreconverge_type\fP = "results" is to attempt further
refinement of the best completed classification using a different
value of \fBtry_fn_type\fP ("converge_search_3", "converge_search_4",
"converge").  If \fBmax_n_tries\fP is greater than 1, then in each case,
after the reconvergence has completed, \fBAutoClass\fP will perform
further search trials based on the parameter values in the
<...>.s-params file.
.PP
With the use of \fBreconverge_type\fP (default value ""), you may
apply more than one try function to a classification.  Say you
generate several exploratory trials using \fBtry_fn_type\fP =
"converge", and quit the search saving .search and .results[-bin]
files.  Then you can begin another search with \fBtry_fn_type\fP =
"converge_search_3", \fBreconverge_type\fP = "results", and
\fBmax_n_tries\fP = 1.  This will result in the further convergence of
the best classification generated with \fBtry_fn_type\fP = "converge",
with \fBtry_fn_type\fP = "converge_search_3".  When \fBAutoClass\fP
completes this search try, you will have an additional refined
classification.
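.PP
A sketch of the .s-params settings for such a two-stage run (the
second run presumably also needs \fBforce_new_search_p\fP = false so
that the saved .search and .results[-bin] files are reused):
.nf

;; first, exploratory run
try_fn_type = "converge"

;; second run, refining the best saved classification
force_new_search_p = false
reconverge_type = "results"
try_fn_type = "converge_search_3"
max_n_tries = 1

.fi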
.PP
A good way to verify that any of the alternate \fBtry_fn_type\fP values is
generating a well converged classification is to run \fBAutoClass\fP
in prediction mode on the same data used for generating the
classification.  Then generate and compare the corresponding case or
class cross reference files for the original classification and the
prediction.  Small differences between these files are to be expected,
while large differences indicate incomplete convergence.  Differences
between such file pairs should, on average and modulo class deletions,
decrease monotonically with further convergence.
.PP
The standard way to create a random classification to begin a try is
with the default value of "random" for \fBstart_fn_type\fP.  At this
point there is no alternative random initialization.  Specifying
"block" for \fBstart_fn_type\fP produces repeatable, non-random
searches.  That is
how the <..>.s-params files in the autoclass-c/data/.. sub-directories
are specified.  This is how development testing is done.
.PP
\fBmax_cycles\fP controls the maximum number of convergence cycles that will
be performed in any one trial by the convergence functions.  Its default
value is 200.  The screen output shows a period (".") for each cycle
completed. If your search trials run for 200 cycles, then either your
data base is very complex (increase the value), or the \fBtry_fn_type\fP
is not adequate for the situation (try another of the available ones, and
use \fBconverge_print_p\fP to get more information on what is going on).
.PP
Specifying \fBconverge_print_p\fP to be true will generate a brief
print-out for each cycle which will provide information so that you
can modify the default values of \fBrel_delta_range\fP &
\fBn_average\fP for "converge_search_3"; \fBcs4_delta_range\fP &
\fBsigma_beta_n_values\fP for "converge_search_4"; and
\fBhalt_range\fP, \fBhalt_factor\fP, and \fBn_average\fP for
"converge".  Their default values are given in the <..>.s-params files
in the autoclass-c/data/..  sub-directories.
.SS "HOW MANY CLASSES?"
Each new try begins with a certain number of classes and may end up
with a smaller number, as some classes may drop out of the convergence.
In general, you want to begin the try with some number of classes that
previous tries have indicated look promising, and you want to be sure 
you are fishing around elsewhere in case you missed something before.
.PP
\fBn_classes_fn_type\fP = "random_ln_normal" is the default way to make this
choice.  It fits a log normal to the number of classes (usually called "j"
for short) of the 10 best classifications found so far, and randomly
selects from that.  There is currently no alternative.
.PP
To start the game off, the default is to go down \fBstart_j_list\fP
for the first few tries, and then switch to \fBn_classes_fn_type\fP.
If you believe that the probable number of classes in your data base
is say 75, then instead of using the default value of \fBstart_j_list\fP (2,
3, 5, 7, 10, 15, 25), specify something like 50, 60, 70, 80, 90, 100.
.PP
If one wants to always look for, say, three classes, one can use
\fBfixed_j\fP and override the above.  Search status reports will describe 
what the current method for choosing j is.
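.PP
For example (a sketch; the numbers are illustrative):
.nf

;; expecting roughly 75 classes
start_j_list = 50, 60, 70, 80, 90, 100

;; or, to always search with exactly three classes
;; fixed_j = 3

.fi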
.SS "DO I HAVE ENOUGH MEMORY AND DISK SPACE?"
Internally, the storage requirements in the current system are of
order n_classes_per_clsf * (n_data + n_stored_clsfs * n_attributes *
n_attribute_values).  This depends on the number of cases, the number
of attributes, the values per attribute (use 2 if a real value), and
the number of classifications stored away for comparison to see if
others are duplicates -- controlled by \fBmax_n_store\fP (default
value = 10).  The search process does not itself consume significant
memory, but storage of the results may do so.
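.PP
As an illustration only: with 10 classes per classification, 5,000
cases, 10 stored classifications, and 20 attributes averaging 4 values
each, the estimate is 10 * (5000 + 10 * 20 * 4) = 58,000 storage
units.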
.PP
\fBAutoClass C\fP is configured to handle a maximum of 999 attributes.
If you attempt to run with more than that you will get array bound
violations.  In that case, change these configuration parameters in
prog/autoclass.h and recompile \fBAutoClass C\fP:
.nf

#define ALL_ATTRIBUTES                  999   
#define VERY_LONG_STRING_LENGTH         20000 
#define VERY_LONG_TOKEN_LENGTH          500 

.fi
For example, these values will handle several thousand attributes:
.nf

#define ALL_ATTRIBUTES                  9999
#define VERY_LONG_STRING_LENGTH         50000
#define VERY_LONG_TOKEN_LENGTH          50000

.fi
Disk space taken up by the "log" file will of course depend on the
duration of the search.  \fBn_save\fP (default value = 2) determines how
many best classifications are saved into the ".results[-bin]" file.
\fBsave_compact_p\fP controls whether the "results" and "checkpoint" files
are saved as binary.  Binary files are faster and more compact, but
are not portable.  The default value of \fBsave_compact_p\fP is true, which
causes binary files to be written.
.PP
If the time taken to save the "results" files is a problem, consider
increasing \fBmin_save_period\fP (default value = 1800 seconds or 30
minutes).  Files are saved to disk this often if there is anything
different to report.
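.PP
For example, to save the classifications as editable ASCII text and to
write them to disk at most once an hour (a sketch):
.nf

save_compact_p = false
min_save_period = 3600

.fi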
.SS "JUST HOW SLOW IS IT?"
Compute time is of order n_data * n_attributes * n_classes * n_tries
* converge_cycles_per_try. The major uncertainties in this are the
number of basic back and forth cycles till convergence in each try, and of
course the number of tries.  The number of cycles per trial is typically 
10-100 for \fBtry_fn_type\fP "converge", and 10-200+ for "converge_search_3"
and "converge_search_4".  The maximum number of cycles is specified by \fBmax_cycles\fP
(default value = 200).  The number of trials is up to you and your
available computing resources.
.PP
The running time of very large data sets will be quite uncertain.  We
advise that a few small scale test runs be made on your system to
determine a baseline.  Specify \fBn_data\fP to limit how many data vectors
are read.  Given a very large quantity of data, \fBAutoClass\fP may
find its most probable classifications at upwards of a hundred
classes, and this will require that \fBstart_j_list\fP be specified
appropriately (See above section \fBHOW MANY CLASSES?\fP).  If you are
quite certain that you only want a few classes, you can force
\fBAutoClass\fP to search with a fixed number of classes specified by
\fBfixed_j\fP.  You will then need to run separate searches with each
different fixed number of classes.
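.PP
For example, a small-scale timing run restricted to the first 500 data
vectors and five tries might use (a sketch; the values are
illustrative):
.nf

n_data = 500
max_n_tries = 5

.fi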
.SS "CHANGING FILENAMES IN A SAVED CLASSIFICATION FILE"
\fBAutoClass\fP caches the data, header, and model file pathnames in
the saved classification structure of the binary (".results-bin") or
ASCII (".results") "results" files.  If the "results" and "search"
files are moved to a different directory location, the search cannot
be successfully restarted if you have used absolute pathnames.  Thus
it is advantageous to invoke \fBAutoClass\fP in a parent directory
of the data, header, and model files, so that relative pathnames can
be used.  Since the cached pathnames will then be relative, the files
can be moved to a different host or file system and restarted --
provided the same relative pathname hierarchy exists.
.PP
However, since the ".results" file is ASCII text, those pathnames
could be changed with a text editor (\fBsave_compact_p\fP must be
specified as false).
.SS "SEARCH PARAMETERS"
The search is controlled by the ".s-params" file.  In this file, an
empty line or a line starting with one of these characters is treated
as a comment: "#", "!", or ";".  The parameter name and its value can
be separated by an equal sign, a space, or a tab:
.sp
.nf
	n_clsfs 1
	n_clsfs = 1
	n_clsfs<tab>1
.fi
.sp
Spaces are ignored if "=" or "<tab>" are used as separators.  Note
there are no trailing semicolons.
.PP
The search parameters, with their default values, are as follows:
.IP "\fBrel_error\fP = 0.01"
Specifies the relative difference measure used by clsf-DS-%=, when 
deciding if a new clsf is a duplicate of an old one.  
.IP "\fBstart_j_list\fP = 2, 3, 5, 7, 10, 15, 25"
Initially try these numbers of classes, so as not to narrow the
search too quickly.  The state of this list is saved in the <..>.search 
file and used on restarts, unless an override
specification of \fBstart_j_list\fP is made in the .s-params file for the
restart run.  This list should bracket your expected number of
classes, and by a wide margin!
"start_j_list = -999" specifies an empty list (allowed only on restarts).
.IP "\fBn_classes_fn_type\fP = ""random_ln_normal"""
Once \fBstart_j_list\fP is exhausted, \fBAutoClass\fP will call this
function to decide how many classes to start with on the next try,
based on the 10 best classifications found so far.  Currently only
"random_ln_normal" is available.
.IP "\fBfixed_j\fP = 0"
When \fBfixed_j\fP > 0, overrides \fBstart_j_list\fP and
\fBn_classes_fn_type,\fP and \fBAutoClass\fP will always use this value for
the initial number of classes.
.IP "\fBmin_report_period\fP = 30"
Wait at least this time (in seconds) since last report until reporting
verbosely again.  Should be set longer than the expected run time when
checking for repeatability of results.  For repeatable results, also
see \fBforce_new_search_p,\fP \fBstart_fn_type\fP and
\fBrandomize_random_p\fP. \fINOTE\fP: At least one of "interactive_p",
"max_duration", and "max_n_tries" must be active.  Otherwise
\fBAutoClass\fP will run indefinitely.  See below.
.IP "\fBinteractive_p\fP = true"
When false, allows run to continue until otherwise halted.
When true, standard input is queried on each cycle for the quit
character "q", which, when detected, triggers an immediate halt. 
.IP "\fBmax_duration\fP = 0"
When = 0, allows run to continue until otherwise halted.
When > 0, specifies the maximum number of seconds to run.  
.IP "\fBmax_n_tries\fP = 0"
When = 0, allows run to continue until otherwise halted.
When > 0, specifies the maximum number of tries to make.
.IP "\fBn_save\fP = 2"
Save this many clsfs to disk in the .results[-bin] and .search files.
If 0, don't save anything (no .search & .results[-bin] files).
.IP "\fBlog_file_p\fP = true"
If false, do not write a log file.
.IP "\fBsearch_file_p\fP = true"
If false, do not write a search file. 
.IP "\fBresults_file_p\fP = true"
If false, do not write a results file.
.IP "\fBmin_save_period\fP = 1800"
CPU crash protection.  This specifies the maximum time, in seconds,
that \fBAutoClass\fP will run before it saves the current results to
disk.  The default time is 30 minutes.
.IP "\fBmax_n_store\fP = 10"
Specifies the maximum number of classifications stored internally.
.IP "\fBn_final_summary\fP = 10"
Specifies the number of trials to be printed out after search ends.
.IP "\fBstart_fn_type\fP = ""random"""
One of {"random", "block"}.  This specifies the type of class
initialization.  For normal search, use "random", which randomly
selects instances to be initial class means, and adds appropriate
variances. For testing with repeatable search, use "block", which
partitions the database into successive blocks of near equal size.
For repeatable results, also see \fBforce_new_search_p\fP,
\fBmin_report_period\fP, and \fBrandomize_random_p\fP.
.IP "\fBtry_fn_type\fP = ""converge_search_3"""
One of {"converge_search_3", "converge_search_4", "converge"}. 
These specify alternate search stopping criteria.  
"converge" merely tests the rate of change of the log_marginal
classification probability (clsf->log_a_x_h), without checking
rate of change of individual classes (see \fBhalt_range\fP and
\fBhalt_factor\fP).  
"converge_search_3" and "converge_search_4" each monitor the ratio
class->log_a_w_s_h_j/class->w_j for all classes, and continue
convergence until all pass the quiescence criteria for \fBn_average\fP
cycles.  "converge_search_3" tests differences between successive
convergence cycles (see \fBrel_delta_range\fP).  This provides a
reasonable, general purpose stopping criterion.
"converge_search_4" averages the ratio over "sigma_beta_n_values"
cycles (see \fBcs4_delta_range\fP).  This is preferred when
converge_search_3 produces many similar classes.
.IP "\fBinitial_cycles_p\fP = true"
If true, perform base_cycle in initialize_parameters.
false is used only for testing.
.IP "\fBsave_compact_p\fP = true"
true saves classifications as machine-dependent binary
(.results-bin & .chkpt-bin).
false saves as ASCII text (.results & .chkpt).
.IP "\fBread_compact_p\fP = true"
true reads classifications as machine-dependent binary
(.results-bin & .chkpt-bin).
false reads as ASCII text (.results & .chkpt).
.IP "\fBrandomize_random_p\fP = true"
false seeds lrand48, the pseudo-random number function, with 1
to give repeatable test cases.  true uses the universal time clock
as the seed, giving semi-random searches.
For repeatable results, also see \fBforce_new_search_p\fP, 
\fBmin_report_period\fP and \fBstart_fn_type\fP.
.IP "\fBn_data\fP = 0"
With n_data = 0, the entire database is read from .db2.  
With n_data > 0, only this number of data are read. 
.IP "\fBhalt_range\fP = 0.5"
Passed to try_fn_type "converge".  With the "converge"
try_fn_type, convergence is halted when the larger of halt_range
and (halt_factor * current_log_marginal) exceeds the difference
between successive cycle values of the classification log_marginal
(clsf->log_a_x_h).  Decreasing this value may tighten the
convergence and increase the number of cycles.
.IP "\fBhalt_factor\fP = 0.0001"
Passed to try_fn_type "converge".  With the "converge"
try_fn_type, convergence is halted when the larger of halt_range
and (halt_factor * current_log_marginal) exceeds the difference
between successive cycle values of the classification log_marginal
(clsf->log_a_x_h).  Decreasing this value may tighten the
convergence and increase the number of cycles.
.IP "\fBrel_delta_range\fP = 0.0025"
Passed to try function "converge_search_3", which monitors the
ratio of log approx-marginal-likelihood of class statistics
with-respect-to the class hypothesis (class->log_a_w_s_h_j)
divided by the class weight (class->w_j), for each class.
"converge_search_3" halts convergence when the between-cycle
difference of this ratio, for every class, has remained below
"rel_delta_range" for "n_average" cycles.  Decreasing
"rel_delta_range" tightens the convergence and increases the
number of cycles.
.IP "\fBcs4_delta_range\fP = 0.0025"
Passed to try function "converge_search_4", which monitors the
ratio of (class->log_a_w_s_h_j)/(class->w_j), for each class,
averaged over "sigma_beta_n_values" convergence cycles.
"converge_search_4" halts convergence when the maximum difference
in average values of this ratio falls below "cs4_delta_range".
Decreasing "cs4_delta_range" tightens the convergence and
increases the number of cycles.
.IP "\fBn_average\fP = 3"
Passed to try functions "converge_search_3" and "converge".
The number of cycles for which the convergence criterion
must be satisfied for the trial to terminate.
.IP "\fBsigma_beta_n_values\fP = 6"
Passed to try_fn_type "converge_search_4".  The number of past 
values to use in computing sigma^2 (noise) and beta^2 (signal).
.IP "\fBmax_cycles\fP = 200"
This is the maximum number of cycles permitted for any one convergence 
of a classification, regardless of any other stopping criteria.  This
is very dependent upon your database and choice of model and
convergence parameters, but should be about twice the average number
of cycles reported in the screen dump and .log file.
.IP "\fBconverge_print_p\fP = false"
If true, the selected try function will print to the screen values
useful in specifying non-default values for \fBhalt_range\fP,
\fBhalt_factor\fP, \fBrel_delta_range\fP, \fBn_average\fP,
\fBsigma_beta_n_values\fP, and \fBrange_factor\fP.
.IP "\fBforce_new_search_p\fP = true"
If true, will ignore any previous search results, discarding the 
existing .search and .results[-bin] files after confirmation by the 
user; if false, will continue the search using the 
existing .search and .results[-bin] files. 
For repeatable results, also see \fBmin_report_period\fP,
\fBstart_fn_type\fP and \fBrandomize_random_p\fP.
.IP "\fBcheckpoint_p\fP = false"
If true, checkpoints of the current classification will be written
every "min_checkpoint_period" seconds, with file extension
\&.chkpt[-bin].  This is only useful for very large classifications.
.IP "\fBmin_checkpoint_period\fP = 10800"
If checkpoint_p = true, the checkpointed classification will be
written this often, in seconds (default = 3 hours).
.IP "\fBreconverge_type\fP = """""
Can be either "chkpt" or "results".  If "checkpoint_p" = true and
"reconverge_type" = "chkpt", then continue convergence of the
classification contained in <...>.chkpt[-bin].  If "checkpoint_p"
= false and "reconverge_type" = "results", continue convergence of
the best classification contained in <...>.results[-bin].  
.IP "\fBscreen_output_p\fP = true"
If false, no output is directed to the screen.  Assuming 
log_file_p = true, output will be directed to the log file only.
.IP "\fBbreak_on_warnings_p\fP = true"
The default value asks the user whether or not to continue, when data
definition warnings are found.  If specified as false, then
\fBAutoClass\fP will continue, despite warnings -- the warning will
continue to be output to the terminal and the log file.
.IP "\fBfree_storage_p\fP = true"
The default value tells \fBAutoClass\fP to free the majority of its
allocated storage.  This is not required, and on the DEC Alpha it has
caused a core dump [is this still true?].  If specified as false,
\fBAutoClass\fP will not attempt to free storage.
.SS "HOW TO GET AUTOCLASS C TO PRODUCE REPEATABLE RESULTS"
In some situations, repeatable classifications are required: comparing
basic \fBAutoClass C\fP integrity on different platforms, porting
\fBAutoClass C\fP to a new platform, etc.  In order to accomplish this
two things are necessary: 1) the same random number generator must be
used, and 2) the search parameters must be specified properly.
.PP
Random Number Generator. This implementation of \fBAutoClass C\fP uses the
Unix srand48/lrand48 random number generator which generates
pseudo-random numbers using the well-known linear congruential
algorithm and 48-bit integer arithmetic.  lrand48() returns
non-negative long integers uniformly distributed over the interval
[0, 2**31).
.PP
Search Parameters.
The following .s-params file parameters should be specified:
.nf

force_new_search_p = true
start_fn_type   "block"
randomize_random_p = false
;; specify the number of trials you wish to run
max_n_tries = 50
;; specify a time greater than duration of run
min_report_period = 30000

.fi
Note that no current best classification reports will be produced.
Only a final classification summary will be output.
.SH CHECKPOINTING
With very large databases there is a significant probability of a
system crash during any one classification try.  Under such
circumstances it is advisable to take the time to checkpoint the
calculations for possible restart.
.PP
Checkpointing is initiated by specifying "\fBcheckpoint_p\fP = true"
in the ".s-params" file.  This causes the inner convergence step to
save a copy of the classification onto the checkpoint file each time
the classification is updated, providing a certain period of time has
elapsed.  The file extension is ".chkpt[-bin]".
.PP
Each time AutoClass completes a cycle, a "." is output to the screen
to provide you with information to be used in setting the
\fBmin_checkpoint_period\fP value (default 10800 seconds or 3 hours).
There is obviously a trade-off between frequency of checkpointing and
the probability that your machine may crash, since the repetitive
writing of the checkpoint file will slow the search process.
.PP
Restarting AutoClass Search:
.PP
To recover the classification and continue the search after rebooting
and reloading AutoClass, specify \fBreconverge_type\fP = "chkpt" in
the ".s-params" file (specify \fBforce_new_search_p\fP as false).
.PP
AutoClass will reload the appropriate database and models, provided
there has been no change in their filenames since the time they were
loaded for the checkpointed classification run.  The ".s-params" file
contains any non-default arguments that were provided to the original
call.
.PP
In the beginning of a search, before \fBstart_j_list\fP has been
emptied, it will be necessary to trim the original list to what would
have remained in the crashed search.  This can be determined by
looking at the ".log" file to determine what values were already used.
If the \fBstart_j_list\fP has been emptied, then an empty
\fBstart_j_list\fP should be specified in the ".s-params" file.  This
is done either by
.sp
        \fBstart_j_list\fP =
.sp
or
.sp
        \fBstart_j_list\fP = -9999 
.sp
Here is a set of scripts to demonstrate checkpointing:
.nf

autoclass -search data/glass/glassc.db2 data/glass/glass-3c.hd2 \\
     data/glass/glass-mnc.model data/glass/glassc-chkpt.s-params

Run 1)
  ## glassc-chkpt.s-params
  max_n_tries = 2
  force_new_search_p = true
  ## --------------------
  ;; run to completion

Run 2)
  ## glassc-chkpt.s-params
  force_new_search_p = false
  max_n_tries = 10
  checkpoint_p = true
  min_checkpoint_period = 2
  ## --------------------
  ;; after 1 checkpoint, ctrl-C to simulate cpu crash

Run 3)
  ## glassc-chkpt.s-params
  force_new_search_p = false
  max_n_tries = 1
  checkpoint_p = true
  min_checkpoint_period = 1
  reconverge_type = "chkpt"
  ## --------------------
  ;; checkpointed trial should finish 

.fi
.SH "OUTPUT FILES"
The standard reports are 
.IP 1) 
Attribute influence values: presents the relative influence or 
significance of the data's attributes both globally (averaged over
all classes), and locally (specifically for each class). A
heuristic
for relative class strength is also listed;
.IP 2) 
Cross-reference by case (datum) number: lists the primary class 
probability for each datum, ordered by case number.  When
report_mode = "data", additional lesser class probabilities 
(greater than or equal to 0.001) are listed for each datum;
.IP 3) 
Cross-reference by class number: for each class the primary class
probability and any lesser class probabilities (greater than or
equal to 0.001) are listed for each datum in the class, ordered by 
case number. It is also possible to list, for each datum, the values
of attributes which you select.
.PP
The attribute influence values report attempts to provide relative
measures of the "influence" of the data attributes on the classes
found by the classification.  The normalized class strengths, the
normalized attribute influence values summed over all classes, and the
individual influence values (I[jkl]) are all relative measures only.
They carry somewhat more meaning than a simple rank ordering, but
nothing approaching absolute values, and should be interpreted
accordingly.
.PP
The reports are output to files whose names and pathnames are taken
from the ".r-params" file pathname.  The report file types (extensions)
are:
.IP "\fBinfluence values report\fP"
"influ-o-text-\fIn\fP" or "influ-no-text-\fIn\fP"
.IP "\fBcross-reference by case\fP"
"case-text-\fIn\fP"
.IP "\fBcross-reference by class\fP"
"class-text-\fIn\fP" 
.PP
or, if report_mode is overridden to "data":
.IP "\fBinfluence values report\fP"
"influ-o-data-\fIn\fP" or "influ-no-data-\fIn\fP"
.IP "\fBcross-reference by case\fP"
"case-data-\fIn\fP"
.IP "\fBcross-reference by class\fP"
"class-data-\fIn\fP" 
.PP
where \fIn\fP is the classification number from the "results" file.
The first or best classification is numbered 1, the next best 2, etc.
The default is to generate reports only for the best classification in
the "results" file.  You can produce reports for other saved
classifications by using report params keywords \fBn_clsfs\fP and
\fBclsf_n_list\fP.  The "influ-o-text-\fIn\fP" file type is the
default (\fBorder_attributes_by_influence_p\fP = true), and lists each
class's attributes in descending order of attribute influence value.
If the value of \fBorder_attributes_by_influence_p\fP is overridden to
be false in the <...>.r-params file, then each class's attributes will
be listed in ascending order by attribute number.  The extension of
the file generated will be "influ-no-text-\fIn\fP".  This method of
listing facilitates the visual comparison of attribute values between
classes.
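.PP
For instance, to generate reports for the two best saved
classifications, with each class's attributes listed in ascending
order by attribute number, the ".r-params" file could include:
.sp
.nf
        n_clsfs = 2
        order_attributes_by_influence_p = false
.fi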
.PP
For example, this command:
.sp
.nf
	autoclass -reports sample/imports-85c.results-bin \\
		sample/imports-85c.search sample/imports-85c.r-params
.fi
.sp
with this line in the ".r-params" file:
.sp
	xref_class_report_att_list = 2, 5, 6
.sp
will generate these output files:
.sp
.nf
	imports-85c.influ-o-text-1
	imports-85c.case-text-1
	imports-85c.class-text-1
.fi
.PP
The \fBAutoClass C\fP reports provide the capability to compute sigma
class contour values for specified pairs of real valued attributes,
when generating the influence values report with the data option
(report_mode = "data").  Note that sigma class contours are not
generated from discrete type attributes.
.PP
The sigma contours are the two dimensional equivalent of n-sigma error
bars in one dimension.  Specifically, for two independent attributes
the n-sigma contour is defined as the ellipse where
.PP
((x - xMean) / xSigma)^2 + ((y - yMean) / ySigma)^2 == n 
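.PP
For instance (with purely illustrative numbers), if xMean = 68,
xSigma = 16.3, yMean = 90 and ySigma = 2.5, then by this definition the
2-sigma contour is the ellipse of points (x, y) satisfying
.PP
((x - 68) / 16.3)^2 + ((y - 90) / 2.5)^2 == 2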
.PP
With covariant attributes, the n-sigma contours are defined
identically, in the rotated coordinate system of the distribution's
principal axes.  Thus independent attributes give ellipses oriented
parallel with the attribute axes, while the axes of sigma contours of
covariant attributes are rotated about the center determined by the
means.  In either case the sigma contour represents a line where the
class probability is constant, irrespective of any other class
probabilities.
.PP
With three or more attributes the n-sigma contours become
k-dimensional ellipsoidal surfaces.  This code takes advantage of the
fact that the parallel projection of an n-dimensional ellipsoid, onto
any 2-dim plane, is bounded by an ellipse.  In this simplified case of
projecting the single sigma ellipsoid onto the coordinate planes, it
is also true that the 2-dim covariances of this ellipse are equal to
the corresponding elements of the n-dim ellipsoid's covariances.  The
eigensystem of the 2-dim covariance then gives the variances
w.r.t. the principal components of the ellipse, and the rotation that
aligns it with the data.  This represents the best way to display a
distribution in the marginal plane.
.PP
To get contour values, set the keyword \fBsigma_contours_att_list\fP
to a list of real valued attribute indices (from .hd2 file), and
request an influence values report with the data option.  For example,
.sp
.nf
	report_mode = "data"
	sigma_contours_att_list = 3, 4, 5, 8, 15
.fi
.SS "OUTPUT REPORT PARAMETERS"
The contents of the output report are controlled by the ".r-params"
file.  In this file, an empty line or a line starting with one of
these characters is treated as a comment: "#", "!", or ";".  The
parameter name and its value can be separated by an equal sign, a
space, or a tab:
.sp
.nf
	n_clsfs 1
	n_clsfs = 1
	n_clsfs<tab>1
.fi
.sp
Spaces are ignored if "=" or "<tab>" are used as separators.  Note
there are no trailing semicolons.
.PP
The following are the allowed parameters and their default values
(a combined example file appears after this list):
.IP "\fBn_clsfs\fP = 1"
number of clsfs in the .results file for which to generate reports,
starting with the first or "best".
.IP "\fBclsf_n_list\fP = "
if specified, this is a one-based index list of clsfs in the clsf
sequence read from the .results file.  It overrides "n_clsfs".
For example: 
.sp
	clsf_n_list = 1, 2 
.sp
will produce the same output as
.sp
	n_clsfs = 2
.sp
but
.sp
	clsf_n_list = 2
.sp
will only output the "second best" classification report.
.IP "\fBreport_type\fP = \"all\""
type of reports to generate: "all", "influence_values", "xref_case", or
"xref_class".
.IP "\fBreport_mode\fP = \"text\""
mode of reports to generate. "text" is formatted text layout.  "data"
is numerical -- suitable for further processing.
.IP "\fBcomment_data_headers_p\fP = false"
the default value does not insert # in column 1 of most 
report_mode = "data" header lines.  If specified as true, the comment 
character will be inserted in most header lines.
.IP "\fBnum_atts_to_list\fP = "
if specified, the number of attributes to list in the influence values
report.  If not specified, \fIall\fP attributes will be listed.
(e.g. "num_atts_to_list = 5")
.IP "\fBxref_class_report_att_list\fP = "
if specified, a list of attribute numbers (zero-based) whose values will
be output in the "xref_class" report along with the case probabilities.
If not specified, no attribute values will be output.
(e.g. "xref_class_report_att_list = 1, 2, 3")
.IP "\fBorder_attributes_by_influence_p\fP = true"
The default value lists each class's attributes in descending order of
attribute influence value, and uses ".influ-o-text-n" as the
influence values report file type.  If specified as false, then each 
class's attributes will be listed in ascending order by attribute number.  
The extension of the file generated will be "influ-no-text-n".
.IP "\fBbreak_on_warnings_p\fP = true"
The default value asks the user whether to continue or not when data
definition warnings are found.  If specified as false, then \fBAutoClass\fP
will continue despite warnings -- the warnings will still be
output to the terminal.
.IP "\fBfree_storage_p\fP = true"
The default value tells \fBAutoClass\fP to free the majority of its
allocated storage.  This is not required, and in the case of the DEC
Alpha causes a core dump [is this still true?].  If specified as
false, \fBAutoClass\fP will not attempt to free storage.
.IP "\fBmax_num_xref_class_probs\fP = 5"
Determines how many lesser class probabilities will be printed for the
case and class cross-reference reports.  The default is to print the
most probable class probability value and up to 4 lesser class
probabilities.  Note this is true for both the "text" and "data" class
cross-reference reports, but only true for the "data" case cross-
reference report.  The "text" case cross-reference report only has the
most probable class probability.
.IP "\fBsigma_contours_att_list\fP = "
If specified, a list of real valued attribute indices (from the .hd2 file)
will be used to compute sigma class contour values when generating the
influence values report with the data option (report_mode = "data").
If not specified, there will be no sigma class contour output.
(e.g. "sigma_contours_att_list = 3, 4, 5, 8, 15")
.SH "INTERPRETATION OF AUTOCLASS RESULTS"
.br
.sp
.SS "WHAT HAVE YOU GOT?"
Now you have run \fBAutoClass\fP on your data set -- what have you got?
Typically, the \fBAutoClass\fP search procedure finds many classifications,
but only saves the few best.  These are now available for inspection
and interpretation.  The most important indicator of the relative
merits of these alternative classifications is the Log total posterior
probability value.  Note that since the probability lies between 0 and
1, the corresponding Log probability is negative and ranges from 0 to
negative infinity.  Raising e to the power of the difference between
these Log probability values gives the relative probability of the
alternative classifications.  So a difference of, say, 100 implies that
one classification is e^100 ~= 10^43 times more likely than the other.
However, these numbers can be very misleading, since they give the
relative probability of alternative classifications under the
\fBAutoClass\fP \fIassumptions\fP.
.SS "ASSUMPTIONS"
Specifically, the most important \fBAutoClass\fP assumptions are the use of normal
models for real variables, and the assumption of independence of attributes
within a class.  Since these assumptions are often violated in practice, the
difference in posterior probability of alternative classifications can be
partly due to one classification being closer to satisfying the assumptions
than another, rather than to a real difference in classification quality.
Another source of uncertainty about the utility of Log probability values is 
that they do not take into account any specific prior knowledge the user may 
have about the domain.  This means that it is often worth looking at 
alternative classifications to see if you can interpret them, but it is worth 
starting from the most probable first.  Note that if the Log probability value
is much greater than that for the one class case, it is saying that there is 
overwhelming evidence for \fIsome\fP structure in the data, and part of this 
structure has been captured by the \fBAutoClass\fP classification.
.SS "INFLUENCE REPORT "
So you have now picked a classification you want to examine, based on
its Log probability value; how do you examine it?  The first thing to
do is to generate an "influence" report on the classification using
the report generation facilities documented in
\fI/usr/share/doc/autoclass/reports-c.text\fP.  An influence report is
designed to summarize the important information buried in the
\fBAutoClass\fP data structures.
.PP
The first part of this report gives the heuristic class "strengths".
Class "strength" is here defined as the geometric mean probability that
any instance "belonging to" the class would have been generated from the
class probability model.  It thus provides a heuristic measure of how
strongly each class predicts "its" instances.
.PP
The second part is a listing of the overall "influence" of each of the
attributes used in the classification.  These give a rough heuristic
measure of the relative importance of each attribute in the
classification.  Attribute "influence values" are a class probability
weighted average of the "influence" of each attribute in the classes, as
described below.
.PP
The next part of the report is a summary description of each of the
classes.  The classes are arbitrarily numbered from 0 up to n, in order
of descending class weight.  A class weight of, say, 34.1 means that the
weighted sum of membership probabilities for that class is 34.1.  Note that
a class weight of 34 does not necessarily mean that 34 cases belong to
that class, since many cases may have only partial membership in that
class.  Within each class, attributes or attribute sets are ordered by
the "influence" of their model term.
.SS "CROSS ENTROPY "
A commonly used measure of the divergence between two probability
distributions is the cross entropy: the sum over all possible values x,
of P(x|c...)*log[P(x|c...)/P(x|g...)], where c... and g... define the
distributions.  It ranges from zero, for identical distributions, to
infinity for distributions placing probability 1 on differing values of
an attribute.  With conditionally independent terms in the probability
distributions, the cross entropy can be factored to a sum over these
terms.  These factors provide a measure of the corresponding modeled
attribute's influence in differentiating the two distributions.
.PP
We define the modeled term's "influence" on a class to be the cross
entropy term for the class distribution w.r.t. the global class
distribution of the single class classification.  "Influence" is thus a
measure of how strongly the model term helps differentiate the class
from the whole data set.  With independently modeled attributes, the
influence can legitimately be ascribed to the attribute itself.  With
correlated or covariant attribute sets, the cross entropy factor is a
function of the entire set, and we distribute the influence value
equally over the modeled attributes.
.SS "ATTRIBUTE INFLUENCE VALUES"
In the "influence" report on each class, the attribute parameters for
that class are given in order of highest influence value for the model
term attribute sets.  Only the first few attribute sets usually have
significant influence values.  If an influence value drops below about
20% of the highest value, then it is probably not significant, but all
attribute sets are listed for completeness.  In addition to the
influence value for each attribute set, the values of the attribute
set parameters in that class are given along with the corresponding
"global" values.  The global values are computed directly from the
data independent of the classification.  For example, if the class
mean of attribute "temperature" is 90 with standard deviation of 2.5,
but the global mean is 68 with a standard deviation of 16.3, then this
class has selected out cases with much higher than average
temperature, and a rather small spread in this high range.  Similarly,
for discrete attribute sets, the probability of each outcome in that
class is given, along with the corresponding global probability --
ordered by its significance: the absolute value of (log
{<local-probability> / <global-probability>}).  The sign of the
significance value shows the direction of change from the global
class.  This information gives an overview of how each class differs
from the average for all the data, in order of the most significant
differences.
.SS "CLASS AND CASE REPORTS"
Having gained a description of the classes from the "influence"
report, you may want to follow-up to see which classes your favorite
cases ended up in.  Conversely, you may want to see which cases belong
to a particular class.  For this kind of cross-reference information
two complementary reports can be generated.  These are more fully
documented in \fI/usr/share/doc/autoclass/reports-c.text\fP. The
"class" report, lists all the cases which have significant membership
in each class and the degree to which each such case belongs to that
class.  Cases whose class membership is less than 90% in the current
class have their other class membership listed as well.  The cases
within a class are ordered in increasing case number.  The alternative
"cases" report states which class (or classes) a case belongs to, and
the membership probability in the most probable class.  These two
reports allow you to find which cases belong to which classes or the
other way around.  If nearly every case has close to 99% membership in
a single class, then it means that the classes are well separated,
while a high degree of cross-membership indicates that the classes are
heavily overlapped.  Highly overlapped classes are an indication that
the idea of classification is breaking down, and that a group of
mutually highly overlapped classes, a kind of meta-class, is probably
a better way of understanding the data.
.SS "COMPARING CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS"
The class weight, given as the class probability parameter, is
essentially the sum, over all data instances, of the normalized
probability that the instance is a member of the class.  It is
probably an error on our part that we format this number as an integer
in the report, rather than emphasizing its real nature.  You will find
the actual real value recorded as the w_j parameter in the class_DS
structures on any .results[-bin] file.
.PP
The .case and .class reports give probabilities that cases are members
of classes.  Any assignment of cases to classes requires some decision
rule.  The maximum probability assignment rule is often implicitly
assumed, but it cannot be expected that the resulting partition sizes
will equal the class weights unless nearly all class membership
probabilities are effectively one or zero.  With non-1/0 membership
probabilities, matching the class weights requires summing the
probabilities.
.PP
In addition, there is the question of completeness of the EM
(expectation maximization) convergence.  EM alternates between
estimating class parameters and estimating class membership
probabilities.  These estimates converge on each other, but never
actually meet.  \fBAutoClass\fP implements several convergence algorithms
with alternate stopping criteria using appropriate parameters in 
the .s-params file.  Proper setting of these parameters, to get reasonably
complete and efficient convergence, may require experimentation.
.SS "ALTERNATIVE CLASSIFICATIONS "
In summary, the various reports that can be generated give you a way
of viewing the current classification.  It is usually a good idea to
look at alternative classifications even though they do not have the
best Log probability values.  These other classifications usually
have classes that correspond closely to strong classes in other
classifications, but can differ in the weak classes.  The "strength"
of a class within a classification can usually be judged by how
dramatically the highest influence value attributes in the class
differ from the corresponding global attributes.  If none of the
classifications seem quite satisfactory, it is always possible to run
\fBAutoClass\fP again to generate new classifications.
.SS "WHAT NEXT?"
Finally, the question of what to do after you have found an
insightful classification arises.  Usually, classification is a
preliminary data analysis step for examining a set of cases (things,
examples, etc.) to see if they can be grouped so that members of the
group are "similar" to each other.  \fBAutoClass\fP gives such a grouping
without the user having to define a similarity measure.  The built-in
"similarity" measure is the mutual predictiveness of the cases.  The
next step is to try to "explain" why some objects are more like others
than those in a different group.  Usually, domain knowledge suggests
an answer.  For example, a classification of people based on income,
buying habits, location, age, etc., may reveal particular social
classes that were not obvious before the classification analysis.  To
obtain further insight into such classes, additional information,
such as the number of cars owned or which TV shows are watched, could
be collected and examined.  Longitudinal studies would give information
about how social classes arise and what influences their attitudes --
all of which is going way beyond the initial classification.
.SH PREDICTIONS
Classifications can be used to predict class membership for new
cases.  So in addition to possibly giving you some insight into the
structure behind your data, you can now use \fBAutoClass\fP directly
to make predictions, and compare \fBAutoClass\fP to other learning
systems.
.PP
This technique for predicting class probabilities is applicable to all
attributes, regardless of data type/sub_type or likelihood model term type.
.PP
In the event that the class membership of a data case does not exceed
0.0099999 for any of the "training" classes, the following message will appear
in the screen output for each case:
.sp
        xref_get_data: case_num xxx => class 9999
.sp
Class 9999 members will appear in the "case" and "class" cross-reference 
reports with a class membership of 1.0.
.PP
Cautionary Points:
.PP
The usual way of using \fBAutoClass\fP is to put all of your data in a
data_file, describe that data with model and header files, and run
"autoclass -search".  Now, instead of one data_file you will have two,
a training_data_file and a test_data_file.
.PP
It is most important that both databases have the same \fBAutoClass\fP
internal representation.  Should this not be true, \fBAutoClass\fP
will exit or, in some situations, possibly crash.  The prediction
mode is designed to direct the user toward conforming to this
requirement.
.PP
Preparation:
.PP
Prediction requires having a training classification and a test
database.  The training classification is generated by the running of
"autoclass -search" on the training data_file
("data/soybean/soyc.db2"), for example:
.sp
.nf
    autoclass -search data/soybean/soyc.db2 data/soybean/soyc.hd2 \\
        data/soybean/soyc.model data/soybean/soyc.s-params
.fi
.sp
This will produce "soyc.results-bin" and "soyc.search".  Then create a
"reports" parameter file, such as "soyc.r-params" (see
\fI/usr/share/doc/autoclass/reports-c.text\fP), and run
\fBAutoClass\fP in "reports" mode, such as:
.sp
.nf
    autoclass -reports data/soybean/soyc.results-bin \\
        data/soybean/soyc.search data/soybean/soyc.r-params
.fi
.sp
This will generate class and case cross-reference files, and an influence
values file.  The file names are based on the ".r-params" file name:
.sp
.nf
        data/soybean/soyc.class-text-1
        data/soybean/soyc.case-text-1
        data/soybean/soyc.influ-o-text-1
.fi
.sp
These will describe the classes found in the training_data_file.
Now this classification can be used to predict the probabilistic class
membership of the test_data_file cases ("data/soybean/soyc-predict.db2")
in the training_data_file classes.
.sp
.nf
    autoclass -predict data/soybean/soyc-predict.db2 \\
        data/soybean/soyc.results-bin data/soybean/soyc.search \\
        data/soybean/soyc.r-params
.fi
.sp
This will generate class and case cross-reference files for the
test_data_file cases predicting their probabilistic class memberships
in the training_data_file classes.  The file names are based on the
".db2" file name:
.sp
.nf
        data/soybean/soyc-predict.class-text-1
        data/soybean/soyc-predict.case-text-1
.fi
.sp
.SH "SEE ALSO"
\fBAutoClass\fP is documented fully here:
.LP
.I /usr/share/doc/autoclass/introduction-c.text
Guide to the documentation 
.LP
.I /usr/share/doc/autoclass/preparation-c.text
How to prepare data for use by AutoClass
.LP
.I /usr/share/doc/autoclass/search-c.text
How to run AutoClass to find classifications.
.LP
.I /usr/share/doc/autoclass/reports-c.text
How to examine the classification in various ways.
.LP
.I /usr/share/doc/autoclass/interpretation-c.text
How to interpret AutoClass results.
.LP
.I /usr/share/doc/autoclass/checkpoint-c.text
Protocols for running a checkpointed search.
.LP
.I /usr/share/doc/autoclass/prediction-c.text
Use classifications to predict class membership for new cases.  
.PP
These provide supporting documentation:
.LP
.I /usr/share/doc/autoclass/classes-c.text
What classification is all about, for beginners.
.LP
.I /usr/share/doc/autoclass/models-c.text
Brief descriptions of the model term implementations.
.PP
The mathematical theory behind \fBAutoClass\fP is explained in these
documents:
.LP
.I /usr/share/doc/autoclass/kdd-95.ps
Postscript file containing:
P. Cheeseman, J. Stutz, "Bayesian Classification (AutoClass):
Theory and Results", in "Advances in Knowledge Discovery and
Data Mining", Usama M. Fayyad, Gregory Piatetsky-Shapiro,
Padhraic Smyth, & Ramasamy Uthurusamy, Eds., AAAI Press / The MIT Press,
Menlo Park, 1996.
.LP
.I /usr/share/doc/autoclass/tr-fia-90-12-7-01.ps
Postscript file containing:
R. Hanson, J. Stutz, P. Cheeseman, "Bayesian Classification
Theory", Technical Report FIA-90-12-7-01, NASA Ames Research
Center, Artificial Intelligence Branch, May 1991
(The figures are not included, since they were inserted by
"cut-and-paste" methods into the original "camera-ready"
copy.)
.SH AUTHORS
.nf
Dr. Peter Cheeseman
Principal Investigator - NASA Ames, Computational Sciences Division
cheesem@ptolemy.arc.nasa.gov

John Stutz
Research Programmer - NASA Ames, Computational Sciences Division
stutz@ptolemy.arc.nasa.gov

Will Taylor
Support Programmer - NASA Ames, Computational Sciences Division
taylor@ptolemy.arc.nasa.gov
.fi
.\" .PP
.\" This manual page was written by James R. Van Zandt <jrv@debian.org>,
.\" for the Debian GNU/Linux system (but may be used by others).
.SH "SEE ALSO"
.BR multimix (1).