File: search-c.text

SEARCHING FOR GOOD CLASSIFICATIONS

 1.0 What Results Are
 2.0 What Results Mean
 3.0 How It Works
 4.0 When To Stop
 5.0 What Gets Returned
 6.0 How To Get Started
 7.0 Status Reports
 8.0 Search Variations
 9.0 How Many Classes?
10.0 Do I Have Enough Memory and Disk Space?
11.0 Just How Slow Is It?
12.0 Changing Filenames in a Saved Classification File
13.0 Those Parameters Again -- With Annotations
14.0 How to get AutoClass C to Produce Repeatable Results

Now that you have succeeded in describing your data with a header file
and model file that pass the AUTOCLASS -SEARCH <...> input checks,
you have entered the search domain, where Autoclass classifies
your data.  (At last!)

The main function to use in finding a good classification of your data
is AUTOCLASS -SEARCH, and using it will take most of the computation
time.  Searches are invoked with:

% autoclass -search <.db2 file path> <.hd2 file path> <.model file path> 
         <.s-params file path>

The sample-run (autoclass-c/sample/) that comes with Autoclass shows
some sample searches, and browsing these is probably the fastest way
to get familiar with how to do searches.  The test data sets located
under autoclass-c/data/ will show you some other header (.hd2), model
(.model), and search params (.s-params) file setups.  The remainder 
of this section describes how to do searches in somewhat more detail.

The capitalized tokens below are generally search params file parameters.


1.0 WHAT RESULTS ARE

Autoclass is looking for the best classification(s) of the data it can find.
A classification is composed of: 
 1) a set of classes, each of which is described by a set of class 
    parameters, which specify how the class is distributed along the 
    various attributes.  For example, "height normally distributed with 
    mean 4.67 ft and standard deviation .32 ft", 
 2) a set of class weights, describing what percentage of cases are 
    likely to be in each class. 
 3) a probabilistic assignment of cases in the data to these classes.  
    I.e. for each case, the relative probability that it is a member of
    each class.  

As a strictly Bayesian system (accept no substitutes!), the quality
measure Autoclass uses is the total probability that, had you known
nothing about your data or its domain, you would have found this set of
data generated by this underlying model.  This includes the prior
probability that the "world" would have chosen this number of classes,
this set of relative class weights, and this set of parameters for each
class, and the likelihood that such a set of classes would have generated
this set of values for the attributes in the data cases.

These probabilities are typically very small, in the range of e^-30000,
and so are usually expressed in exponential notation.
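
To see why values this small are carried around as exponents rather
than as raw probabilities, here is a small Python sketch.  It is an
illustration only, not AutoClass code, and the two log values are
invented.

```python
import math

# Hypothetical log marginal probabilities of two classifications
# (natural logs; the numbers are invented for illustration).
log_p_a = -30000.0
log_p_b = -30012.5

# The raw probabilities underflow double precision to exactly zero.
raw_a = math.exp(log_p_a)
raw_b = math.exp(log_p_b)

# Their relative probability is still easy to get from the logs.
log_ratio = log_p_a - log_p_b
ratio = math.exp(log_ratio)

print(raw_a, raw_b)
print(log_ratio, ratio)
```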

2.0 WHAT RESULTS MEAN

It is important to remember that all of these probabilities are GIVEN
that the real model is in the model family that Autoclass has restricted
its attention to.  If Autoclass is looking for Gaussian classes and the
real classes are Poisson, then the fact that Autoclass found 5 Gaussian
classes may not say much about how many Poisson classes there really are.

The relative probability between different classifications found can be
very large, like e^1000, so the very best classification found is
usually overwhelmingly more probable than the rest (and overwhelmingly
less probable than any better classifications as yet undiscovered).  If
Autoclass should manage to find two classifications that are within
about exp(5-10) of each other (i.e. one is only about 150 to 22,000
times more probable than the other) then you should consider them to
be about equally probable, as our computation is usually not more
accurate than this (and sometimes much less).
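
The rule of thumb above can be sketched in Python as a comparison of
two log probabilities.  The function name and the cutoff of 10 (the
upper end of the "exp(5-10)" range) are assumptions made for this
example; AutoClass itself does not provide such a function.

```python
import math

def compare_classifications(log_p_first, log_p_second, tolerance=10.0):
    """Return (log_ratio, verdict) for two log marginal probabilities.

    tolerance is a hypothetical cutoff in natural-log units.
    """
    log_ratio = log_p_first - log_p_second
    if abs(log_ratio) <= tolerance:
        verdict = "about equally probable"
    elif log_ratio > 0:
        verdict = "first is clearly more probable"
    else:
        verdict = "second is clearly more probable"
    return log_ratio, verdict

# What exp(5) and exp(10) amount to as plain ratios.
print(round(math.exp(5)), round(math.exp(10)))
print(compare_classifications(-30000.0, -30007.0))
print(compare_classifications(-30000.0, -31000.0))
```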

3.0 HOW IT WORKS

Autoclass repeatedly creates a random classification and then tries to
massage this into a high probability classification through local changes,
until it converges to some "local maximum".  It then remembers what it
found and starts over again, continuing until you tell it to stop.  Each
effort is called a "try", and the computed probability is intended to
cover the whole volume in parameter space around this maximum, rather
than just the peak.

The standard approach to massaging is to 
 1) Compute the probabilistic class memberships of cases using the class
    parameters and the implied relative likelihoods.
 2) Using the new class members, compute class statistics (like mean)
    and revise the class parameters.
and repeat till they stop changing.  There are three available convergence
algorithms: "converge_search_3" (the default),  "converge_search_4"
and "converge".  Their specification is controlled by search params file 
parameter TRY_FN_TYPE.
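
The two-step loop can be sketched in Python for the simplest case, a
single real attribute with normally distributed classes.  This is a
plain EM-style illustration under stated assumptions, not the AutoClass
implementation: it uses no priors, never drops a class, and stops on a
simple "means stopped moving" test.  All names are invented for the
example.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def converge(data, means, sds, weights, max_cycles=200, tol=1e-6):
    """Alternate steps 1 and 2 until the class means stop changing."""
    means, sds, weights = list(means), list(sds), list(weights)
    n_classes = len(means)
    cycles = 0
    for cycles in range(1, max_cycles + 1):
        # Step 1: probabilistic class membership of every case.
        memberships = []
        for x in data:
            likes = [weights[j] * normal_pdf(x, means[j], sds[j])
                     for j in range(n_classes)]
            total = sum(likes)
            memberships.append([like / total for like in likes])
        # Step 2: weighted class statistics give revised parameters.
        shift = 0.0
        for j in range(n_classes):
            w_j = sum(m[j] for m in memberships)
            mean_j = sum(m[j] * x for m, x in zip(memberships, data)) / w_j
            var_j = sum(m[j] * (x - mean_j) ** 2
                        for m, x in zip(memberships, data)) / w_j
            shift = max(shift, abs(mean_j - means[j]))
            means[j] = mean_j
            sds[j] = max(math.sqrt(var_j), 1e-3)  # keep sd away from zero
            weights[j] = w_j / len(data)
        if shift < tol:
            break
    return means, sds, weights, cycles

rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)] +
        [rng.gauss(10.0, 1.0) for _ in range(200)])
# Deterministic starting point for the example: the two extreme cases.
means, sds, weights, cycles = converge(
    data, [min(data), max(data)], [5.0, 5.0], [0.5, 0.5])
print(means, sds, weights, cycles)
```

A real try starts from a random classification; the fixed starting
point here only keeps the example repeatable.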


4.0 WHEN TO STOP

You can tell AUTOCLASS -SEARCH to stop by: 1) giving a MAX_DURATION (in
seconds) argument at the beginning; 2)  giving a MAX_N_TRIES (an
integer) argument at the beginning; or 3) by typing a "q" and <return>
after you have seen enough tries.  The MAX_DURATION and MAX_N_TRIES 
arguments are useful if you desire to run AUTOCLASS -SEARCH in batch 
mode.  If you are restarting AUTOCLASS -SEARCH from a previous search, 
the value of MAX_N_TRIES you provide, for instance 3, will tell the 
program to compute 3 more tries in addition to however many it has 
already done.  The same incremental behavior is exhibited by
MAX_DURATION.
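
For a batch run, the stopping limits go into the search params file.
The Python sketch below only renders override lines in the
"<parameter_name> = <parameter_value>" form described in section 13.0;
the helper name and the particular values are assumptions for
illustration.  In a real .s-params file these lines belong below the
marker line mentioned in that section.

```python
def format_s_params(overrides):
    """Render overrides as '<parameter_name> = <parameter_value>' lines."""
    lines = []
    for name, value in overrides.items():
        if isinstance(value, bool):
            text = "true" if value else "false"
        elif isinstance(value, str):
            text = '"%s"' % value
        else:
            text = str(value)
        lines.append("%s = %s" % (name, text))
    return "\n".join(lines) + "\n"

# Hypothetical batch settings: stop after 50 tries or one hour,
# and do not wait for a "q" on standard input.
batch = format_s_params({
    "max_n_tries": 50,
    "max_duration": 3600,
    "interactive_p": False,
})
print(batch)
```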

Deciding when to stop is a judgment call and it's up to you.  Since the
search includes a random component, there's always the chance that if
you let it keep going it will find something better.  So you need to
trade off how much better it might be with how long it might take to
find it.  The search status reports that are printed when a new best
classification is found are intended to provide you information to help
you make this tradeoff.

One clear sign that you should probably stop is if most of the
classifications found are duplicates of previous ones (flagged by "dup"
as they are found).  This should only happen for very small sets of data
or when fixing a very small number of classes, like two.

Our experience is that for moderately large to extremely large data sets
(~200 to ~10,000 cases), it is necessary to run AutoClass for at least
50 trials.


5.0 WHAT GETS RETURNED

Just before returning, AUTOCLASS -SEARCH will give short descriptions of
the best classifications found.  How many will be described can be
controlled with N_FINAL_SUMMARY.

By default AUTOCLASS -SEARCH will write out a number of files, both at
the end and periodically during the search (in case your system crashes
before it finishes).  These files will all have the same name
(taken from the search params pathname [<name>.s-params]), and differ
only in their file extensions.  If your search runs are very long and 
there is a possibility that your machine may crash, you can have 
intermediate "results" files written out.  These can be used to restart 
your search run with minimum loss of search effort.  See the 
documentation file checkpoint-c.text.

A ".log" file will hold a listing of most of what was printed to the
screen during the run, unless you set LOG_FILE_P to false to say you want
no such foolishness.  Unless RESULTS_FILE_P is false, a binary
".results-bin" file (the default) or an ASCII ".results" text file,
will hold the best classifications that were returned, and unless 
SEARCH_FILE_P is false, a ".search" file will hold the record of the
search tries. SAVE_COMPACT_P controls whether the "results" files are
saved as binary or ASCII text.

If the C global variable "G_safe_file_writing_p" is defined as TRUE in
"autoclass-c/prog/globals.c", the names of "results" files (those that
contain the saved classifications) are modified internally to account 
for redundant file writing.  If the search params file name is
"my_saved_clsfs" you will see the following "results" file names
(ignoring directories and pathnames for this example)

  SAVE_COMPACT_P = true --
  "my_saved_clsfs.results-bin"      - completely written file
  "my_saved_clsfs.results-tmp-bin"  - partially written file, renamed
                                      when complete

  SAVE_COMPACT_P = false --
  "my_saved_clsfs.results"          - completely written file
  "my_saved_clsfs.results-tmp"      - partially written file, renamed
                                      when complete

If checkpointing is being done, these additional names will appear

  SAVE_COMPACT_P = true --
  "my_saved_clsfs.chkpt-bin"        - completely written checkpoint file
  "my_saved_clsfs.chkpt-tmp-bin"    - partially written checkpoint file,
                                      renamed when complete

  SAVE_COMPACT_P = false --
  "my_saved_clsfs.chkpt"            - completely written checkpoint file
  "my_saved_clsfs.chkpt-tmp"        - partially written checkpoint file,
                                      renamed when complete
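
The naming pattern in the lists above can be summarized in a few lines
of Python.  The function is a hypothetical helper written for this
document, handy for example in a script that checks whether a run left
a partially written file behind; it is not part of AutoClass.

```python
def results_file_names(base, save_compact_p=True, checkpointing=False):
    """Return the file names listed above for a given s-params base name."""
    suffix = "-bin" if save_compact_p else ""
    names = {
        "results": base + ".results" + suffix,
        "results_tmp": base + ".results-tmp" + suffix,
    }
    if checkpointing:
        names["chkpt"] = base + ".chkpt" + suffix
        names["chkpt_tmp"] = base + ".chkpt-tmp" + suffix
    return names

print(results_file_names("my_saved_clsfs"))
print(results_file_names("my_saved_clsfs", save_compact_p=False,
                         checkpointing=True))
```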



6.0 HOW TO GET STARTED

The way to invoke AUTOCLASS -SEARCH is:

% autoclass -search <.db2 file path> <.hd2 file path> <.model file path> 
         <.s-params file path>

To restart a previous search, specify that FORCE_NEW_SEARCH_P has the
value false in the search params file, since its default is true.
Specifying false tells AUTOCLASS -SEARCH to try to find a previous
compatible search (<...>.results[-bin] & <...>.search) to continue
from, and will restart using it if found.  To force a new search 
instead of restarting an old one, give the parameter FORCE_NEW_SEARCH_P 
the value of true, or use the default.  If there is an existing
search (<...>.results[-bin] & <...>.search), the user will be asked
to confirm continuation since continuation will discard the
existing search.

If a previous search is continued, the message "RESTARTING SEARCH" will
be given instead of the usual "BEGINNING SEARCH".  It is generally
better to continue a previous search than to start a new one, unless
you are trying a significantly different search method, in which case
statistics from the previous search may mislead the current one.


7.0 STATUS REPORTS

A running commentary on the search will be printed to the screen and to
the log file (unless LOG_FILE_P is false).  Note that the ".log" file
will contain a listing of all default search params values, and the 
values of all params that are overridden.

After each try a very short report (only a few characters long) is 
given.  After each new best classification, a longer report is given, 
but no more often than MIN_REPORT_PERIOD (default is 30 seconds).


8.0 SEARCH VARIATIONS

AUTOCLASS -SEARCH by default uses a certain standard search method
or try function (TRY_FN_TYPE = "converge_search_3").  Two others are
also available: "converge_search_4" and "converge".  They are
provided in case your problem is one that may happen to benefit from 
them.  In general the default method will result in finding better 
classifications at the expense of a longer search time.  The default 
was chosen so as to be robust, giving even performance across many 
problems.  The alternatives to the default may do better on some 
problems, but may do substantially worse on others.  

"converge_search_3" uses an absolute stopping criterion (REL_DELTA_RANGE,
default value of 0.0025).  For each class, it tests the change between
successive convergence cycles in the log approximate-marginal-likelihood
of the class statistics with respect to the class hypothesis
(class->log_a_w_s_h_j) divided by the class weight (class->w_j).
Increasing this value loosens the convergence and reduces the number of
cycles.  Decreasing this value tightens the convergence and increases the
number of cycles.  N_AVERAGE (default value of 3) specifies how many
successive cycles must meet the stopping criterion before the trial
terminates.
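
One way to read this stopping rule is sketched below in Python.  It is
an interpretation of the description above, not a transcription of the
C code: in particular it compares the plain difference of the ratio
between cycles against REL_DELTA_RANGE, and whether the real code
normalizes that difference is not stated here.  The function and
argument names are invented.

```python
def cs3_should_stop(ratio_history, rel_delta_range=0.0025, n_average=3):
    """Decide whether a try has converged.

    ratio_history holds one list per cycle (oldest first) with the
    per-class values log_a_w_s_h_j / w_j.
    """
    if len(ratio_history) < n_average + 1:
        return False
    recent = ratio_history[-(n_average + 1):]
    for previous, current in zip(recent, recent[1:]):
        if len(previous) != len(current):
            return False  # a class dropped out; keep converging
        for before, after in zip(previous, current):
            if abs(after - before) >= rel_delta_range:
                return False
    return True

history = [
    [-10.0, -20.0],
    [-10.001, -20.001],
    [-10.0015, -20.0012],
    [-10.0016, -20.0013],
]
print(cs3_should_stop(history))
```

With the defaults, three successive quiet cycles are needed, so at
least four cycles of history must exist before the test can pass.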

"converge_search_4" uses an absolute stopping criterion (CS4_DELTA_RANGE,
default value of 0.0025).  For each class, it tests the slope, over
SIGMA_BETA_N_VALUES (default value 6) convergence cycles, of the log
approximate-marginal-likelihood of the class statistics with respect to
the class hypothesis (class->log_a_w_s_h_j) divided by the class weight
(class->w_j).  Increasing the value of CS4_DELTA_RANGE loosens the
convergence and reduces the number of cycles.  Decreasing this value
tightens the convergence and increases the number of cycles.
Computationally, this try function is more expensive than
"converge_search_3", but may prove useful if the computational "noise"
is significant compared to the variations in the computed values.
Key calculations are done in double precision floating point, and for
the largest data base we have tested so far (5,420 cases of 93
attributes), computational noise has not been a problem, although the
value of MAX_CYCLES needed to be increased to 400.

"converge" uses one of two absolute stopping criteria, which test the
variation of the classification (clsf) log_marginal (clsf->log_a_x_h)
delta between successive convergence cycles.  The larger of HALT_RANGE
(default value 0.5) and (HALT_FACTOR * current_clsf_log_marginal) is
used (default value of HALT_FACTOR is 0.0001).  Increasing these
values loosens the convergence and reduces the number of cycles. 
Decreasing these values tightens the convergence and increases the
number of cycles.  N_AVERAGE (default value of 3) specifies how many
cycles must meet the stopping criteria before the trial terminates.
This is a very approximate stopping criterion, but will give you some 
feel for the kind of classifications to expect.  It would be useful 
for "exploratory" searches of a data base.
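
A Python sketch of this rule follows.  It is a reading of the
description above with two assumptions flagged: the magnitude of the
log marginal is used when scaling by HALT_FACTOR (the log marginal is
negative, so the raw product would never exceed HALT_RANGE), and the
N_AVERAGE cycles are taken to be successive.  The names are invented.

```python
def converge_should_stop(log_marginals, halt_range=0.5,
                         halt_factor=0.0001, n_average=3):
    """Decide whether the "converge" try function would halt.

    log_marginals holds the classification log marginal after each
    cycle, oldest first.
    """
    if len(log_marginals) < n_average + 1:
        return False
    recent = log_marginals[-(n_average + 1):]
    for previous, current in zip(recent, recent[1:]):
        # Assumption: scale by the magnitude of the log marginal.
        threshold = max(halt_range, halt_factor * abs(current))
        if abs(current - previous) >= threshold:
            return False
    return True

trace = [-30000.0, -29990.0, -29989.0, -29988.5, -29988.2]
print(converge_should_stop(trace))
print(converge_should_stop(trace, halt_factor=0.0))
```

For a log marginal near -30000 the HALT_FACTOR term (about 3) is the
larger of the two, which is why this criterion is so approximate on
big data bases.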

The purpose of RECONVERGE_TYPE = "chkpt" is to complete an interrupted
classification by continuing from its last checkpoint.  The purpose of
RECONVERGE_TYPE = "results" is to attempt further refinement of the best
completed classification using a different value of TRY_FN_TYPE
("converge_search_3", "converge_search_4", "converge").  If MAX_N_TRIES
is greater than 1, then in each case, after the reconvergence has completed,
AutoClass will perform further search trials based on the parameter
values in the <...>.s-params file.

With the use of RECONVERGE_TYPE (default value ""), you may apply more 
than one try function to a classification.  Say you generate several
exploratory trials using TRY_FN_TYPE = "converge", and quit the search 
saving .search and .results[-bin] files.  Then you can begin another 
search with TRY_FN_TYPE = "converge_search_3", RECONVERGE_TYPE =
"results", and MAX_N_TRIES = 1.  This will result in the further
convergence of the best classification generated with TRY_FN_TYPE = 
"converge", with TRY_FN_TYPE = "converge_search_3".  When AutoClass
completes this search try, you will have an additional refined
classification.
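
The two-stage procedure just described amounts to two sets of
overrides, rendered below with a small Python helper.  Only the
settings named in the text are shown; the helper, the try count of 20
for the exploratory stage, and anything else a real run may need are
assumptions.  Since the results files take their name from the search
params file, the second stage would reuse the same .s-params file name
with the new contents.

```python
def render(overrides):
    """Render '<parameter_name> = <parameter_value>' lines."""
    return "".join("%s = %s\n" % item for item in overrides.items())

# Stage 1: quick exploratory trials (20 is an arbitrary example).
explore = {
    "try_fn_type": '"converge"',
    "max_n_tries": 20,
}

# Stage 2: refine the best saved classification with the default
# try function, doing exactly one try.
refine = {
    "try_fn_type": '"converge_search_3"',
    "reconverge_type": '"results"',
    "max_n_tries": 1,
}

stage_1_text = render(explore)
stage_2_text = render(refine)
print(stage_1_text)
print(stage_2_text)
```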

A good way to verify that an alternate TRY_FN_TYPE is generating a
well converged classification is to run AutoClass in prediction mode on the
same data used for generating the classification.  Then generate and compare
the corresponding case or class cross reference files for the original
classification and the prediction.  Small differences between these files are
to be expected, while large differences indicate incomplete convergence.
Differences between such file pairs should, on average and modulo class
deletions, decrease monotonically with further convergence.

The standard way to create a random classification to begin a try is
with the default value of "random" for START_FN_TYPE.  At this point
there are no other random start functions.  Specifying "block" for
START_FN_TYPE instead produces repeatable non-random searches.  That is
how the <..>.s-params files in the autoclass-c/data/.. sub-directories
are specified, and how development testing is done.

MAX_CYCLES controls the maximum number of convergence cycles that will
be performed in any one trial by the convergence functions.  Its default
value is 200.  The screen output shows a period (".") for each cycle
completed. If your search trials run for 200 cycles, then either your
data base is very complex (increase the value), or the TRY_FN_TYPE
is not adequate for the situation (try another of the available ones, and
use CONVERGE_PRINT_P to get more information on what is going on).

Specifying CONVERGE_PRINT_P to be true will generate a brief print-out
for each cycle which will provide information so that you can modify 
the default values of REL_DELTA_RANGE & N_AVERAGE for "converge_search_3";
CS4_DELTA_RANGE & SIGMA_BETA_N_VALUES for "converge_search_4"; and
HALT_RANGE, HALT_FACTOR, and N_AVERAGE for "converge".  Their default
values are given in the <..>.s-params files in the autoclass-c/data/..
sub-directories.


9.0 HOW MANY CLASSES?

Each new try begins with a certain number of classes and may end up
with a smaller number, as some classes may drop out of the convergence.
In general, you want to begin the try with some number of classes that
previous tries have indicated look promising, and you want to be sure 
you are fishing around elsewhere in case you missed something before.

N_CLASSES_FN_TYPE = "random_ln_normal" is the default way to make this
choice.  It fits a log normal to the number of classes (usually called "j"
for short) of the 10 best classifications found so far, and randomly
selects from that.  There is currently no alternative.
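
The idea behind "random_ln_normal" can be sketched in Python as
follows.  This shows the general technique (fit a normal to the logs of
j, then draw from the matching log-normal), not the AutoClass source:
the rounding, the lower bound of one class, and the floor on the spread
are all assumptions made for the example.

```python
import math
import random

def pick_n_classes(best_js, rng):
    """Draw a starting number of classes from a log-normal fitted to
    the class counts of the best classifications so far."""
    logs = [math.log(j) for j in best_js]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)
    sigma = max(math.sqrt(var), 0.1)  # hypothetical floor so draws vary
    return max(1, round(rng.lognormvariate(mu, sigma)))

# Class counts of ten hypothetical best classifications.
best_js = [7, 8, 8, 9, 10, 10, 11, 12, 12, 15]
rng = random.Random(1)
draws = [pick_n_classes(best_js, rng) for _ in range(100)]
print(sorted(draws))
```

Most draws land near the class counts that have already done well,
while the tails of the distribution keep some tries exploring
elsewhere.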

To start the game off, the default is to go down START_J_LIST for the
first few tries, and then switch to N_CLASSES_FN_TYPE.  If you believe
that the probable number of classes in your data base is say 75, then
instead of using the default value of START_J_LIST (2, 3, 5, 7, 10,
15, 25), specify something like 50, 60, 70, 80, 90, 100.

If one wants to always look for, say, three classes, one can use
FIXED_J and override the above.  Search status reports will describe 
what the current method for choosing j is.


10.0 DO I HAVE ENOUGH MEMORY AND DISK SPACE?

Internally, the storage requirements in the current system are of order
n_classes_per_clsf * (n_data + n_stored_clsfs * n_attributes * 
n_attribute_values).  This depends on the number of cases, the number 
of attributes, the values per attribute (use 2 if a real value), and 
the number of classifications stored away for comparison to see if 
others are duplicates -- controlled by MAX_N_STORE (default value = 10).  
The search process does not itself consume significant memory, but 
storage of the results may do so.
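
As a quick worked example of the storage formula, here it is in Python.
The result is a unitless order-of-magnitude figure, not a byte count,
and the numbers plugged in are only illustrative (the case and
attribute counts are borrowed from the largest data base mentioned in
section 8.0, with an assumed 10 classes).

```python
def storage_order(n_classes_per_clsf, n_data, n_stored_clsfs,
                  n_attributes, n_attribute_values):
    """Evaluate the order-of-storage expression given above."""
    return n_classes_per_clsf * (
        n_data + n_stored_clsfs * n_attributes * n_attribute_values)

# 10 classes, 5420 cases, MAX_N_STORE = 10, 93 real-valued attributes
# (counted as 2 values each).
estimate = storage_order(10, 5420, 10, 93, 2)
print(estimate)
```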

AutoClass C is configured to handle a maximum of 999 attributes.  If you
attempt to run with more than that you will get array bound violations.
In that case, change these configuration parameters in prog/autoclass.h
and recompile AutoClass C:

#define ALL_ATTRIBUTES                  999   
#define VERY_LONG_STRING_LENGTH         20000 
#define VERY_LONG_TOKEN_LENGTH          500 

For example, these values will handle several thousand attributes:

#define ALL_ATTRIBUTES                  9999
#define VERY_LONG_STRING_LENGTH         50000
#define VERY_LONG_TOKEN_LENGTH          50000

Disk space taken up by the "log" file will of course depend on the duration
of the search.  N_SAVE (default value = 2) determines how many best 
classifications are saved into the ".results[-bin]" file.  SAVE_COMPACT_P 
controls whether the "results" and "checkpoint" files are saved as binary.  
Binary files are faster and more compact, but are not portable.  The 
default value of SAVE_COMPACT_P is true, which causes binary files to be 
written.

If the time taken to save the "results" files is a problem, consider
increasing MIN_SAVE_PERIOD (default value = 1800 seconds or 30 minutes).  
Files are saved to disk this often if there is anything different to 
report.


11.0 JUST HOW SLOW IS IT?

Compute time is of order n_data * n_attributes * n_classes * n_tries
* converge_cycles_per_try. The major uncertainties in this are the
number of basic back and forth cycles till convergence in each try, and of
course the number of tries.  The number of cycles per trial is typically 
10-100 for TRY_FN_TYPE "converge", and 10-200+ for "converge_search_3" 
and "converge_search_4".  The maximum number is specified by MAX_CYCLES
(default value = 200).  The number of trials is up to you and your
available computing resources.

The running time of very large data sets will be quite uncertain.  We 
advise that a few small scale test runs be made on your system to 
determine a baseline.  Specify N_DATA to limit how many data vectors
are read.  Given a very large quantity of data, AutoClass may find its 
most probable classifications at upwards of a hundred classes, and this
will require that START_J_LIST be specified appropriately (See section
9.0 HOW MANY CLASSES?).  If you are quite certain that you only want a 
few classes, you can force AutoClass to search with a fixed number of 
classes specified by FIXED_J.  You will then need to run separate 
searches with each different fixed number of classes.
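
A small test run can be scaled up with the compute-time formula from
the start of this section.  The Python sketch below does that
arithmetic; the function names and all the numbers are invented, and
the result is only as good as the assumption that time is strictly
proportional to the product.

```python
def work_units(n_data, n_attributes, n_classes, n_tries, cycles_per_try):
    """The order-of-compute-time product given above."""
    return n_data * n_attributes * n_classes * n_tries * cycles_per_try

def extrapolate_seconds(measured_seconds, test_run, full_run):
    """Scale a measured test-run time to a planned full run."""
    return measured_seconds * work_units(**full_run) / work_units(**test_run)

# Hypothetical test run limited with N_DATA, and the planned full run.
test_run = dict(n_data=500, n_attributes=20, n_classes=5,
                n_tries=2, cycles_per_try=50)
full_run = dict(n_data=5000, n_attributes=20, n_classes=10,
                n_tries=50, cycles_per_try=50)

estimate = extrapolate_seconds(12.0, test_run, full_run)
print(estimate)
```

Here a 12 second test run suggests a full run on the order of 6000
seconds; the real figure will vary with the number of cycles each try
actually needs.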


12.0 CHANGING FILENAMES IN A SAVED CLASSIFICATION FILE

AutoClass caches the data, header, and model file pathnames in the saved
classification structure of the binary (".results-bin") or ASCII
(".results") "results" files.  If the "results" and "search" files are
moved to a different directory location, the search cannot be successfully 
restarted if you have used absolute pathnames.  Thus it is advantageous 
to invoke AutoClass in a parent directory of the data, header, and 
model files, so that relative pathnames can be used.  Since the pathnames 
cached will then be relative, the files can be moved to a different host 
or file system and restarted -- providing the same relative pathname 
hierarchy exists.

However, since the ".results" file is ASCII text, those pathnames could
be changed with a text editor (SAVE_COMPACT_P must be specified as false).
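
The same edit can be scripted.  The Python sketch below does a literal
text substitution on a stand-in file; it does not reproduce the real
".results" layout, and if that file also records string lengths or
similar, a plain substitution may not be enough, so inspect the file
and keep a backup first.  The function name and paths are invented.

```python
import os
import tempfile

def replace_cached_path(results_path, old_path, new_path):
    """Replace every occurrence of old_path in an ASCII file.

    Returns the number of occurrences that were replaced.
    """
    with open(results_path, "r") as handle:
        text = handle.read()
    count = text.count(old_path)
    if count:
        with open(results_path, "w") as handle:
            handle.write(text.replace(old_path, new_path))
    return count

# Demonstration on a made-up file, not a real AutoClass results file.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "demo.results")
with open(demo_file, "w") as handle:
    handle.write('data "/old/home/data/my.db2"\n'
                 'header "/old/home/data/my.hd2"\n')

changed = replace_cached_path(demo_file, "/old/home/data/",
                              "/new/site/data/")
with open(demo_file) as handle:
    updated = handle.read()
print(changed)
print(updated)
```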


13.0 THOSE PARAMETERS AGAIN -- WITH ANNOTATIONS

# PARAMETERS TO AUTOCLASS-SEARCH -- AutoClass C
# ---------------------------------------------------------------
# as the first character makes the line a comment, or
! as the first character makes the line a comment, or
; as the first character makes the line a comment, or
;;; '\n' as the first character (empty line) makes the line a comment.

# to override the following default parameters,
# enter below the line => #!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;
# <parameter_name> = <parameter_value>, or
# <parameter_name> <parameter_value>, or      # separator is a space
# <parameter_name>\tab<parameter_value>.
# note: blanks/spaces are ignored if '=', or '\tab' are separators;
# note: no trailing ';'s.
# ---------------------------------------------------------------
#  DEFAULT PARAMETERS
# ---------------------------------------------------------------
# rel_error = 0.01
!       Specifies the relative difference measure used by clsf-DS-%=, when 
!       deciding if a new clsf is a duplicate of an old one.  

# start_j_list = 2, 3, 5, 7, 10, 15, 25
!       Initially try these numbers of classes, so as not to narrow the
!       search too quickly.  The state of this list is saved in the
!       <..>.search file and used on restarts, unless an override
!       specification of start_j_list is made in the .s-params file for the
!       restart run.  This list should bracket your expected number of
!       classes, and by a wide margin!
!       start_j_list = -999 specifies an empty list (allowed only on restarts)

# n_classes_fn_type = "random_ln_normal"
!       Once start_j_list is exhausted, AutoClass will call this function to 
!       decide how many classes to start with on the next try, based on the 
!       10 best classifications found so far.
!       Currently only "random_ln_normal" is available.

# fixed_j = 0
!       When fixed_j > 0, overrides start_j_list and n_classes_fn_type, and 
!       AutoClass will always use this value for the initial number of classes.

# min_report_period = 30
!       Wait at least this time (in seconds) since last report until
!       reporting verbosely again.   
!       Should be set longer than the expected run time when checking for 
!       repeatability of results.  For repeatable results, also see 
!       force_new_search_p, start_fn_type and randomize_random_p.

# NOTE: At least one of "interactive_p", "max_duration", and "max_n_tries"
must be active.  Otherwise AutoClass will run indefinitely.  See below.

# interactive_p = true
!       When false, allows run to continue until otherwise halted.
!       When true, standard input is queried on each cycle for the quit
!       character "q", which, when detected, triggers an immediate halt. 

# max_duration = 0
!       When = 0, allows run to continue until otherwise halted.
!       When > 0, specifies the maximum number of seconds to run.  

# max_n_tries = 0
!       When = 0, allows run to continue until otherwise halted.
!       When > 0, specifies the maximum number of tries to make.

# n_save = 2
!       Save this many clsfs to disk in the .results[-bin] and .search files.
!       if 0, don't save anything (no .search & .results[-bin] files). 

# log_file_p = true
!       If false, do not write a log file.

# search_file_p = true
!       If false, do not write a search file. 

# results_file_p = true
!       If false, do not write a results file.

# min_save_period = 1800
!       CPU crash protection.  This specifies the maximum time, in seconds,
!       that AutoClass will run before it saves the current results to disk.
!       The default time is 30 minutes.

# max_n_store = 10
!       Specifies the maximum number of classifications stored internally.

# n_final_summary = 10
!       Specifies the number of trials to be printed out after search ends.

# start_fn_type = "random"
!       One of {"random", "block"}.  This specifies the type of class
!       initialization.  For normal search, use "random", which randomly
!       selects instances to be initial class means, and adds appropriate
!       variances. For testing with repeatable search, use "block", which
!       partitions the database into successive blocks of near equal size.
!       For repeatable results, also see force_new_search_p,
!       min_report_period, and randomize_random_p.

# try_fn_type = "converge_search_3"
!       One of {"converge_search_3", "converge_search_4", "converge"}. 
!       These specify alternate search stopping criteria.  
!       "converge" merely tests the rate of change of the log_marginal
!       classification probability (clsf->log_a_x_h), without checking
!       rate of change of individual classes (see halt_range and
!       halt_factor).  
!       "converge_search_3" and "converge_search_4" each monitor the ratio
!       class->log_a_w_s_h_j/class->w_j for all classes, and continue
!       convergence until all pass the quiescence criteria for n_average
!       cycles.  "converge_search_3" tests differences between successive
!       convergence cycles (see "rel_delta_range").  This provides a
!       reasonable, general purpose stopping criteria.
!       "converge_search_4" averages the ratio over "sigma_beta_n_values"
!       cycles (see "cs4_delta_range").  This is preferred when
!       "converge_search_3" produces many similar classes.

# initial_cycles_p = true
!       If true, perform base_cycle in initialize_parameters.
!       false is used only for testing.

# save_compact_p = true
!       true saves classifications as machine dependent binary
!       (.results-bin & .chkpt-bin).   
!       false saves as ascii text (.results & .chkpt)

# read_compact_p = true
!       true reads classifications as machine dependent binary 
!       (.results-bin & .chkpt-bin).
!       false reads as ascii text (.results & .chkpt).
        
# randomize_random_p = true
!       false seeds lrand48, the pseudo-random number function with 1 
!       to give repeatable test cases.  true uses universal time clock 
!       as the seed, giving semi-random searches.
!       For repeatable results, also see force_new_search_p, 
!       min_report_period and start_fn_type.

# n_data = 0
!       With n_data = 0, the entire database is read from .db2.  
!       With n_data > 0, only this number of data are read. 

# halt_range = 0.5
!       Passed to try_fn_type "converge".  With the "converge"
!       try_fn_type, convergence is halted when the larger of halt_range
!       and (halt_factor * current_log_marginal) exceeds the difference
!       between successive cycle values of the classification log_marginal
!       (clsf->log_a_x_h).  Decreasing this value may tighten the
!       convergence and increase the number of cycles.

# halt_factor = 0.0001
!       Passed to try_fn_type "converge".  With the "converge"
!       try_fn_type, convergence is halted when the larger of halt_range
!       and (halt_factor * current_log_marginal) exceeds the difference
!       between successive cycle values of the classification log_marginal
!       (clsf->log_a_x_h).  Decreasing this value may tighten the
!       convergence and increase the number of cycles.

# rel_delta_range = 0.0025
!       Passed to try function "converge_search_3", which monitors the
!       ratio of log approx-marginal-likelihood of class statistics
!       with-respect-to the class hypothesis (class->log_a_w_s_h_j)
!       divided by the class weight (class->w_j), for each class.
!       "converge_search_3" halts convergence when, for every class, the
!       difference in this ratio between successive cycles has been less
!       than "rel_delta_range" for "n_average" cycles.  Decreasing
!       "rel_delta_range" tightens the convergence and increases the
!       number of cycles.

# cs4_delta_range = 0.0025
!       Passed to try function "converge_search_4", which monitors the
!       ratio of (class->log_a_w_s_h_j)/(class->w_j), for each class,
!       averaged over "sigma_beta_n_values" convergence cycles.
!       "converge_search_4" halts convergence when the maximum difference
!       in average values of this ratio falls below "cs4_delta_range".
!       Decreasing "cs4_delta_range" tightens the convergence and
!       increases the number of cycles.

# n_average = 3
!       Passed to try functions "converge_search_3" and "converge".
!       The number of cycles for which the convergence criterion
!       must be satisfied for the trial to terminate.

# sigma_beta_n_values = 6
!       Passed to try_fn_type "converge_search_4".  The number of past 
!       values to use in computing sigma^2 (noise) and beta^2 (signal).

# max_cycles = 200
!       This is the maximum number of cycles permitted for any one convergence 
!       of a classification, regardless of any other stopping criteria.  This
!       is very dependent upon your database and choice of model and
!       convergence parameters, but should be about twice the average number
!       of cycles reported in the screen dump and .log file.

# converge_print_p = false
!       If true, the selected try function will print to the screen values
!       useful in specifying non-default values for halt_range, halt_factor,
!       rel_delta_range, n_average, sigma_beta_n_values, and range_factor.

# force_new_search_p = true
!       If true, will ignore any previous search results, discarding the 
!       existing .search & .results[-bin] files after confirmation by the 
!       user; if false, will continue the search using the existing 
!       .search & .results[-bin] files. 
!       For repeatable results, also see min_report_period, start_fn_type 
!       and randomize_random_p.

# checkpoint_p = false
!       If true, checkpoints of the current classification will be written
!       every "min_checkpoint_period" seconds, with file extension
!       .chkpt[-bin].  This is only useful for very large classifications.

# min_checkpoint_period = 10800
!       If checkpoint_p = true, the checkpointed classification will be 
!       written this often, in seconds (default = 3 hours).

# reconverge_type = ""
!       Can be either "chkpt" or "results".  If "checkpoint_p" = true and
!       "reconverge_type" = "chkpt", then continue convergence of the
!       classification contained in <...>.chkpt[-bin].  If "checkpoint_p"
!       = false and "reconverge_type" = "results", continue convergence of
!       the best classification contained in <...>.results[-bin].  

# screen_output_p = true
!       If false, no output is directed to the screen.  Assuming 
!       log_file_p = true, output will be directed to the log file only.

# break_on_warnings_p = true
!       The default value asks the user whether or not to continue, when data
!       definition warnings are found.  If specified as false, then AutoClass
!       will continue, despite warnings -- the warning will continue to be
!       output to the terminal and the log file.

# free_storage_p = true
!       The default value tells AutoClass to free the majority of its allocated
!       storage.  This is not required, and on DEC Alphas it causes a
!       core dump.  If specified as false, AutoClass will not attempt to free
!       storage.

#!#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;
# OVERRIDE PARAMETERS
#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;#!;
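
The "converge" stopping rule described above for halt_range, halt_factor and
n_average can be sketched in C.  This is an illustrative sketch only, not the
AutoClass C source: the converge_state type and the abs_d and converge_step
functions are hypothetical names.  Two assumptions are made that the parameter
descriptions do not state: absolute values are compared, and the n_average
satisfying cycles must be consecutive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the sketch; not an AutoClass structure. */
typedef struct {
    double prev_log_marginal;  /* log marginal seen on the previous cycle */
    int    have_prev;          /* 0 until the first cycle has been recorded */
    int    n_satisfied;        /* consecutive cycles meeting the criterion */
} converge_state;

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Call once per cycle.  Returns 1 when the trial should stop, else 0.
   Assumption: magnitudes are compared, since log marginals are
   typically negative. */
static int converge_step(converge_state *st, double log_marginal,
                         double halt_range, double halt_factor,
                         int n_average)
{
    double tolerance, delta;

    assert(st != NULL && n_average > 0);
    if (!st->have_prev) {              /* nothing to compare against yet */
        st->prev_log_marginal = log_marginal;
        st->have_prev = 1;
        st->n_satisfied = 0;
        return 0;
    }
    /* the larger of halt_range and (halt_factor * current log marginal) */
    tolerance = halt_factor * abs_d(log_marginal);
    if (halt_range > tolerance)
        tolerance = halt_range;

    delta = abs_d(log_marginal - st->prev_log_marginal);
    st->prev_log_marginal = log_marginal;

    /* Assumption: a cycle that fails the test resets the count. */
    if (tolerance > delta)
        st->n_satisfied++;
    else
        st->n_satisfied = 0;
    return st->n_satisfied >= n_average;
}
```

With halt_range = 0.5, halt_factor = 0.0001 and n_average = 3, converge_step
returns 1 on the third consecutive cycle whose change in log marginal is
smaller than the tolerance.  Lowering either halt value shrinks the tolerance,
so more cycles are needed before the count is reached.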


14.0 HOW TO GET AUTOCLASS C TO PRODUCE REPEATABLE RESULTS

In some situations, repeatable classifications are required: comparing basic
AutoClass C integrity on different platforms, porting AutoClass C to a new
platform, etc.  To accomplish this, two things are necessary: 1) the
same random number generator must be used, and 2) the search parameters must
be specified properly.

Random Number Generator
This implementation of AutoClass C uses the Unix srand48/lrand48 random number
generator, which generates pseudo-random numbers using the well-known linear
congruential algorithm and 48-bit integer arithmetic.  lrand48() returns
non-negative long integers uniformly distributed over the interval [0, 2**31).
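
The effect of a fixed seed can be checked with a small C sketch.  The helper
names seed_generator and draw_after_seeding are hypothetical and are not taken
from the AutoClass C source; only srand48, lrand48 and time are real library
calls.  The sketch assumes the randomize_random_p behaviour described in the
.s-params comments: seed 1 when false, the clock when true.

```c
#define _XOPEN_SOURCE 600   /* expose the srand48/lrand48 declarations */
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: seed the generator as randomize_random_p is
   described.  A zero flag gives the fixed seed 1; a nonzero flag uses
   the current clock time.  Returns the seed that was used. */
static long seed_generator(int randomize_random_p)
{
    long seed = randomize_random_p ? (long) time(NULL) : 1L;
    srand48(seed);
    return seed;
}

/* Hypothetical helper: reseed, then store n draws in out[0..n-1]. */
static void draw_after_seeding(int randomize_random_p, long *out, int n)
{
    int i;

    assert(out != NULL && n >= 0);
    seed_generator(randomize_random_p);
    for (i = 0; i < n; i++)
        out[i] = lrand48();   /* non-negative, below 2**31 */
}
```

Two calls to draw_after_seeding with the flag set to 0 fill their buffers with
the same values on a given platform, which is the property a repeatable search
relies on.  With the flag set to 1, runs started at different clock times will
generally differ.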

Search Parameters
The following .s-params file parameters should be specified:
force_new_search_p = true
start_fn_type = "block"
randomize_random_p = false
;; specify the number of trials you wish to run
max_n_tries = 50
;; specify a time greater than duration of run
min_report_period = 30000

Note that no current best classification reports will be produced.  Only a
final classification summary will be output.