File: preparation-c.text

  		PREPARING DATA FOR AUTOCLASS

1.0        Introduction
1.1        Applicable Types of Data
1.1.1         Real Scalar Data: ERROR and REL_ERROR
1.2        Probability Models
1.2.1         SINGLE_NORMAL_CN/CM and MULTI_NORMAL_CN Models
1.3        Input Files
1.3.1         Data File
1.3.1.1          Handling Missing Values
1.3.2         Header File
1.3.2.1          Header File Example
1.3.3         Model File
1.3.3.1          Model File Example
1.4        Checking Input Files
Footnotes


1.0  Introduction

    This documentation file is directed at anyone who will be preparing data
sets for AutoClass C. It requires no statistics or Artificial Intelligence
background.

	
1.1  Applicable Types of Data

    AutoClass is applicable to observations of things that can be described by
a set of features or properties, without referring to other things.  This
allows us to represent the observations by a data vector corresponding to a
fixed attribute set.  Attributes are names of measurable or distinguishable
properties of the things observed.  The data values corresponding to each
attribute are thus limited to be either numbers or the elements of a fixed set
of attribute specific symbols.  With numeric data, a measurement error is
assumed and must be provided with the attribute description.  AutoClass cannot
express relationships between things because such relationships are not a
property of the thing itself.  Nor can AutoClass deal with properties expressed
as sets of values.  However the current models do allow for missing or unknown
values.  The program itself imposes no specific limit on the amount of data,
but databases having more than 10^5 values (cases * attributes) may require
excessive search time.

    Note that there are techniques for re-expressing some data types in forms
acceptable to AutoClass.  If a set valued property is limited to subsets of a
small set of symbols, one can re-express the property as a set of binary
attributes, one for each of the possible symbols.  Temporal ordering data can
be expressed as "time of (year, week, day)" or "time elapsed since ...".  And
one can always indicate that a relation has been observed, even if the related
thing cannot be named.  A simple example of the latter is the transformation of
`married-to' to `married?'.


1.1.1  Real Scalar Data: ERROR and REL_ERROR (see footnote #1)

AutoClass and its documentation were written with the idea that it would
be applied to "direct" measurements of instance properties.  In such
cases, multiple measurements of a single instance will soon establish the
limit beyond which increasing measurement "precision" is simply noise.
The classic example results from attempting to use a digital VOM meter
with hand held probes on an oxidized contact: when set at the
appropriate range, the last few digits will vary almost randomly.  In
such cases it is relatively easy to establish an average error
appropriate for the reported measurements.  It is the range of digits
over which measurement noise dominates the measured value.  Thus with
measurement error the fundamental question is which digits are due to the
measured property and which to measurement noise.

Truncation error will often dominate measurement error.  Here the
classical example is human age: measurable to within a few minutes,
easily computable to within a few days, yet generally reported in years.
The reported value has been truncated to much less than its potential
precision.  Thus the error in that reported value is on the order of
half the least difference of the representation.  Truncation error can
arise from a variety of causes.  Its presence should be suspected
whenever measurements of intrinsically continuous properties are reported
as integers or limited precision floating point numbers.

Lacking any specific information bearing on the magnitude of measurement
or truncation errors, we adopt the default of assuming the reported data
to be truncated at the measurement error.  Thus we adopt a default error
of 1/2 the least difference of the representation.  This is often 1/2
the least significant reported digit. But beware of cases where the
least difference between values is greater than the least digit.
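
    As an illustration, that default can be estimated directly from a column
of reported values.  The following sketch (not part of AutoClass; the data
and names are purely illustrative) takes 1/2 the least nonzero difference
between sorted values, which stands in for the least difference of the
representation when the data's granularity is not documented:

	#include <stdio.h>
	#include <stdlib.h>
	#include <math.h>

	/* Compare doubles for qsort. */
	static int cmp_double(const void *a, const void *b)
	{
	    double d = *(const double *)a - *(const double *)b;
	    return (d > 0.0) - (d < 0.0);
	}

	/* Default ERROR: 1/2 the least nonzero difference between values. */
	double default_error(double *values, size_t n)
	{
	    double least = HUGE_VAL;
	    size_t i;

	    qsort(values, n, sizeof(double), cmp_double);
	    for (i = 1; i < n; i++) {
	        double diff = values[i] - values[i - 1];
	        if (diff > 0.0 && diff < least)    /* skip duplicate values */
	            least = diff;
	    }
	    return (least == HUGE_VAL) ? 0.0 : least / 2.0;
	}

	int main(void)
	{
	    double age[] = { 23.0, 23.0, 31.0, 42.0, 57.0 };  /* ages in years */
	    printf("default error = %g\n", default_error(age, 5));  /* 4 */
	    return 0;
	}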

Things get more difficult with "indirect" attribute values computed as
functions over one or more measurements.  In principle one can carry any
known errors (or equivalently, precision) through the function to
determine the value's error.  In practice this is rarely done -
conventional math routines assume that floating point numbers are
integer sums of a limited range of integer powers of two.  Any
unspecified digits are assumed to be zero.

Mathematica's (Wolfram Research) implementation of Arbitrary-Precision 
Numbers makes no such assumptions about unspecified digits.  Its results 
are truncated at the first digit that could be affected by an unspecified 
digit in any input.  Thus it returns no more precision than is justified 
by the inputs and mathematical manipulations.  It can be most educational 
to see how quickly one loses precision in relatively simple calculations, 
and how such loss is affected by different forms of mathematically identical
calculations.  It is not difficult to devise calculations that start
with high precision inputs and return negative precision results: values
which are entirely meaningless.  We strongly recommend the use of such a
tool for investigating the effects of any data manipulations used to
generate AutoClass inputs.

Lacking access to Mathematica or an equivalent, one should certainly
investigate the effects of varying data values over their error range to
gauge the effect on the resulting functional values.  Sometimes this can
be done symbolically.  More often it will require a numerical
investigation.  Of course, such investigations assume that one knows the
error range or precision of the function inputs.  Lacking definite
information on this point, one can use the default truncation value.
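
    Lacking a symbolic tool, a brute-force numerical sweep will serve.  The
sketch below (the function f and the error ranges are purely illustrative)
varies both inputs over their error ranges and reports the spread of the
resulting functional values:

	#include <stdio.h>
	#include <math.h>

	/* Illustrative derived value; ill-conditioned as x approaches y. */
	static double f(double x, double y)
	{
	    return log(x) / (x - y);
	}

	int main(void)
	{
	    double x = 1.05, ex = 0.01;    /* measurement and its +/- error */
	    double y = 1.00, ey = 0.01;
	    double lo = HUGE_VAL, hi = -HUGE_VAL;
	    int i, j, steps = 20;

	    /* Sweep both inputs over their error ranges; record the spread. */
	    for (i = 0; i <= steps; i++)
	        for (j = 0; j <= steps; j++) {
	            double v = f(x - ex + 2.0 * ex * i / steps,
	                         y - ey + 2.0 * ey * j / steps);
	            if (v < lo) lo = v;
	            if (v > hi) hi = v;
	        }
	    printf("f spans [%g, %g] over the input error ranges\n", lo, hi);
	    return 0;
	}

Here a nominal value of f(1.05, 1.00) = 0.98 spans roughly [0.78, 1.31] once
the input errors are taken into account -- only the first digit survives.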

The fundamental question in all of this is: "To what extent do you
believe the numbers that are to be given to AutoClass?"  AutoClass will
run quite happily with whatever it is given.  It is up to the user to
decide what is meaningful and what is not.

"Real scalar" is our term for singly bounded real values, typically
bounded below at zero.  Classical examples are the height and length of
an object, neither of which can be negative.  The corresponding
counterexamples would be elevation and location, both measured with respect to
some arbitrary zero and capable of going negative.  For scalar reals we
use the Log-normal model which has zero probability density at and below
zero.  This is currently implemented by taking the logs of the data
values and applying the Gaussian Normal model.  The current Normal model
requires a constant error term to set the bounds of integration.
  
It turns out that a constant error in the logarithm of a value is
equivalent to a relative error in the original value.  That is, the
error in the value should be proportional to the value, rather than
being itself a constant.  And REL_ERROR is just the ratio of the error
to the value.  If your knowledge of the data generating process is
sufficient to specify such a ratio, just give it as the value of
REL_ERROR.  Otherwise give your estimate of the constant error as ERROR,
and AutoClass will compute the ratio of this to the average data value
and use this as REL_ERROR.
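
    A minimal sketch of that conversion (illustrative code, not the AutoClass
internals):

	#include <stdio.h>

	/* REL_ERROR derived from a constant ERROR: error / mean(values). */
	double rel_error_from_error(const double *values, size_t n, double error)
	{
	    double sum = 0.0;
	    size_t i;

	    for (i = 0; i < n; i++)
	        sum += values[i];
	    return error / (sum / n);
	}

	int main(void)
	{
	    double weight[] = { 5.2, 7.9, 9.4 };   /* weights, kg */
	    /* a constant error of 0.05 kg over a mean of 7.5 kg -> ~0.0067 */
	    printf("rel_error = %g\n", rel_error_from_error(weight, 3, 0.05));
	    return 0;
	}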

1.2  Probability Models

    The SINGLE_MULTINOMIAL, SINGLE_NORMAL_CM, and SINGLE_NORMAL_CN
probability models assume that the attributes are conditionally independent
given the class.  Thus within each class the probability that an instance
of the class has a particular value of any attribute depends only
on the class and is independent of all other attribute values.  The
MULTI_NORMAL_CN covariant model expresses mutual dependences within the
class.  The probability that the class will produce any particular instance
is then the product of any independent and covariant probability terms. 
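
    Schematically, for conditionally independent attribute values x_1 ... x_k
given class c, the probability of an instance factors as

	P(x_1, ..., x_k | c) = P(x_1 | c) * P(x_2 | c) * ... * P(x_k | c)

where a covariant model term replaces the corresponding single factors with a
joint term over its attribute group.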

    We use covariant or independent multinomial model terms for discrete
attributes of nominal, ordered, and circular subtypes (all are currently
handled identically).  These model terms allow any number of values for an
attribute, including unknown values.  We use Gaussian normal model terms for
real numerical attributes, or any attribute representing measurements.  There are
actually two independent versions, one of which allows for the possibility of
unknown values.  The covariant normal model term requires that all attribute
values be known for every case.  There is also an `ignore' model term for
attributes which are not to be considered in generating the classification.


1.2.1  SINGLE_NORMAL_CN/CM and MULTI_NORMAL_CN Models

    The SINGLE_NORMAL_CN/CM and  MULTI_NORMAL_CN models were originally
written for use with real valued attributes of the location subtype.
Such attributes are unbounded - their values can potentially range to
+/- infinity.  A scalar real valued attribute is singly bounded.  Its
values are constrained by prior information to lie to one side of a zero
point, typically 0.0, with no values lying in the `negative' region.  Thus
Normal models that assign non-zero probability density to the `negative'
region are less than optimally informative.  Note that we say `less than
optimal' rather than `incorrect'.  The standard Normal model will
generally do a good job of classifying scalar reals, and will do an
excellent job with scalar reals that are clustered well away from the
zero point.  But we can do better, especially when the values cluster
close to the zero point.

    The better model is the Log-Normal, obtained by substituting ln(x-zero)
for x in the Normal model.  This model assigns zero probability density
to x <= zero.  The peak probability density value, at x=e^(mu-sigma^2),
can be arbitrarily close to the zero point while remaining quite independent of
the distribution's variance, s^2 = e^(2*mu + sigma^2)*(e^(sigma^2) - 1).
Yet for small sigma, say sigma < .1, the Log-Normal is visually
indistinguishable from a Normal.
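
    To make those expressions concrete, the following sketch (the values of
mu and sigma are purely illustrative) evaluates the peak location and
variance for a small sigma:

	#include <stdio.h>
	#include <math.h>

	int main(void)
	{
	    double mu = 0.0, sigma = 0.08;   /* small sigma: near-Normal shape */

	    /* peak (mode) of the Log-Normal: x = e^(mu - sigma^2) */
	    double mode = exp(mu - sigma * sigma);
	    /* variance: s^2 = e^(2*mu + sigma^2) * (e^(sigma^2) - 1) */
	    double var = exp(2.0 * mu + sigma * sigma) *
	                 (exp(sigma * sigma) - 1.0);

	    printf("mode = %g, variance = %g\n", mode, var);
	    return 0;
	}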

    In the current AutoClass C we obtain the Log-Normal model by transforming
the attribute values to ln(x-zero) and applying the appropriate
xxx-Normal-yy model.  The key for obtaining this variation is the
specification of the real subtype as `scalar', with appropriate
ZERO_POINT and REL_ERROR values.  When AutoClass C is instructed
to apply a Normal model to such an attribute, it automatically performs
the transformation, effectively applying the equivalent Log-Normal model.
The specification of a real valued attribute's subtype is thus a
specification of the type of Normal model to be used on that attribute.  

The MULTI_NORMAL_CN model implements a multi-dimensional normal distribution 
over a group of attributes that have real continuous values, with no values 
missing.  It is the model of choice for such attributes when they are 
thought to have correlated values.  When such correlations are present,
the classifications obtained using the MULTI_NORMAL_CN model will generally 
be more probable than those obtained with the SINGLE_NORMAL_CN model, because
they better describe the data distributions.  But one should not apply it
indiscriminately.  Lacking strong prior evidence for correlations, but
suspecting them, one needs to try all reasonable combinations of attributes
and compare the probability of the resulting classifications.

As an example, consider a database of instance vectors describing physical
objects that have intrinsic size and shape, neither of which is recorded.
Then one expects that the recorded attributes length, width, and depth, will
vary linearly with size, and have differing ratios with respect to shape.  
Given a sufficient number of instances of each shape, the MULTI_NORMAL_CN 
model applied to length, width, and depth, should pick out the shape 
classes in terms of the attribute correlations.  The SINGLE_NORMAL_CN 
model might pick out the shapes, but it would tend to divide each shape 
into classes of similar size, and to merge similar sizes of differing 
shapes into common classes.


1.3  Input Files

    An AutoClass data set resides in two files.  There is a header file
(file type "hd2") that describes the specific data format and attribute
definitions.  The actual data values are in a data file (file type "db2").
We use two files to allow editing of data descriptions without having to
deal with the entire data set.  This makes it easy to experiment with 
different descriptions of the database without having to reproduce the data
set.  Internally, an AutoClass database structure is identified by its
header and data files, and the number of data loaded. 

    A classification of a data set is made with respect to a model which
specifies the form of the probability distribution function for classes in that
data set.  Normally the model structure is defined in a model file (file
type "model"), containing one or more models.  Internally, a model is defined
relative to a particular database.  Thus it is identified by the corresponding
database, the model's model file and its sequential position in the file.


1.3.1  Data File

    The data file contains a sequence of data objects (each one a datum, or case),
terminated by the end of the file. The number of values for each data object
must be equal to the number of attributes defined in the header file.  There is
an implied "new-line" ('\n') after each data object.  Data objects must be
groups of tokens delimited by "new-line".  Attributes are typed as REAL,
DISCRETE, or DUMMY.  Real attribute values are numbers, either integer or
floating point.  Discrete attribute values can be strings, symbols, or integers.
A dummy attribute value can be any of these types.  Dummies are read in but
otherwise ignored -- they will be set to zeros in the internal database.
Thus the actual values will not be available for use in report output.
To have these attribute values available, use either type REAL or type
DISCRETE, and define their model type as IGNORE in the .model file.
Missing values for any attribute type may be represented either by '?' or by
another token specified in the header file.  All are translated to a special
unique value after being read, so this symbol is effectively reserved for 
unknown/missing values.
	  
	Example: 

		  white	      38.991306 0.54248405  2 2 1
		  red         25.254923 0.5010235   9 2 1
		  yellow      32.407973 ?	    8 2 1
		  all_white   28.953982 0.5267696   0 1 1
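
    For example, to keep the color values above available in report output
without letting them influence the classification, one could type the
attribute as DISCRETE in the header file and ignore it in the model file,
rather than typing it as DUMMY (a hypothetical fragment, using the formats
detailed in sections 1.3.2 and 1.3.3):

	  ;; .hd2 fragment -- attribute 0 is read and retained
	  0 discrete nominal "Color name" range 4

	  ;; .model fragment -- attribute 0 takes no part in the search
	  ignore 0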

The data file can optionally be input in binary format.  This is useful for
very large data files in order to reduce disk space and time for reading 
the file.  The user must create the binary file to conform to the following:

        - the file name extension must be ".db2-bin", rather than ".db2".
        - the file begins with a 12-byte header
                - char[8] = ".db2-bin",
                - 32-bit integer with byte-length of each data case.
        - the data cases follow in binary "float" format -- 32 bit fields.

Real valued data, and discrete integer data converted to floating point
format, are accommodated.  Discrete character data (e.g. "white", in the above
example) would have to be assigned integer values, and converted to
floating point format.
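
    A sketch of a writer for that layout (the byte order and the mapping of
the color names to integer codes are this example's assumptions, not part of
the specification):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
	    enum { N_ATTRS = 6, N_CASES = 2 };
	    /* two cases from the ascii example, with "white" -> 0.0 and
	       "red" -> 1.0 as an assumed integer coding of the colors */
	    float cases[N_CASES][N_ATTRS] = {
	        { 0.0f, 38.991306f, 0.54248405f, 2.0f, 2.0f, 1.0f },
	        { 1.0f, 25.254923f, 0.5010235f,  9.0f, 2.0f, 1.0f },
	    };
	    int32_t case_bytes = N_ATTRS * (int32_t)sizeof(float);
	    FILE *fp = fopen("example.db2-bin", "wb");

	    if (fp == NULL)
	        return 1;
	    fwrite(".db2-bin", 1, 8, fp);                   /* char[8] magic */
	    fwrite(&case_bytes, sizeof case_bytes, 1, fp);  /* 32-bit case length */
	    fwrite(cases, sizeof(float), N_CASES * N_ATTRS, fp);
	    fclose(fp);
	    return 0;
	}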

Note: DOS derived data files that are to be used in a Unix environment should
first be processed with dos2unix, to remove carriage returns (^M) from the
lines.  We have observed a case where such carriage returns were read as part
of a discrete data value, passed through AutoClass, and printed in the
xxx.influ report, where they destroyed the data formatting.  Should this
occur, xxx.influ data formatting can still be restored with dos2unix. 


1.3.1.1  Handling Missing Values

Since we were designing AutoClass to work with arbitrary data sets, we could 
make no universally valid assumptions about the mechanisms that generate any 
missing data the system might encounter.  Lacking specifics, we could choose 
no basis for "correcting" missing data.  Thus we were forced to deal with 
the data, and only the data, independent of any information about the data's 
origins.  This is the great disadvantage of any general purpose classifier: 
You either make assumptions that seem good for the current application, but 
will be absurd in others, or you ignore the background information that 
justifies such assumptions.

We took the latter course, treating missing values as valid data.  Thus our
classifications are actually over the convolution of original subjects through
the data collection process, and our results may be dominated by either.  When
no missing values are present, one expects the results to be dominated by the
subject characteristics.  With large proportion of missing values, the
subjects are much obscured by the data collection process, and one must expect
that any patterns found in the data may be due to the collection process
rather than the subjects.  Only strong prior knowledge about the collection
process can justify attempting to deconvolve the data.  

Note that if one regards a classifier as classifying subjects, rather than
data on subjects, then missing data is merely the most obvious example of
erroneous data, which presents a far larger and more intractable problem.  The
assumption, common under this viewpoint, that only the missing data are in
error, is clearly absurd.  AutoClass deals only with the existing record.
Attempting to classify what *should* have been recorded, requires a far more
sophisticated system that is carefully tuned to the collection process.


1.3.2  Header File

    The header file specifies the data file format, and the definitions of
the data attributes.  The header file functional specification consists of
two parts -- the data set format definition specifications, and the
attribute descriptors (; in column 1 identifies a comment):

;; num_db2_format_defs value (number of format def lines that follow),
;;      range of n is 1 -> 5
num_db2_format_defs n
;; number_of_attributes token and value required
number_of_attributes <as required>
;; following are optional - default values are specified
separator_char  ' '
comment_char    ';'
unknown_token   '?'
;; any default may be overridden, e.g.
separator_char  ','

;; attribute descriptors
;; <zero-based att#>  <att_type>  <att_sub_type>  <att_description>  <att_param_pairs>

   Each attribute descriptor is a line of:
      Attribute index (zero based, beginning in column 1)
      Attribute type.  See below.
      Attribute subtype.  See below
      Attribute description: symbol (no embedded blanks) or string; <= 40 characters
      Specific property and value pairs. See below.

         Currently available combinations:

         type           subtype         property type(s)
         ----           -------         ----------------
         dummy          none/nil        --
         discrete       nominal         range
         real           location        error
         real           scalar          zero_point rel_error


     An example is given below in section 1.3.2.1.

    The ERROR property should represent your best estimate of the average error
expected in the measurement and recording of that real attribute.  Lacking
better information, the error can be taken as 1/2 the minimum possible
difference between measured values.  It can be argued that real values are
often truncated, so that smaller errors may be justified, particularly for
generated data.  But AutoClass only sees the recorded values.  So it needs the
error in the recorded values, rather than the actual measurement error.  Setting
this error much smaller than the minimum expressible difference implies the
possibility of values that cannot be expressed in the data.  Worse, it implies
that two identical values must represent measurements that were much closer
than they might actually have been.  This leads to over-fitting of the
classification.

    The REL_ERROR property is used for SCALAR reals, where the error is
taken to be proportional to the measured value.  The ERROR property is not
supported for scalar reals.

    AutoClass uses the error as a lower bound on the width of the normal
distribution.  So small error estimates tend to give narrower peaks and to
increase both the number of classes and the classification probability.  Broad
error estimates tend to limit the number of classes.

    The scalar ZERO_POINT property is the smallest value that the measurement
process could have produced.  This is often 0.0, or less by some error range.
Similarly, the bounded real's min and max properties are exclusive bounds on
the attribute's generating process.  For a calculated percentage these would be
0-e and 100+e, where e is an error value.  The discrete attribute's range is
the number of possible values the attribute can take on.  This range must
include unknown as a value when such values occur.


1.3.2.1  Header File Example
	    
!#; AutoClass C header file -- extension .hd2
!#; the following chars in column 1 make the line a comment:
!#; '!', '#', ';', ' ', and '\n' (empty line)

;#! num_db2_format_defs <num of def lines -- min 1, max 4>
num_db2_format_defs 2
;; required
number_of_attributes 7
;; optional - default values are specified 
;; separator_char  ' '
;; comment_char    ';'
;; unknown_token   '?'
separator_char     ','

;; <zero-based att#>  <att_type>  <att_sub_type>  <att_description>  <att_param_pairs>
0 dummy nil       "True class, range = 1 - 3"
1 real location "X location, m. in range of 25.0 - 40.0" error .25
2 real location "Y location, m. in range of 0.5 - 0.7" error .05
3 real scalar   "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0 rel_error .001
4 discrete nominal  "Truth value, range = 1 - 2" range 2
5 discrete nominal  "Color of foobar, 10 values" range 10
6 discrete nominal  Spectral_color_group range 6 


1.3.3  Model File

    The model file contains data describing the model(s) that will be used for
the classification.  Each model is specified by one or more model group
definition lines.  Each model group line associates zero-based attribute indices
with a model term type.

    Each model group line consists of:
        A model term type (one of single_multinomial, single_normal_cm,
        single_normal_cn, multi_normal_cn, or ignore).
        One or more attribute indices (attribute set list), or the symbol
	default.

    Notes:
      - At least one model definition is required (model_index token).
      - There may be multiple entries in a model for any model term type.
      - An attribute index must not appear more than once in a model list.
      - ignore is not a valid default model term type.
      - Model term types currently consist of: 
	    single_multinomial - models discrete attributes as multinomials,
				   with missing values.
	    single_normal_cn - models real valued attributes as normals; no
				   missing values.
	    single_normal_cm - models real valued attributes with missing values.
	    multi_normal_cn - is a covariant normal model without missing values.
	    ignore - allows the model to ignore one or more attributes.
      - See the documentation in models-c.text for further information about
            specific model terms.
      - Data modeled with single_normal_cn/cm or multi_normal_cn whose subtype
        is scalar (value distribution is away from 0.0, and is thus not a
        "normal" distribution) will be log transformed and modeled with the
        log-normal model.  For data whose subtype is location (value
        distribution is around 0.0), no transform is done, and the normal
        model is used.


1.3.3.1  Model File Example

The tokens "model_index n m" must appear on the first non-comment line, and
precede the model term definition lines. "n" is the zero-based model index,
typically 0 where there is only one model -- the majority of search situations.
"m" is the number of model term definition lines that follow.  Note that 
single model terms may have one or more zero-based attribute indices on 
each line.  Multi model term set lists are two or more zero-based attribute 
indices per line.
	    
!#; AutoClass C model file -- extension .model
!#; the following chars in column 1 make the line a comment:
!#; '!', '#', ';', ' ', and '\n' (empty line)

;; 1 or more model definitions
;; model_index <zero_based index> <number of model definition lines>
model_index 0 7
ignore 0
single_normal_cn 3 
single_normal_cn 17 18 21 
multi_normal_cn 1 2
multi_normal_cn 8 9 10
multi_normal_cn 11 12 13 
single_multinomial default
     

1.4  Checking Input Files

    AutoClass, when invoked in the "search" mode, will check the validity
of the set of data, header, model, and search parameter files.  Errors
will stop the search from starting, and warnings will ask the user whether
to continue.  A history of the error and warning messages is saved,
by default, in the log file.  The AutoClass search form is:

   % autoclass -search <data file path> <header file path> <model file path> 
         <search params file path> 

    All files must be specified as fully qualified relative or absolute 
pathnames.  File name extensions (file types) for all files are forced to 
canonical values required by the AutoClass program:

	data file   ("ascii")   db2 
	data file   ("binary")  db2-bin 
	header file             hd2 
	model file              model
        search params file      s-params

    The search parameter definitions are discussed in search-c.text, as
well as contained as comments in all .s-params files: for example, 
autoclass-c/sample/imports-85.s-params.

    The log file will be named after <search params file path>, with the
file extension forced to "log".  If LOG_FILE_P (search params file
parameter) is false, then no log file is generated.  The log file is
created such that multiple sessions of AUTOCLASS -SEARCH <...> will
result in only one log file.

    N_DATA (search params file parameter), if supplied, allows the reading of
less than the full data file.  This is useful when the data file is very large
and you are just interested in validating the header and model file contents.

    All advisory, warning, and error messages are output to the screen, and to
the log file, provided that the LOG_FILE_P parameter is true (the default).
Advisory messages are output to provide information which is not crucial to
the continuance of the run.  Warning messages contain information which may
affect the quality of the run.  The default condition is to stop
the run when one or more warning messages are generated, and ask the user
whether to proceed.  Error messages are fatal, and the run state will be
terminated.


Footnotes:

#1) REL_ERROR is in upper-case to distinguish it from the surrounding text.
    It represents the lower-case keyword rel_error which is how it is used
    in the .hd2 file.  This is true of other upper-case words or phrases
    which occur in this text.