
PREPARING DATA FOR AUTOCLASS
1.0 Introduction
1.1 Applicable Types of Data
1.1.1 Real Scalar Data: ERROR and REL_ERROR
1.2 Probability Models
1.2.1 SINGLE_NORMAL_CN/CM and MULTI_NORMAL_CN Models
1.3 Input Files
1.3.1 Data File
1.3.1.1 Handling Missing Values
1.3.2 Header File
1.3.2.1 Header File Example
1.3.3 Model File
1.3.3.1 Model File Example
1.4 Checking Input Files
Footnotes
1.0 Introduction
This documentation file is directed at anyone who will be preparing data
sets for AutoClass C. It requires no statistics or Artificial Intelligence
background.
1.1 Applicable Types of Data
AutoClass is applicable to observations of things that can be described by
a set of features or properties, without referring to other things. This
allows us to represent the observations by a data vector corresponding to a
fixed attribute set. Attributes are names of measurable or distinguishable
properties of the things observed. The data values corresponding to each
attribute are thus limited to be either numbers or the elements of a fixed set
of attribute specific symbols. With numeric data, a measurement error is
assumed and must be provided with the attribute description. AutoClass cannot
express relationships between things because such relationships are not a
property of the thing itself. Nor can AutoClass deal with properties expressed
as sets of values. However, the current models do allow for missing or unknown
values. The program itself imposes no specific limit on the number of data,
but databases having more than 10^5 values (cases * attributes) may require
excessive search time.
Note that there are techniques for reexpressing some data types in forms
acceptable to AutoClass. If a set valued property is limited to subsets of a
small set of symbols, one can reexpress the property as a set of binary
attributes, one for each of the possible symbols. Temporal ordering data can
be expressed as "time of (year, week, day)" or "time elapsed since ...". And
one can always indicate that a relation has been observed, even if the related
thing cannot be named. A simple example of the latter is the transformation of
`married-to' to `married?'.
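The set-valued reexpression described above can be sketched in Python. This
is an illustration only: the function name and the symbol set are
hypothetical, and nothing here is part of AutoClass itself.

```python
def set_to_binary(value_set, symbols):
    """Re-express a set-valued property as one binary attribute per symbol."""
    return [1 if s in value_set else 0 for s in symbols]

# Hypothetical small symbol set for a "languages spoken" property.
SYMBOLS = ["english", "french", "german"]

row = set_to_binary({"french", "english"}, SYMBOLS)
print(row)
```

Each resulting column could then be declared as a discrete attribute with a
range of 2.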
1.1.1 Real Scalar Data: ERROR and REL_ERROR (see footnote #1)
AutoClass and its documentation were written with the idea that it would
be applied to "direct" measurements of instance properties. In such
cases, multiple measurements of single instances will soon establish the
limit beyond which increasing measurement "precision" is simply noise.
The classic example results from attempting to use a digital VOM meter
with hand held probes on an oxidized contact: when set at the
appropriate range, the last few digits will vary almost randomly. In
such cases it is relatively easy to establish an average error
appropriate for the reported measurements. It is the range of digits
over which measurement noise dominates the measured value. Thus with
measurement error the fundamental question is which digits are due to the
measured property and which to measurement noise.
Truncation error will often dominate measurement error. Here the
classical example is human age: measurable to within a few minutes,
easily computable to within a few days, yet generally reported in years.
The reported value has been truncated to much less than its potential
precision. Thus the error in that reported value is on the order of
half the least difference of the representation. Truncation error can
arise from a variety of causes. Its presence should be suspected
whenever measurements of intrinsically continuous properties are reported
as integers or limited precision floating point numbers.
Lacking any specific information bearing on the magnitude of measurement
or truncation errors, we adopt the default of assuming the reported data
to be truncated at the measurement error. Thus we adopt a default error
of 1/2 the least difference of the representation. This is often 1/2
the least significant reported digit. But beware of cases where the
least difference between values is greater than the least digit.
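As a rough sketch of this default, the least difference can be estimated
from the recorded values themselves. The function name is hypothetical, and
the estimate is only as good as the sample: if the data happen not to
contain two adjacent representable values, the result will be too large.

```python
def default_error(values):
    """Half the least non-zero difference between distinct reported values."""
    distinct = sorted(set(values))
    if len(distinct) < 2:
        raise ValueError("need at least two distinct values")
    least_diff = min(b - a for a, b in zip(distinct, distinct[1:]))
    return least_diff / 2.0

# Values recorded in steps of 0.5 although printed with two decimals:
# the least difference (0.5) is larger than the least digit (0.01).
print(default_error([1.00, 1.50, 2.50, 4.00]))
```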
Things get more difficult with "indirect" attribute values computed as
functions over one or more measurements. In principle one can carry any
known errors (or equivalently, precision) through the function to
determine the value's error. In practice this is rarely done: conventional
math routines assume that floating point numbers are integer sums of a
limited range of integer powers of two. Any unspecified digits are assumed
to be zero.
Mathematica's (Wolfram Research) implementation of Arbitrary-Precision
Numbers makes no such assumptions about unspecified digits. Its results
are truncated at the first digit that could be affected by an unspecified
digit in any input. Thus it returns no more precision than is justified
by the inputs and mathematical manipulations. It can be most educational
to see how quickly one loses precision in relatively simple calculations,
and how such loss is affected by different forms of mathematically identical
calculations. It is not difficult to devise calculations that start
with high precision inputs and return negative precision results: values
which are entirely meaningless. We strongly recommend the use of such a
tool for investigating the effects of any data manipulations used to
generate AutoClass inputs.
Lacking access to Mathematica or an equivalent, one should certainly
investigate the effects of varying data values over their error range to
gauge the effect on the resulting functional values. Sometimes this can
be done symbolically. More often it will require a numerical
investigation. Of course, such investigations assume that one knows the
error range or precision of the function inputs. Lacking definite
information on this point, one can use the default truncation value.
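A minimal numerical investigation of this kind might look like the
following Python sketch. The helper name and the example function are
hypothetical. Evaluating only the corners of the error box bounds the output
range for functions that are monotonic in each input; other functions would
need a denser sampling of the box.

```python
import itertools

def output_range(func, values, errors):
    """Evaluate func at every corner of the input error box; return (min, max)."""
    corners = itertools.product(*[(v - e, v + e) for v, e in zip(values, errors)])
    results = [func(*c) for c in corners]
    return min(results), max(results)

# Hypothetical derived attribute: the difference of two nearly equal measurements,
# each known to within +/- 0.05.
lo, hi = output_range(lambda a, b: a - b, [10.2, 10.0], [0.05, 0.05])
print(lo, hi)
```

Here the inputs are known to about half a percent, yet the computed
difference of roughly 0.2 is uncertain by about half its own value.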
The fundamental question in all of this is: "To what extent do you
believe the numbers that are to be given to AutoClass?" AutoClass will
run quite happily with whatever it is given. It is up to the user to
decide what is meaningful and what is not.
"Real scalar" is our term for singly bounded real values, typically
bounded below at zero. Classical examples are the height and length of
an object, neither of which can be negative. The corresponding counter
examples would be elevation and location, both measured with respect to
some arbitrary zero and capable of going negative. For scalar reals we
use the Lognormal model which has zero probability density at and below
zero. This is currently implemented by taking the logs of the data
values and applying the Gaussian Normal model. The current Normal model
requires a constant error term to set the bounds of integration.
It turns out that a constant error in the logarithm of a value is
equivalent to a relative error in the original value. That is, the
error in the value should be proportional to the value, rather than
being itself a constant. And REL_ERROR is just the ratio of the error
to the value. If your knowledge of the data generating process is
sufficient to specify such a ratio, just give it as the value of
REL_ERROR. Otherwise give your estimate of the constant error as ERROR,
and AutoClass will compute the ratio of this to the average data value
and use this as REL_ERROR.
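The equivalence follows from a first-order expansion of the logarithm. The
symbols are illustrative notation rather than AutoClass keywords: x is a data
value, \delta x its constant error, and \bar{x} the average data value.

```latex
\ln(x + \delta x) - \ln x
  = \ln\!\left(1 + \frac{\delta x}{x}\right)
  \approx \frac{\delta x}{x}
  \qquad (|\delta x| \ll x),
\qquad\text{so}\qquad
\mathrm{REL\_ERROR} \approx \frac{\mathrm{ERROR}}{\bar{x}} .
```

For example, an ERROR of 0.5 on data averaging 50 corresponds to a REL_ERROR
of about 0.01.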
1.2 Probability Models
The SINGLE_MULTINOMIAL, SINGLE_NORMAL_CM, and SINGLE_NORMAL_CN
probability models assume that the attributes are conditionally independent
given the class. Thus within each class the probability that an instance
of the class has a particular value of any attribute depends only
on the class and is independent of all other attribute values. The
MULTI_NORMAL_CN covariant model expresses mutual dependences within the
class. The probability that the class will produce any particular instance
is then the product of any independent and covariant probability terms.
We use covariant or independent multinomial model terms for discrete
attributes of nominal, ordered, and circular subtypes (all are currently
handled identically). These model terms allow any number of values for an
attribute, including unknown values. We use Gaussian normal model terms for
real numerical attributes, or any attributes representing measurements. There are
actually two independent versions, one of which allows for the possibility of
unknown values. The covariant normal model term requires that all attribute
values be known for every case. There is also an `ignore' model term for
attributes which are not to be considered in generating the classification.
1.2.1 SINGLE_NORMAL_CN/CM and MULTI_NORMAL_CN Models
The SINGLE_NORMAL_CN/CM and MULTI_NORMAL_CN models were originally
written for use with real valued attributes of the location subtype.
Such attributes are unbounded - their values can potentially range to
+/- infinity. A scalar real valued attribute is singly bounded. Its
values are constrained by prior information to lie to one side of a zero
point, typically 0.0, and have no values lying in the `negative' region. Thus
Normal models that assign nonzero probability density to the `negative'
region are less than optimally informative. Note that we say `less than
optimal' rather than `incorrect'. The standard Normal model will
generally do a good job of classifying scalar reals, and will do an
excellent job with scalar reals that are clustered well away from the
zero point. But we can do better, especially when the values cluster
close to the zero point.
The better model is the Log-Normal, obtained by substituting ln(x-zero)
for x in the Normal model. This model assigns zero probability density
to x <= zero. The peak probability density value, at x=e^(mu-sigma^2),
can be arbitrarily close to the zero point while quite independent of the
distribution's variance of s^2 = e^(2*mu + sigma^2)*(e^(sigma^2) - 1).
Yet for small sigma, say sigma < .1, the Log-Normal is visually
indistinguishable from a Normal.
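The peak location and variance quoted above can be checked numerically with
a short Python sketch. The function names are hypothetical and this is not
AutoClass code. Adding the zero point to the peak location is an assumption:
the formula in the text is stated for a zero point of 0.

```python
import math

def lognormal_pdf(x, mu, sigma, zero=0.0):
    """Density of the Log-Normal: ln(x - zero) is Normal(mu, sigma)."""
    if x <= zero:
        return 0.0
    z = (math.log(x - zero) - mu) / sigma
    return math.exp(-0.5 * z * z) / ((x - zero) * sigma * math.sqrt(2.0 * math.pi))

def lognormal_mode(mu, sigma, zero=0.0):
    """Location of the peak density, shifted by the zero point."""
    return zero + math.exp(mu - sigma ** 2)

def lognormal_variance(mu, sigma):
    """Variance s^2 = e^(2*mu + sigma^2) * (e^(sigma^2) - 1)."""
    return math.exp(2.0 * mu + sigma ** 2) * (math.exp(sigma ** 2) - 1.0)

mu, sigma = 0.0, 0.5
mode = lognormal_mode(mu, sigma)
# The density at the mode should exceed the density slightly to either side.
assert lognormal_pdf(mode, mu, sigma) > lognormal_pdf(mode * 0.99, mu, sigma)
assert lognormal_pdf(mode, mu, sigma) > lognormal_pdf(mode * 1.01, mu, sigma)
print(mode, lognormal_variance(mu, sigma))
```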
In the current AutoClass C we obtain the Log-Normal model by transforming
the attribute values to ln(x-zero) and applying the appropriate
xxx-Normal-yy model. The key for obtaining this variation is the
specification of the real subtype as `scalar', with appropriate
ZERO_POINT and REL_ERROR values. When AutoClass C is instructed
to apply a Normal model to such an attribute, it automatically performs
the transformation, effectively applying the equivalent Log-Normal model.
The specification of a real valued attribute's subtype is thus a
specification of the type of Normal model to be used on that attribute.
The MULTI_NORMAL_CN model implements a multidimensional normal distribution
over a group of attributes that have real continuous values, with no values
missing. It is the model of choice for such attributes when they are
thought to have correlated values. When such correlations are present,
the classifications obtained using the MULTI_NORMAL_CN model will generally
be more probable than those obtained with the SINGLE_NORMAL_CN model, because
they better describe the data distributions. But one should not apply it
indiscriminately. Lacking strong prior evidence for correlations, but
suspecting them, one needs to try all reasonable combinations of attributes
and compare the probability of the resulting classifications.
As an example, consider a database of instance vectors describing physical
objects that have intrinsic size and shape, neither of which is recorded.
Then one expects that the recorded attributes length, width, and depth, will
vary linearly with size, and have differing ratios with respect to shape.
Given a sufficient number of instances of each shape, the MULTI_NORMAL_CN
model applied to length, width, and depth, should pick out the shape
classes in terms of the attribute correlations. The SINGLE_NORMAL_CN
model might pick out the shapes, but it would tend to divide each shape
into classes of similar size, and to merge similar sizes of differing
shapes into common classes.
1.3 Input Files
An AutoClass data set resides in two files. There is a header file
(file type "hd2") that describes the specific data format and attribute
definitions. The actual data values are in a data file (file type "db2").
We use two files to allow editing of data descriptions without having to
deal with the entire data set. This makes it easy to experiment with
different descriptions of the database without having to reproduce the data
set. Internally, an AutoClass database structure is identified by its
header and data files, and the number of data loaded.
A classification of a data set is made with respect to a model which
specifies the form of the probability distribution function for classes in that
data set. Normally the model structure is defined in a model file (file
type "model"), containing one or more models. Internally, a model is defined
relative to a particular database. Thus it is identified by the corresponding
database, the model's model file and its sequential position in the file.
1.3.1 Data File
The data file contains a sequence of data objects (datum or case)
terminated by the end of the file. The number of values for each data object
must be equal to the number of attributes defined in the header file. There is
an implied "newline" ('\n') after each data object. Data objects must be
groups of tokens delimited by "newline". Attributes are typed as REAL,
DISCRETE, or DUMMY. Real attribute values are numbers, either integer or
floating point. Discrete attribute values can be strings, symbols, or integers.
A dummy attribute value can be any of these types. Dummies are read in but
otherwise ignored - they will be set to zeros in the internal database.
Thus the actual values will not be available for use in report output.
To have these attribute values available, use either type REAL or type
DISCRETE, and define their model type as IGNORE in the .model file.
Missing values for any attribute type may be represented by either '?', or
another token specified in the header file. All are translated to a special
unique value after being read, so this symbol is effectively reserved for
unknown/missing values.
Example:
white 38.991306 0.54248405 2 2 1
red 25.254923 0.5010235 9 2 1
yellow 32.407973 ? 8 2 1
all_white 28.953982 0.5267696 0 1 1
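To make the format concrete, here is a minimal Python sketch that splits one
data line into tokens and marks unknown values. It is not AutoClass's own
reader: the function name is hypothetical, and quoted strings and comment
lines are not handled.

```python
def parse_case(line, n_attributes, separator=None, unknown_token="?"):
    """Split one data line into tokens; unknown values become None."""
    # separator=None splits on any run of whitespace.
    tokens = line.split(separator)
    if len(tokens) != n_attributes:
        raise ValueError(f"expected {n_attributes} values, got {len(tokens)}")
    return [None if t.strip() == unknown_token else t.strip() for t in tokens]

case = parse_case("yellow 32.407973 ? 8 2 1", n_attributes=6)
print(case)
```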
The data file can optionally be input in binary format. This is useful for
very large data files in order to reduce disk space and time for reading
the file. The user must create the binary file to conform to the following:
 - the file name extension must be ".db2-bin", rather than ".db2".
 - the file begins with a 12-byte header:
     - char[8] = ".db2-bin",
     - 32-bit integer with byte-length of each data case.
 - the data cases follow in binary "float" format - 32-bit fields.
Real valued data, and discrete integer data converted to floating point
format are accommodated. Discrete character data (e.g. "white", in above
example) would have to be assigned integer values, and converted to
floating point format.
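A binary file of this layout could be assembled along the following lines.
This Python sketch rests on several assumptions that the description above
does not settle: native byte order, a signed 32-bit length field, and the
8-character tag ".db2-bin". The function name is hypothetical, and the
encoding of missing values in binary files is not covered.

```python
import struct

def pack_db2_bin(cases):
    """Build the bytes of a binary data file from equal-length lists of numbers."""
    n_values = len(cases[0])
    if any(len(c) != n_values for c in cases):
        raise ValueError("all data cases must have the same number of values")
    case_bytes = 4 * n_values  # one 32-bit float per value
    # 12-byte header: 8-character tag, then 32-bit integer byte-length per case.
    # Byte order and signedness are assumptions (see the lead-in).
    blob = struct.pack("=8si", b".db2-bin", case_bytes)
    for c in cases:
        blob += struct.pack(f"={n_values}f", *c)
    return blob

# Discrete character data would first have to be mapped to integers.
data = pack_db2_bin([[38.991306, 0.54248405, 2, 2, 1],
                     [25.254923, 0.5010235, 9, 2, 1]])
print(len(data))
```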
Note: DOS derived data files that are to be used in a Unix environment should
first be processed with dos2unix, to remove carriage returns (^M) from the
lines. We have observed a case where such carriage returns were read as part
of a discrete data value, passed through AutoClass, and printed in the
xxx.influ report, where they destroyed the data formatting. Should this
occur, xxx.influ data formatting can still be restored with dos2unix.
1.3.1.1 Handling Missing Values
Since we were designing AutoClass to work with arbitrary data sets, we could
make no universally valid assumptions about the mechanisms that generate any
missing data the system might encounter. Lacking specifics, we could choose
no basis for "correcting" missing data. Thus we were forced to deal with
the data, and only the data, independent of any information about the data's
origins. This is the great disadvantage of any general purpose classifier:
You either make assumptions that seem good for the current application, but
will be absurd in others, or you ignore the background information that
justifies such assumptions.
We took the latter course, treating missing values as valid data. Thus our
classifications are actually over the convolution of original subjects through
the data collection process, and our results may be dominated by either. When
no missing values are present, one expects the results to be dominated by the
subject characteristics. With a large proportion of missing values, the
subjects are much obscured by the data collection process, and one must expect
that any patterns found in the data may be due to the collection process
rather than the subjects. Only strong prior knowledge about the collection
process can justify attempting to deconvolve the data.
Note that if one regards a classifier as classifying subjects, rather than
data on subjects, then missing data is merely the most obvious example of
erroneous data, which presents a far larger and more intractable problem. The
assumption, common under this viewpoint, that only the missing data are in
error, is clearly absurd. AutoClass deals only with the existing record.
Attempting to classify what *should* have been recorded, requires a far more
sophisticated system that is carefully tuned to the collection process.
1.3.2 Header File
The header file specifies the data file format, and the definitions of
the data attributes. The header file functional specification consists of
two parts - the data set format definition specifications, and the
attribute descriptors (; in column 1 identifies a comment):
;; num_db2_format_defs value (number of format def lines that follow),
;; range of n is 1 -> 5
num_db2_format_defs n
;; number_of_attributes token and value required
number_of_attributes <as required>
;; following are optional - default values are specified
separator_char ' '
comment_char ';'
unknown_token '?'
separator_char ','
;; attribute descriptors
;; <zero-based att#> <att_type> <att_sub_type> <att_description> <att_param_pairs>
Each attribute descriptor is a line of:
Attribute index (zero based, beginning in column 1)
Attribute type. See below.
Attribute subtype. See below
Attribute description: symbol (no embedded blanks) or string; <= 40 characters
Specific property and value pairs. See below.
Currently available combinations:
type        subtype     property type(s)
----        -------     ----------------
dummy       none/nil    --
discrete    nominal     range
real        location    error
real        scalar      zero_point rel_error
An example is given below in section 1.3.2.1.
The ERROR property should represent your best estimate of the average error
expected in the measurement and recording of that real attribute. Lacking
better information, the error can be taken as 1/2 the minimum possible
difference between measured values. It can be argued that real values are
often truncated, so that smaller errors may be justified, particularly for
generated data. But AutoClass only sees the recorded values. So it needs the
error in the recorded values, rather than the actual measurement error. Setting
this error much smaller than the minimum expressible difference implies the
possibility of values that cannot be expressed in the data. Worse, it implies
that two identical values must represent measurements that were much closer
than they might actually have been. This leads to overfitting of the
classification.
The REL_ERROR property is used for SCALAR reals when the error is
proportional to the measured value. The ERROR property is not supported.
AutoClass uses the error as a lower bound on the width of the normal
distribution. So small error estimates tend to give narrower peaks and to
increase both the number of classes and the classification probability. Broad
error estimates tend to limit the number of classes.
The scalar ZERO_POINT property is the smallest value that the measurement
process could have produced. This is often 0.0, or less by some error range.
Similarly, the bounded real's min and max properties are exclusive bounds on
the attribute's generating process. For a calculated percentage these would be
0-e and 100+e, where e is an error value. The discrete attribute's range is
the number of possible values the attribute can take on. This range must
include unknown as a value when such values occur.
1.3.2.1 Header File Example
!#; AutoClass C header file -- extension .hd2
!#; the following chars in column 1 make the line a comment:
!#; '!', '#', ';', ' ', and '\n' (empty line)
;#! num_db2_format_defs <num of def lines -- min 1, max 4>
num_db2_format_defs 2
;; required
number_of_attributes 7
;; optional - default values are specified
;; separator_char ' '
;; comment_char ';'
;; unknown_token '?'
separator_char ','
;; <zero-based att#> <att_type> <att_sub_type> <att_description> <att_param_pairs>
0 dummy nil "True class, range = 1 - 3"
1 real location "X location, m. in range of 25.0 - 40.0" error .25
2 real location "Y location, m. in range of 0.5 - 0.7" error .05
3 real scalar "Weight, kg. in range of 5.0 - 10.0" zero_point 0.0 rel_error .001
4 discrete nominal "Truth value, range = 1 - 2" range 2
5 discrete nominal "Color of foobar, 10 values" range 10
6 discrete nominal Spectral_color_group range 6
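The structure of an attribute descriptor line can be illustrated with a
small Python sketch that splits it into its fields. This is not AutoClass's
parser: the function name and the returned dictionary layout are
hypothetical, and the standard `shlex` module is used only to keep quoted
descriptions together.

```python
import shlex

def parse_descriptor(line):
    """Parse '<index> <type> <subtype> <description> <property value pairs...>'."""
    tokens = shlex.split(line)
    index, att_type, sub_type, description = tokens[:4]
    rest = tokens[4:]
    if len(rest) % 2 != 0:
        raise ValueError("property names and values must come in pairs")
    params = dict(zip(rest[0::2], rest[1::2]))
    return {"index": int(index), "type": att_type, "subtype": sub_type,
            "description": description, "params": params}

d = parse_descriptor('3 real scalar "Weight, kg. in range of 5.0 - 10.0" '
                     'zero_point 0.0 rel_error .001')
print(d["params"])
```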
1.3.3 Model File
The model file contains data describing the model(s) that will be used for
the classification. Each model is specified by one or more model group
definition lines. Each model group line associates zero-based attribute indices
with a model term type.
Each model group line consists of:
A model term type (one of single_multinomial, single_normal_cm,
single_normal_cn, multi_normal_cn, or ignore).
One or more attribute indices (attribute set list), or the symbol
default.
Notes:
 - At least one model definition is required (model_index token).
 - There may be multiple entries in a model for any model term type.
 - An attribute index must not appear more than once in a model list.
 - ignore is not a valid default model term type.
 - Model term types currently consist of:
     single_multinomial - models discrete attributes as multinomials,
                          with missing values.
     single_normal_cn   - models real valued attributes as normals; no
                          missing values.
     single_normal_cm   - models real valued attributes with missing values.
     multi_normal_cn    - is a covariant normal model without missing values.
     ignore             - allows the model to ignore one or more attributes.
 - See the documentation in models-c.text for further information about
   specific model terms.
 - single_normal_cn/cm and multi_normal_cn modeled data, whose subtype is
   scalar (value distribution is away from 0.0, and is thus not a "normal"
   distribution) will be log transformed and modeled with the log-normal
   model. For data whose subtype is location (value distribution is
   around 0.0), no transform is done, and the normal model is used.
1.3.3.1 Model File Example
The tokens "model_index n m" must appear on the first non-comment line, and
precede the model term definition lines. "n" is the zero-based model index,
typically 0 where there is only one model - the majority of search situations.
"m" is the number of model term definition lines that follow. Note that
single model terms may have one or more zero-based attribute indices on
each line. Multi model term set lists are two or more zero-based attribute
indices per line.
!#; AutoClass C model file -- extension .model
!#; the following chars in column 1 make the line a comment:
!#; '!', '#', ';', ' ', and '\n' (empty line)
;; 1 or more model definitions
;; model_index <zero_based index> <number of model definition lines>
model_index 0 7
ignore 0
single_normal_cn 3
single_normal_cn 17 18 21
multi_normal_cn 1 2
multi_normal_cn 8 9 10
multi_normal_cn 11 12 13
single_multinomial default
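Two of the rules above (the term count given on the model_index line, and no
attribute index appearing twice) can be checked before a run with a short
Python sketch. It handles a single model only, the function name is
hypothetical, and it is independent of the checks AutoClass itself performs.

```python
def check_model(lines):
    """Check one model: term count matches model_index and no index repeats."""
    # Skip empty lines and lines whose first column is a comment character.
    body = [l.split() for l in lines
            if l.strip() and l[0] not in "!#; "]
    keyword, _model_index, n_terms = body[0]
    if keyword != "model_index":
        raise ValueError("first non-comment line must be model_index")
    terms = body[1:]
    if len(terms) != int(n_terms):
        raise ValueError("model_index term count does not match")
    seen = set()
    for _term_type, *indices in terms:
        for idx in indices:
            if idx != "default" and idx in seen:
                raise ValueError(f"attribute {idx} appears more than once")
            seen.add(idx)
    return len(terms)

example = """model_index 0 7
ignore 0
single_normal_cn 3
single_normal_cn 17 18 21
multi_normal_cn 1 2
multi_normal_cn 8 9 10
multi_normal_cn 11 12 13
single_multinomial default""".splitlines()
print(check_model(example))
```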
1.4 Checking Input Files
AutoClass, when invoked in the "search" mode will check the validity
of the set of data, header, model, and search parameter files. Errors
will stop the search from starting, and warnings will ask the user whether
to continue. A history of the error and warning messages is saved,
by default, in the log file. The AutoClass search form is:
% autoclass -search <data file path> <header file path> <model file path>
<search params file path>
All files must be specified as fully qualified relative or absolute
pathnames. File name extensions (file types) for all files are forced to
canonical values required by the AutoClass program:
data file ("ascii")     db2
data file ("binary")    db2-bin
header file             hd2
model file              model
search params file      s-params
The search parameter definitions are discussed in search-c.text, as
well as contained as comments in all .s-params files: for example,
autoclass-c/sample/imports-85.s-params.
The log file will be named <search params file path>. If LOG_FILE_P
(search params file parameter) is false, then no log file is generated.
The log file is created such that multiple sessions of AUTOCLASS -SEARCH <...>
will result in only one log file. The file extension of the log file is
forced to "log".
N_DATA (search params file parameter), if supplied, allows the reading of
less than the full data file. This is useful when the data file is very large
and you are just interested in validating the header and model file contents.
All advisory, warning, and error messages are output to the screen, and to
the log file, providing that the LOG_FILE_P argument is true (the default).
Advisory messages are output to provide information which is not crucial to
the continuance of the run. Warning messages contain information which may
affect the quality of the run. The default condition is to stop
the run when one or more warning messages are generated, and ask the user
whether to proceed. Error messages are fatal, and the run will be
terminated.
Footnotes:
#1) REL_ERROR is in uppercase to distinguish it from the surrounding text.
It represents the lowercase keyword rel_error which is how it is used
in the .hd2 file. This is true of other uppercase words or phrases
which occur in this text.
