1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262

This file discusses what classification is, what a user should expect to
achieve (with AutoClass), and how to use the results.
Why Classification:
A classification is usually thought of as a partitioning of a set of things
into subsets in which the members are "more similar" to each other than to
nonmembers. The concept of "more similar" is open to numerous
interpretations. Whenever we say `this thing is a Fubar' we are assigning the
thing to the class designated by the name 'Fubar'. Classification is basic to
the act of identification, and provides one of the simplest forms of
prediction. In saying `all (or most) fubar are snafu' we expect that any thing
classified as fubar is likely to exhibit the property of snafu. Often the only
rational justification for a statement like `all (or most) fubar are snafu' is
a statistical association or causal hypothesis found by study of the snafu
property in things classified as fubar.
What to classify:
The members of almost any set of things that can be described in a regular
manner can be classified, however, finding an appropriate representation may be
a problem. In the AutoClass work we currently limit our analysis to things
that can be described by a set of features or properties, without referring to
other things. This allows us to represent our things by a data vector
corresponding to a fixed attribute set. The attributes are names of measurable
or distinguishable properties of the things. The data values corresponding to
each attribute are thus limited to be either numbers or the elements of a fixed
set of attributespecific symbols. AutoClass cannot express relationships
between things because such relationships are not a property of the thing
itself. In particular, AutoClass cannot express time relationships such as
"before" or "after". Nor can it account for any sequential ordering of the
data. Despite these limitations, AutoClass is applicable to the many problems
where such relations do not exist, can be ignored or transformed into
appropriate attributes. For instance, temporal ordering data can be
reexpressed as "time of (year, week, day)" or "time elapsed since ...". A
similar problem occurs with `subset' type attributes whose potential values are
subsets of a fixed set of possible values. These can be transformed into a set
of binary attributes corresponding to each of the possible values. The number
of thing to be classified can be large or small. The number of classes
AutoClass can be expected to find is typically much smaller than the number of
data.
What is a classification:
Classification is typically described as a partitioning of the set of
things. This is the way humans typically approach classification. Given the
question `what is it?', our response is to say `it is a fubar' or `it is a
baz'. Partitioning is also appealing when we classify in order to predict,
particularly when we want to predict some property of a thing. However,
partitioning implies sharp boundaries: either logical subsets of discrete
attributes or subranges of measured attributes. We may be limited in our
ability to distinguish discrete attribute values, and we are always limited in
our ability to measure continuous attribute values. Thus a partitioning
classifier may find some things lying "near" the class boundaries for which
unequivocal class assignment is not possible and may not be desirable.
Our solution to this problem is to redefine what is meant by
classification. In the AutoClass approach, class membership is expressed
probabilistically rather than as logical assignment. Thus every one of our
things is considered to have a probability that it belongs to each of the
possible classes. These class membership probabilities must sum to 1 for each
thing, because each thing must belong to **some** class, even though there is
insufficient information to say **which** class. This form of "fuzzy"
classification produces classes that are closer to everyday concepts of "cats",
"cars" and "crazy people" than partitioning does. There are no boundaries and
no class assignments, only the membership probabilities.
When every single thing has a probability of more than .99 in its most
probable class, the classes are well separated and we have well defined
measures of class similarity and dissimilarity. If few things have a
probability of more than .5 in any class, it means that the classes are heavily
overlapped. In this case the combination of the classes into larger structures
is usually more meaningful than considering each class separately. In either
case we have a measure of how well our classification fits the data and
individual data fit the classes. No simple partitioning can provide such
information.
Class Models:
We achieve probabilistic classification by redefining the classes in terms
of parameterized probability distributions, such as normal (Gaussian) or beta
distributions. We choose distributions that mimic the processes that are
suspected of having produced the observed data. We limit our choice to those
that have a probability value (discrete attributes) or probability density
(real valued attributes) throughout the attribute space. A thing's class
probabilities are then computed directly from the parameterized class
distributions. Thus the classes provide statistical models of thing
descriptions. The class parameters may be thought of as indirectly specifying
regions in attribute space where each class's probability dominates all others.
These regions correspond approximately to the subsets of the attribute space
that a partitioning classifier would use.
In AutoClass we refer to the functional form of our probability
distributions as the class model. A **class** is defined as a particular set
of parameter values and their associated model. A **classification** is
defined as a set of classes and the probabilities of each class. The
classification problem is to first choose an appropriate class model (or set of
alternate models), then to search out good classifications based on these
models, and finally to rate the relative quality of the alternative
classifications.
As an example of such modeling, consider the case of an attribute such as
"height". If we expect that the measured height of a thing depends only on
its class, we can model the probable value in terms of a characteristic
height. We know there will some measurement error and we usually expect to
find some intrinsic variation in the heights of a class of things. Our model
must allow for such variation. Also, height is a scalar quantity bounded below
by zero. Experience has shown that such scalar quantities typically have a
normal distribution in Log space (which is not bounded by zero). Therefore we
choose a log normal distribution for our model. The corresponding model
parameters are the mean log height and standard deviation of log height.
Given specific parameter values, we can calculate the probability that any
particular observation would result from measurement of the height of any thing
known to be a member of the class. Given multiple classes, we can calculate
the relative probability that a particular height would come from each class.
There is no generally accepted way to rate the relative quality of
alternate classifications. The method of setting up models and searching out
sets of descriptive classes has been the subject of statistical research for
decades. The type of model that we have described, that gives the probability
of the data conditioned on the hypothesized model and parameters: P(XH,p); is
known as a likelihood function. Maximum Likelihood Estimation (MLE) deals with
finding the set of models and parameters that maximizes this probability. In
some quarters this is the method of choice for classification. Unfortunately,
MLE fails to provide a convincing way to compare alternate classifications that
differ in class models or the number of classes. MLE increases with both model
complexity and number of classes (until the class number equals the number of
things). Clearly this contradicts our intuition that moderately simple
classifications are more desirable than very complex ones.
What we would really like to have is the probability of the hypothesized
model given the data, P(HX). Alternatively, we can find the **relative**
probabilities of models given the data; i.e. P(H1X)/P(H2X) = P(H1,X)/P(H2,X).
Then we can directly compare alternate models, in this case models with
different numbers of classes. The value of P(H,X) is obtained from P(H,X) =
P(XH)P(H), where P(XH) is obtained by integration over the parameters in
P(X,pH) = P(Xp,H)P(pH). So we need two additional probabilities for each
hypothesized model: the model probability P(H) and the conditional parameter
probability distribution P(pH). The mathematics for doing this integration in
the current AutoClass models are described in the referenced papers listed
in the "readme.text" file.
Results from AutoClass
The result of an AutoClass run is one or more of the best classifications
found. A classification effectively consists of the class model(s) and a set
of classes, each with the class probability and parameters. Classifications
are rated in terms of the log of the relative marginal probability of the
hypothesized model given the data. Those with log marginals that differ by
less than 5 are are considered to be nearly equally probable. This is because
any other model that gives better classifications will probably give far better
classifications. Note that the log marginal is strongly dependent on the data,
and thus comparisons can only be made between classifications made on a single
database.
Using Classification results:
The ultimate reason for doing classification is to increase understanding of
the domain or to improve predictions compared to unclassified data. Given a
classification and a partial observation, one can always use the classification
to make a statistical estimate of the unobserved attribute values. One can
also use a classification as the departure point for constructing new models,
or theories specific to the particular domain, based on the insight provided by
the classification and the user's domain knowledge.
When classification is done by partitioning, prediction can only be done by
selecting the classes compatible with a partial observation and then listing
the allowed values or ranges of the unknown attributes. With statistical
classification, the usual procedure is to select the class that most probably
generates the partial observation and then use that class's distributions for
the unknowns. This is generally satisfactory only if the probability of the
chosen class is far greater than any other. The result is quite doubtful when
the thing has relatively high probability of belonging to more than one class,
and especially so when alternative classes predict very different values for
the unknowns.
We prefer to use the class probabilities of the thing to construct a
weighted sum of the predictions of all the classes. If the resulting attribute
distribution is broad and flat we know that the partial observation does not
contain sufficient relevant information to predict this attribute. If the
distribution has a sharp single peak we can predict the attribute value with
confidence, knowing that either one class is nearly certain or that a choice
among the most probable classes would be of little influence. A multiply
peaked distribution gives us a set of weighted choices. We can still make a
prediction, but now the knowledge of the alternatives and their relative
probabilities may prevent overconfidence.
The prediction subsystem (not currently implemented) provides just such
a capability for the limited case of discrete attributes modeled with an
independent multinomial distribution. Extensions to dependent discrete
attributes and to real valued attributes are under consideration. Efficient
representation of probability distributions for real values is a special
problem. Until the later have been implemented, we provide a function for
predicting the class probabilities of new cases.
Refinement of Statistical Models:
The models currently used in AutoClass are quite simple. There are both
independent and covariant versions of the multinomial model for discrete
attributes and the Gaussian normal model for real valued attributes. There
are two minor refinements. For discrete attributes and independent reals, we
model missing attribute values as a discrete value of the attribute: `failure
to observe'. This acknowledges that we are classifying the results of
observations, not the things observed, and must model what we have rather than
what we might have obtained. The second refinement is to allow translations of
continuous valued attributes. The simple normal model implicitly assumes that
the attribute values represent noisy measurements of a class point value that
could lie anywhere between plus and minus infinity. The log normal and log
odds normal translations extend this model to classify point values that are
bounded below, and bounded above and below. These translations apply equally
to both independent and covariant models.
The simplest way to refine a model is to eliminate consideration of
attributes that are not contributing anything to the classification. The
obvious case of a discrete attribute where every observation has the same value
is flagged by AutoClass. A close inspection of the influence value reports
will usually show several attributes that have negligible influence on all
classes. In the absence of strong prior reasons for their retention, these can
be ignored. In AutoClass this is done by specifying the `ignore' model term
for any attribute. One will usually find that this increases the posterior
probabilities and sharpens the probabilities of both the class assignments and
the remaining attributes. But ignoring attributes introduces a problem in
comparing the resulting classifications. In ignoring some attributes, we are
effectively working with a smaller database. This alone will increase the
probability of our classifications. The solution is to normalize the
classification probability relative to the quantity of data used for the
classification. The simplest way is to use the ratio of the probability to the
number of active attributes. One should also factor in the number of cases
when comparing classifications over databases of differing case numbers.
The next step is to allow for dependence of attributes within the class.
The multixxx model terms provide a limited capability for this, allowing
covariance of discrete or real valued attributes, but not together. However
this version of AutoClass has no capability for searching the space of possible
covariances. Such search must be done manually by running with alternate model
files and comparing the resulting best classifications. There are two points
to keep in mind. A covariant attribute model requires more parameters than
would be needed to model the same attributes independently. Thus the covariant
model is apriori less probable than the independent model. This is another
example of the tradeoff between descriptivness and the cost of description. It
will rarely be advantageous to simply group all attributes into a single
covariant likelihood term. The real advantage of covariant models appears when
the covariant groups are carefully chosen with regard to everything known about
the attributes. The second point involves running time. There is an increase
in both the cycle time and the number of cycles required to generate a
classification with covariant attributes. And this time increase is roughly
proportional to the maximum covariant group size. So the use of large
covariant attribute groups entails large classification time penalties.
It may be advantageous to do a principle components analysis on the data
set, and to use the new independent variables as input to AutoClass. This will
tend to remove the major covariances that might be present. Minor covariances,
such as those limited to specific classes, will still need to explicitly
modeled.
