1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
|
INTERPRETATION OF AUTOCLASS RESULTS
------------------------------------
1.0 What Have You Got?
2.0 Assumptions
3.0 Influence Report
4.0 Cross Entropy
5.0 Attribute Influence Values
6.0 Class And Case Reports
7.0 Comparing Influence Report Class Weights And Class/Case Report Assignments
8.0 Alternative Classifications
9.0 What Next?
1.0 WHAT HAVE YOU GOT?
----------------------
Now you have run AutoClass on your data set -- what have you got? Typically,
the AutoClass search procedure finds many classifications, but only saves the
few best. These are now available for inspection and interpretation. The most
important indicator of the relative merits of these alternative classifications
is Log total posterior probability value. Note that since the probability
lies between 1 and 0, the corresponding Log probability is negative and ranges
from 0 to negative infinity. The difference between these Log probability
values raised to the power e gives the relative probability of the alternatives
classifications. So a difference of, say 100, implies one classification is
e^100 ~= 10^43 more likely than the other. However, these numbers can be very
misleading, since they give the relative probability of alternative
classifications under the AutoClass ***assumptions***.
2.0 ASSUMPTIONS
---------------
Specifically, the most important AutoClass assumptions are the use of normal
models for real variables, and the assumption of independence of attributes
within a class. Since these assumptions are often violated in practice, the
difference in posterior probability of alternative classifications can be
partly due to one classification being closer to satisfying the assumptions
than another, rather than to a real difference in classification quality.
Another source of uncertainty about the utility of Log probability values is
that they do not take into account any specific prior knowledge the user may
have about the domain. This means that it is often worth looking at
alternative classifications to see if you can interpret them, but it is worth
starting from the most probable first. Note that if the Log probability value
is much greater than that for the one class case, it is saying that there is
overwhelming evidence for ***some*** structure in the data, and part of this
structure has been captured by the AutoClass classification.
3.0 INFLUENCE REPORT
--------------------
So you have now picked a classification you want to examine, based on its Log
probability value; how do you examine it? The first thing to do is to
generate an "influence" report on the classification using the report
generation facilities documented in "autoclass-c/doc/reports-c.text". An
influence report is designed to summarize the important information buried
in the AutoClass data structures.
The first part of this report gives the heuristic class "strengths".
Class "strength" is here defined as the geometric mean probability that
any instance "belonging to" class, would have been generated from the
class probability model. It thus provides a heuristic measure of how
strongly each class predicts "its" instances.
The second part is a listing of the overall "influence" of each of the
attributes used in the classification. These give a rough heuristic
measure of the relative importance of each attribute in the
classification. Attribute "influence values" are a class probability
weighted average of the "influence" of each attribute in the classes, as
described below.
The next part of the report is a summary description of each of the
classes. The classes are arbitrarily numbered from 0 up to n, in order
of descending class weight. A class weight of say 34.1 means that the
weighted sum of membership probabilities for class is 34.1. Note that
a class weight of 34 does not necessarily mean that 34 cases belong to
that class, since many cases may have only partial membership in that
class. Within each class, attributes or attribute sets are ordered by
the "influence" of their model term.
4.0 CROSS ENTROPY
-----------------
A commonly used measure of the divergence between two probability
distributions is the cross entropy: the sum over all possible values x,
of P(x|c...)*log[P(x|c...)/P(x|g...)], where c... and g... define the
distributions. It ranges from zero, for identical distributions, to
infinite for distributions placing probability 1 on differing values of
an attribute. With conditionally independent terms in the probability
distributions, the cross entropy can be factored to a sum over these
terms. These factors provide a measure of the corresponding modeled
attribute's influence in differentiating the two distributions.
We define the modeled term's "influence" on a class to be the cross
entropy term for the class distribution w.r.t. the global class
distribution of the single class classification. "Influence" is thus a
measure of how strongly the model term helps differentiate the class
from the whole data set. With independently modeled attributes, the
influence can legitimately be ascribed to the attribute itself. With
correlated or covariant attributes sets, the cross entropy factor is a
function of the entire set, and we distribute the influence value
equally over the modeled attributes.
5.0 ATTRIBUTE INFLUENCE VALUES
------------------------------
In the "influence" report on each class, the attribute parameters for that
class are given in order of highest influence value for the model term
attribute sets. Only the first few attribute sets usually have
significant influence values. If an influence value drops below about 20% of
the highest value, then it is probably not significant, but all attribute sets
are listed for completeness. In addition to the influence value for each
attribute set, the values of the attribute set parameters in that class are given
along with the corresponding "global" values. The global values are computed
directly from the data independent of the classification. For example, if the
class mean of attribute "temperature" is 90 with standard deviation of 2.5, but
the global mean is 68 with a standard deviation of 16.3, then this class has
selected out cases with much higher than average temperature, and a rather
small spread in this high range. Similarly, for discrete attribute sets, the
probability of each outcome in that class is given, along with the
corresponding global probability -- ordered by its significance: the
absolute value of (log {<local-probability> / <global-probability>}). The
sign of the significance value shows the direction of change from the global
class. This information gives an overview of how each class differs from the
average for all the data, in order of the most significant differences.
6.0 CLASS AND CASE REPORTS
--------------------------
Having gained a description of the classes from the "influence" report, you
may want to follow-up to see which classes your favorite cases ended up in.
Conversely, you may want to see which cases belong to a particular class. For
this kind of cross-reference information two complementary reports can be
generated. These are more fully documented in "reports-c.text". The "class"
report, lists all the cases which have significant membership in each class and
the degree to which each such case belongs to that class. Cases whose class
membership is less than 90% in the current class have their other class
membership listed as well. The cases within a class are ordered in increasing
case number. The alternative "cases" report states which class (or classes) a
case belongs to, and the membership probability in the most probable class.
These two reports allow you to find which cases belong to which classes or the
other way around. If nearly every case has close to 99% membership in a single
class, then it means that the classes are well separated, while a high degree
of cross-membership indicates that the classes are heavily overlapped. Highly
overlapped classes are an indication that the idea of classification is
breaking down and that groups of mutually highly overlapped classes, a kind of
meta class, is probably a better way of understanding the data.
7.0 COMPARING INFLUENCE REPORT CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS
------------------------------------------------------------------------------
The class weight given as the class probability parameter, is essentially
the sum over all data instances, of the normalized probability that the
instance is a member of the class. It is probably an error on our part that
we format this number as an integer in the report, rather than emphasizing
its real nature. You will find the actual real value recorded as the w_j
parameter in the class_DS structures on any .results[-bin] file.
The .case and .class reports give probabilities that cases are members of
classes. Any assignment of cases to classes requires some decision rule.
The maximum probability assignment rule is often implicitly assumed, but
it cannot be expected that the resulting partition sizes will equal the
class weights unless nearly all class membership probabilities are effectively
one or zero. With non-1/0 membership probabilities, matching the class
weights requires summing the probabilities.
In addition, there is the question of completeness of the EM (expectation
maximization) convergence. EM alternates between estimating class
parameters and estimating class membership probabilities. These estimates
converge on each other, but never actually meet. AutoClass implements several
convergence algorithms with alternate stopping criteria using appropriate
parameters in the .s-params file. Proper setting of these parameters, to get
reasonably complete and efficient convergence may require experimentation.
8.0 ALTERNATIVE CLASSIFICATIONS
-------------------------------
In summary, the various reports that can be generated give you a way of
viewing the current classification. It is usually a good idea to look at
alternative classifications even though they do not have the minimum Log
probability values. These other classifications usually have classes that
correspond closely to strong classes in other classifications, but can differ
in the weak classes. The "strength" of a class within a classification can
usually be judged by how dramatically the highest influence value attributes
in the class differ from the corresponding global attributes. If none of the
classifications seem quite satisfactory, it is always possible to run AutoClass
again to generate new classifications.
9.0 WHAT NEXT?
--------------
Finally, the question of what to do after you have found an insightful
classification arises. Usually, classification is a preliminary data analysis
step for examining a set of cases (things, examples, etc.) to see if they can
be grouped so that members of the group are "similar" to each other. AutoClass
gives such a grouping without the user having to define a similarity measure.
The built-in "similarity" measure is the mutual predictiveness of the cases.
The next step is to try to "explain" why some objects are more like others than
those in a different group. Usually, domain knowledge suggests an answer. For
example, a classification of people based on income, buying habits, location,
age, etc., may reveal particular social classes that were not obvious before
the classification analysis. To obtain further information about such classes,
further information, such as number of cars, what TV shows are watched, etc.,
would reveal even more information. Longitudinal studies would give
information about how social classes arise and what influences their
attitudes -- all of which is going way beyond the initial classification.
|