File: interpretation-c.text

INTERPRETATION OF AUTOCLASS RESULTS
------------------------------------

1.0 What Have You Got?
2.0 Assumptions
3.0 Influence Report 
4.0 Cross Entropy 
5.0 Attribute Influence Values
6.0 Class And Case Reports
7.0 Comparing Influence Report Class Weights And Class/Case Report Assignments
8.0 Alternative Classifications 
9.0 What Next?


1.0 WHAT HAVE YOU GOT?
----------------------
  Now you have run AutoClass on your data set -- what have you got?  Typically,
the AutoClass search procedure finds many classifications, but only saves the
few best.  These are now available for inspection and interpretation.  The most
important indicator of the relative merits of these alternative classifications
is the Log total posterior probability value.  Note that since the probability
lies between 0 and 1, the corresponding Log probability is negative and ranges
from 0 to negative infinity.  Raising e to the power of the difference between
these Log probability values gives the relative probability of the alternative
classifications.  So a difference of, say, 100 implies that one classification
is e^100 ~= 10^43 times more likely than the other.  However, these numbers can
be very misleading, since they give the relative probability of alternative
classifications under the AutoClass ***assumptions***.
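
  As a concrete illustration of this arithmetic, consider a minimal sketch
(not part of AutoClass; the Log posterior values are invented for the
example) that computes the relative probability directly:

    /* Hypothetical comparison of two saved classifications by their
       Log total posterior probability values.  The values below are
       invented; real ones come from the AutoClass reports. */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
      double log_p_a = -12345.0;  /* assumed Log posterior, classification A */
      double log_p_b = -12445.0;  /* assumed Log posterior, classification B */
      double diff = log_p_a - log_p_b;                 /* here 100.0 */

      /* The relative probability is e^diff; dividing the exponent by
         ln(10) converts it to base 10: about 10^43 for diff = 100. */
      printf("A is e^%.1f (about 10^%.0f times) more probable than B\n",
             diff, diff / log(10.0));
      return 0;
    }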

2.0 ASSUMPTIONS
---------------
  Specifically, the most important AutoClass assumptions are the use of normal
models for real variables, and the assumption of independence of attributes
within a class.  Since these assumptions are often violated in practice, the
difference in posterior probability of alternative classifications can be
partly due to one classification being closer to satisfying the assumptions
than another, rather than to a real difference in classification quality.
Another source of uncertainty about the utility of Log probability values is 
that they do not take into account any specific prior knowledge the user may 
have about the domain.  This means that it is often worth looking at
alternative classifications to see if you can interpret them, starting with
the most probable.  Note that if the Log probability value is much greater
than that for the one-class case, there is overwhelming evidence for
***some*** structure in the data, and part of this structure has been
captured by the AutoClass classification.

3.0 INFLUENCE REPORT 
--------------------
  So you have now picked a classification you want to examine, based on its Log
probability value; how do you examine it?  The first thing to do is to 
generate an "influence" report on the classification using the report 
generation facilities documented in "autoclass-c/doc/reports-c.text".   An
influence report is designed to summarize the important information buried 
in the AutoClass data structures.  

  The first part of this report gives the heuristic class "strengths".
Class "strength" is here defined as the geometric mean probability that
any instance "belonging to" the class would have been generated from the
class probability model.  It thus provides a heuristic measure of how
strongly each class predicts "its" instances.
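
  For readers who want that definition in code form, the following minimal
sketch (a hypothetical helper, not AutoClass's internal routine) computes a
class strength as the exponential of the mean log probability of the class's
instances under the class model:

    /* Hypothetical sketch of class "strength": the geometric mean of
       P(instance | class model) over the instances "belonging to" the
       class.  Assumes n_instances > 0 and strictly positive
       probabilities in inst_prob[]. */
    #include <math.h>

    double class_strength(const double *inst_prob, int n_instances)
    {
      double sum_log = 0.0;
      int i;

      for (i = 0; i < n_instances; i++)
        sum_log += log(inst_prob[i]);      /* accumulate log probabilities */
      return exp(sum_log / n_instances);   /* geometric mean probability */
    }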

  The second part is a listing of the overall "influence" of each of the
attributes used in the classification.  These give a rough heuristic
measure of the relative importance of each attribute in the
classification.  Attribute "influence values" are a class probability
weighted average of the "influence" of each attribute in the classes, as
described below.

  The next part of the report is a summary description of each of the
classes.  The classes are arbitrarily numbered from 0 up to n, in order
of descending class weight.  A class weight of, say, 34.1 means that the
weighted sum of membership probabilities for that class is 34.1.  Note
that a class weight of 34 does not necessarily mean that 34 cases belong
to that class, since many cases may have only partial membership in that
class.  Within each class, attributes or attribute sets are ordered by
the "influence" of their model term.

4.0 CROSS ENTROPY 
-----------------
  A commonly used measure of the divergence between two probability
distributions is the cross entropy: the sum, over all possible values x,
of P(x|c...)*log[P(x|c...)/P(x|g...)], where c... and g... define the two
distributions.  It ranges from zero, for identical distributions, to
infinity, for distributions placing probability 1 on differing values of
an attribute.  With conditionally independent terms in the probability
distributions, the cross entropy can be factored into a sum over these
terms.  These factors provide a measure of the corresponding modeled
attribute's influence in differentiating the two distributions.

  We define the modeled term's "influence" on a class to be the cross
entropy term for the class distribution w.r.t. the global class
distribution of the single class classification.  "Influence" is thus a
measure of how strongly the model term helps differentiate the class
from the whole data set.  With independently modeled attributes, the
influence can legitimately be ascribed to the attribute itself.  With
correlated or covariant attribute sets, the cross entropy factor is a
function of the entire set, and we distribute the influence value
equally over the modeled attributes.
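
  As a sketch of this computation for a single discrete attribute (a
hypothetical helper, not the code AutoClass itself uses), the influence term
is the cross entropy of the class distribution with respect to the global
distribution:

    /* Hypothetical sketch: the cross entropy (influence) term for one
       discrete attribute, comparing the class outcome distribution
       p_class[] against the global single-class distribution
       p_global[].  Assumes both arrays sum to 1 and p_global[i] > 0. */
    #include <math.h>

    double influence_term(const double *p_class, const double *p_global,
                          int n_outcomes)
    {
      double sum = 0.0;
      int i;

      for (i = 0; i < n_outcomes; i++)
        if (p_class[i] > 0.0)              /* 0 * log(0/q) contributes 0 */
          sum += p_class[i] * log(p_class[i] / p_global[i]);
      return sum;  /* zero only when the two distributions are identical */
    }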

5.0 ATTRIBUTE INFLUENCE VALUES
------------------------------
  In the "influence" report on each class, the attribute parameters for that 
class are given in order of highest influence value for the model term
attribute sets.  Only the first few attribute sets usually have 
significant influence values.  If an influence value drops below about 20% of 
the highest value, then it is probably not significant, but all attribute sets 
are listed for completeness.  In addition to the influence value for each
attribute set, the values of the attribute set parameters in that class are given
along with the corresponding "global" values.  The global values are computed
directly from the data independent of the classification.  For example, if the
class mean of attribute "temperature" is 90 with standard deviation of 2.5, but
the global mean is 68 with a standard deviation of 16.3, then this class has
selected out cases with much higher than average temperature, and a rather
small spread in this high range.  Similarly, for discrete attribute sets, the
probability of each outcome in that class is given, along with the
corresponding global probability -- ordered by its significance: the
absolute value of (log {<local-probability> / <global-probability>}).  The
sign of the significance value shows the direction of change from the global
class.  This information gives an overview of how each class differs from the
average for all the data, in order of the most significant differences.
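
  The per-outcome significance value is simple enough to state in code.  The
following helper is a hypothetical illustration of the quantity described
above, not a routine taken from AutoClass:

    /* Hypothetical sketch of the per-outcome "significance" used to
       order discrete outcomes in the influence report.  A positive
       value means the outcome is more probable in the class than
       globally; a negative value means less probable.  Outcomes are
       listed by the absolute value of this quantity. */
    #include <math.h>

    double outcome_significance(double local_prob, double global_prob)
    {
      return log(local_prob / global_prob);
    }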

6.0 CLASS AND CASE REPORTS
--------------------------
  Having gained a description of the classes from the "influence" report, you
may want to follow up to see which classes your favorite cases ended up in.
Conversely, you may want to see which cases belong to a particular class.  For
this kind of cross-reference information, two complementary reports can be
generated.  These are more fully documented in "reports-c.text".  The "class"
report lists all the cases which have significant membership in each class and
the degree to which each such case belongs to that class.  Cases whose
membership in the current class is less than 90% have their other class
memberships listed as well.  The cases within a class are ordered by
increasing case number.  The alternative "cases" report states which class
(or classes) a case belongs to, and the membership probability in the most
probable class.  These two reports allow you to find which cases belong to
which classes, or the other way around.  If nearly every case has close to
99% membership in a single class, the classes are well separated, while a
high degree of cross-membership indicates that the classes are heavily
overlapped.  Highly overlapped classes are an indication that the idea of
classification is breaking down, and that a group of mutually highly
overlapped classes, a kind of meta-class, is probably a better way of
understanding the data.


7.0 COMPARING INFLUENCE REPORT CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS
------------------------------------------------------------------------------
  The class weight, given as the class probability parameter, is essentially
the sum, over all data instances, of the normalized probability that the
instance is a member of the class.  It is probably an error on our part that
we format this number as an integer in the report, rather than emphasizing
its real nature.  You will find the actual real value recorded as the w_j
parameter in the class_DS structures in any .results[-bin] file.

  The .case and .class reports give probabilities that cases are members of 
classes.  Any assignment of cases to classes requires some decision rule. 
The maximum probability assignment rule is often implicitly assumed, but
it cannot be expected that the resulting partition sizes will equal the 
class weights unless nearly all class membership probabilities are effectively 
one or zero.  With non-1/0 membership probabilities, matching the class 
weights requires summing the probabilities.
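
  The distinction can be made concrete with a small sketch.  The function
below is hypothetical (memb[i][j] is assumed to hold the normalized
membership probability of case i in class j); it contrasts a class weight
with the partition size under the maximum probability assignment rule:

    /* Hypothetical sketch contrasting the class weight (the sum of
       membership probabilities, recorded as w_j) with the number of
       cases assigned to class j by the maximum probability rule. */
    #include <stdio.h>

    void weight_vs_count(double **memb, int n_cases, int n_classes, int j)
    {
      double weight = 0.0;
      int count = 0, i, k;

      for (i = 0; i < n_cases; i++) {
        int best = 0;
        for (k = 1; k < n_classes; k++)
          if (memb[i][k] > memb[i][best])
            best = k;
        weight += memb[i][j];              /* fractional membership */
        if (best == j)
          count++;                         /* hard assignment */
      }
      /* The two agree only when memberships are effectively 0 or 1. */
      printf("class %d: weight %.2f, hard-assigned cases %d\n",
             j, weight, count);
    }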

  In addition, there is the question of completeness of the EM (expectation
maximization) convergence.  EM alternates between estimating class 
parameters and estimating class membership probabilities.  These estimates 
converge on each other, but never actually meet.  AutoClass implements several 
convergence algorithms with alternate stopping criteria, using appropriate
parameters in the .s-params file.  Proper setting of these parameters, to get
reasonably complete and efficient convergence, may require experimentation.


8.0 ALTERNATIVE CLASSIFICATIONS 
-------------------------------
  In summary, the various reports that can be generated give you a way of
viewing the current classification.  It is usually a good idea to look at
alternative classifications, even though they do not have the highest Log
probability values.  These other classifications usually have classes that
correspond closely to strong classes in other classifications, but can differ 
in the weak classes.  The "strength" of a class within a classification can 
usually be judged by how dramatically the highest influence value attributes 
in the class differ from the corresponding global attributes.  If none of the
classifications seem quite satisfactory, it is always possible to run AutoClass
again to generate new classifications.

9.0 WHAT NEXT?
--------------
  Finally, there is the question of what to do after you have found an
insightful classification.  Usually, classification is a preliminary data
analysis step for examining a set of cases (things, examples, etc.) to see if
they can be grouped so that members of a group are "similar" to each other.
AutoClass gives such a grouping without the user having to define a
similarity measure.  The built-in "similarity" measure is the mutual
predictiveness of the cases.  The next step is to try to "explain" why some
objects are more like each other than those in a different group.  Usually,
domain knowledge suggests an answer.  For example, a classification of people
based on income, buying habits, location, age, etc., may reveal particular
social classes that were not obvious before the classification analysis.
Collecting further information about such classes, such as the number of cars
owned or which TV shows are watched, would reveal even more.  Longitudinal
studies would give information about how social classes arise and what
influences their attitudes -- all of which goes well beyond the initial
classification.