## File: interpretation-c.text

package info (click to toggle)
autoclass 3.3.6.dfsg.1-2
 `123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200` ``````INTERPRETATION OF AUTOCLASS RESULTS ------------------------------------ 1.0 What Have You Got? 2.0 Assumptions 3.0 Influence Report 4.0 Cross Entropy 5.0 Attribute Influence Values 6.0 Class And Case Reports 7.0 Comparing Influence Report Class Weights And Class/Case Report Assignments 8.0 Alternative Classifications 9.0 What Next? 1.0 WHAT HAVE YOU GOT? ---------------------- Now you have run AutoClass on your data set -- what have you got? Typically, the AutoClass search procedure finds many classifications, but only saves the few best. These are now available for inspection and interpretation. The most important indicator of the relative merits of these alternative classifications is Log total posterior probability value. Note that since the probability lies between 1 and 0, the corresponding Log probability is negative and ranges from 0 to negative infinity. The difference between these Log probability values raised to the power e gives the relative probability of the alternatives classifications. So a difference of, say 100, implies one classification is e^100 ~= 10^43 more likely than the other. However, these numbers can be very misleading, since they give the relative probability of alternative classifications under the AutoClass ***assumptions***. 2.0 ASSUMPTIONS --------------- Specifically, the most important AutoClass assumptions are the use of normal models for real variables, and the assumption of independence of attributes within a class. Since these assumptions are often violated in practice, the difference in posterior probability of alternative classifications can be partly due to one classification being closer to satisfying the assumptions than another, rather than to a real difference in classification quality. Another source of uncertainty about the utility of Log probability values is that they do not take into account any specific prior knowledge the user may have about the domain. This means that it is often worth looking at alternative classifications to see if you can interpret them, but it is worth starting from the most probable first. Note that if the Log probability value is much greater than that for the one class case, it is saying that there is overwhelming evidence for ***some*** structure in the data, and part of this structure has been captured by the AutoClass classification. 3.0 INFLUENCE REPORT -------------------- So you have now picked a classification you want to examine, based on its Log probability value; how do you examine it? The first thing to do is to generate an "influence" report on the classification using the report generation facilities documented in "autoclass-c/doc/reports-c.text". An influence report is designed to summarize the important information buried in the AutoClass data structures. The first part of this report gives the heuristic class "strengths". Class "strength" is here defined as the geometric mean probability that any instance "belonging to" class, would have been generated from the class probability model. It thus provides a heuristic measure of how strongly each class predicts "its" instances. The second part is a listing of the overall "influence" of each of the attributes used in the classification. These give a rough heuristic measure of the relative importance of each attribute in the classification. Attribute "influence values" are a class probability weighted average of the "influence" of each attribute in the classes, as described below. The next part of the report is a summary description of each of the classes. The classes are arbitrarily numbered from 0 up to n, in order of descending class weight. A class weight of say 34.1 means that the weighted sum of membership probabilities for class is 34.1. Note that a class weight of 34 does not necessarily mean that 34 cases belong to that class, since many cases may have only partial membership in that class. Within each class, attributes or attribute sets are ordered by the "influence" of their model term. 4.0 CROSS ENTROPY ----------------- A commonly used measure of the divergence between two probability distributions is the cross entropy: the sum over all possible values x, of P(x|c...)*log[P(x|c...)/P(x|g...)], where c... and g... define the distributions. It ranges from zero, for identical distributions, to infinite for distributions placing probability 1 on differing values of an attribute. With conditionally independent terms in the probability distributions, the cross entropy can be factored to a sum over these terms. These factors provide a measure of the corresponding modeled attribute's influence in differentiating the two distributions. We define the modeled term's "influence" on a class to be the cross entropy term for the class distribution w.r.t. the global class distribution of the single class classification. "Influence" is thus a measure of how strongly the model term helps differentiate the class from the whole data set. With independently modeled attributes, the influence can legitimately be ascribed to the attribute itself. With correlated or covariant attributes sets, the cross entropy factor is a function of the entire set, and we distribute the influence value equally over the modeled attributes. 5.0 ATTRIBUTE INFLUENCE VALUES ------------------------------ In the "influence" report on each class, the attribute parameters for that class are given in order of highest influence value for the model term attribute sets. Only the first few attribute sets usually have significant influence values. If an influence value drops below about 20% of the highest value, then it is probably not significant, but all attribute sets are listed for completeness. In addition to the influence value for each attribute set, the values of the attribute set parameters in that class are given along with the corresponding "global" values. The global values are computed directly from the data independent of the classification. For example, if the class mean of attribute "temperature" is 90 with standard deviation of 2.5, but the global mean is 68 with a standard deviation of 16.3, then this class has selected out cases with much higher than average temperature, and a rather small spread in this high range. Similarly, for discrete attribute sets, the probability of each outcome in that class is given, along with the corresponding global probability -- ordered by its significance: the absolute value of (log { / }). The sign of the significance value shows the direction of change from the global class. This information gives an overview of how each class differs from the average for all the data, in order of the most significant differences. 6.0 CLASS AND CASE REPORTS -------------------------- Having gained a description of the classes from the "influence" report, you may want to follow-up to see which classes your favorite cases ended up in. Conversely, you may want to see which cases belong to a particular class. For this kind of cross-reference information two complementary reports can be generated. These are more fully documented in "reports-c.text". The "class" report, lists all the cases which have significant membership in each class and the degree to which each such case belongs to that class. Cases whose class membership is less than 90% in the current class have their other class membership listed as well. The cases within a class are ordered in increasing case number. The alternative "cases" report states which class (or classes) a case belongs to, and the membership probability in the most probable class. These two reports allow you to find which cases belong to which classes or the other way around. If nearly every case has close to 99% membership in a single class, then it means that the classes are well separated, while a high degree of cross-membership indicates that the classes are heavily overlapped. Highly overlapped classes are an indication that the idea of classification is breaking down and that groups of mutually highly overlapped classes, a kind of meta class, is probably a better way of understanding the data. 7.0 COMPARING INFLUENCE REPORT CLASS WEIGHTS AND CLASS/CASE REPORT ASSIGNMENTS ------------------------------------------------------------------------------ The class weight given as the class probability parameter, is essentially the sum over all data instances, of the normalized probability that the instance is a member of the class. It is probably an error on our part that we format this number as an integer in the report, rather than emphasizing its real nature. You will find the actual real value recorded as the w_j parameter in the class_DS structures on any .results[-bin] file. The .case and .class reports give probabilities that cases are members of classes. Any assignment of cases to classes requires some decision rule. The maximum probability assignment rule is often implicitly assumed, but it cannot be expected that the resulting partition sizes will equal the class weights unless nearly all class membership probabilities are effectively one or zero. With non-1/0 membership probabilities, matching the class weights requires summing the probabilities. In addition, there is the question of completeness of the EM (expectation maximization) convergence. EM alternates between estimating class parameters and estimating class membership probabilities. These estimates converge on each other, but never actually meet. AutoClass implements several convergence algorithms with alternate stopping criteria using appropriate parameters in the .s-params file. Proper setting of these parameters, to get reasonably complete and efficient convergence may require experimentation. 8.0 ALTERNATIVE CLASSIFICATIONS ------------------------------- In summary, the various reports that can be generated give you a way of viewing the current classification. It is usually a good idea to look at alternative classifications even though they do not have the minimum Log probability values. These other classifications usually have classes that correspond closely to strong classes in other classifications, but can differ in the weak classes. The "strength" of a class within a classification can usually be judged by how dramatically the highest influence value attributes in the class differ from the corresponding global attributes. If none of the classifications seem quite satisfactory, it is always possible to run AutoClass again to generate new classifications. 9.0 WHAT NEXT? -------------- Finally, the question of what to do after you have found an insightful classification arises. Usually, classification is a preliminary data analysis step for examining a set of cases (things, examples, etc.) to see if they can be grouped so that members of the group are "similar" to each other. AutoClass gives such a grouping without the user having to define a similarity measure. The built-in "similarity" measure is the mutual predictiveness of the cases. The next step is to try to "explain" why some objects are more like others than those in a different group. Usually, domain knowledge suggests an answer. For example, a classification of people based on income, buying habits, location, age, etc., may reveal particular social classes that were not obvious before the classification analysis. To obtain further information about such classes, further information, such as number of cars, what TV shows are watched, etc., would reveal even more information. Longitudinal studies would give information about how social classes arise and what influences their attitudes -- all of which is going way beyond the initial classification. ``````