GENERAL INFORMATION FOR AUTOCLASS C
-------------------------------------------------------------------------------
CONTENTS:
What Is Autoclass
What Is Autoclass III
What Is Autoclass X
What Is Autoclass C
Update History
Compatibility & Porting Considerations
Limitations
Building The Autoclass C System
Use Of The Autoclass C System
Theoretical Questions
Technical Questions
Implementation Questions
References
WHAT IS AUTOCLASS:
AutoClass is an unsupervised Bayesian classification system that seeks a
maximum posterior probability classification.
Key features:
- determines the number of classes automatically;
- can use mixed discrete and real valued data;
- can handle missing values;
- processing time is roughly linear in the amount of the data;
- cases have probabilistic class membership;
- allows correlation between attributes within a class;
- generates reports describing the classes found; and
- predicts "test" case class memberships from a "training"
classification.
Inputs consist of a database of attribute vectors (cases), either real or
discrete valued, and a class model. Default class models are provided.
AutoClass finds the set of classes that is maximally probable with respect
to the data and model. The output is a set of class descriptions, and
partial membership of the cases in the classes.
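As an illustration of what "partial membership" means, the following is a minimal sketch, not the AutoClass C source. It assumes a simple multinomial model in which every discrete attribute has the same number of possible values and attributes are independent within a class. The function name class_memberships and the flat array layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compute normalized class membership weights for one discrete case.
 *   class_prob[j]  : relative weight of class j
 *   value_prob[(j * n_attrs + k) * n_values + v]
 *                  : probability that attribute k takes value v in class j
 *   case_values[k] : observed value index of attribute k
 *   weights[j]     : output, probability that the case belongs to class j
 * Returns 0 on success, -1 if the case has zero likelihood in every class.
 */
int class_memberships(size_t n_classes, size_t n_attrs, size_t n_values,
                      const double *class_prob, const double *value_prob,
                      const size_t *case_values, double *weights)
{
    double total = 0.0;
    size_t j, k;

    assert(n_classes > 0 && n_values > 0);

    for (j = 0; j < n_classes; j++) {
        double joint = class_prob[j];
        for (k = 0; k < n_attrs; k++) {
            assert(case_values[k] < n_values);
            joint *= value_prob[(j * n_attrs + k) * n_values + case_values[k]];
        }
        weights[j] = joint;
        total += joint;
    }
    if (total <= 0.0)
        return -1;

    /* Normalize so that the weights over all classes sum to 1. */
    for (j = 0; j < n_classes; j++)
        weights[j] /= total;
    return 0;
}
```

With two equally weighted classes that assign probabilities 0.9 and 0.3 to the observed value, the case belongs to the first class with weight 0.75 and to the second with weight 0.25.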
For more details see "Bayesian Classification (AutoClass): Theory and
Results" (kdd-95.ps in ~/autoclass-c/doc/) and "Bayesian Classification
Theory" (tr-fia-90-12-7-01.ps in ~/autoclass-c/doc/). A list of
references is included below.
WHAT IS AUTOCLASS III:
AutoClass III, programmed in Common Lisp, is the official released
implementation of AutoClass available from COSMIC (NASA's software
distribution agency):
COSMIC
University of Georgia
382 East Broad Street
Athens, GA 30602 USA
voice: (706) 542-3265 fax: (706) 542-4807
telex: 41- 190 UGA IRC ATHENS
e-mail: cosmic@uga.bitnet or service@cossack.cosmic.uga.edu
Request "AutoClass III - Automatic Class Discovery from Data (ARC-13180)".
WHAT IS AUTOCLASS X:
AutoClass X is an experimental extension to AutoClass III, available
only domestically, by means of a non-disclosure agreement. It
implements hierarchical classification where attributes are
associated with appropriate levels of the class hierarchy. The
search methodology is currently in development. It is implemented
in Common Lisp. Contact Will Taylor (taylor@ptolemy.arc.nasa.gov).
WHAT IS AUTOCLASS C:
AutoClass C is a publicly available implementation of AutoClass
III, with some improvements from AutoClass X, done in the C
language. It was programmed by Dr. Diane Cook
(cook@centauri.uta.edu) and Joseph Potts (potts@cse.uta.edu) of
the University of Texas at Arlington. Will Taylor
(taylor@ptolemy.arc.nasa.gov) "productized" the software through
extensive testing, addition of sample data bases, and re-working
the user documentation.
Significant new features of the C implementation are:
- it is about 10-20 times faster than the Lisp implementations:
AutoClass III & AutoClass X;
- it uses double precision floating point for its "inner loop"
weight calculations, producing a higher "signal-to-noise"
ratio than the Lisp versions, and thus more precise
convergences for very large data sets (adding double precision
to the Lisp versions would slow them down even more).
It provides four models:
single_multinomial - discrete attribute multinomial model,
including missing values.
single_normal - real valued attribute model without
missing values; sub-types: location and scalar.
single_normal_missing - real valued attribute model with missing
values; sub-types: location and scalar.
multi_normal - real valued covariant normal model without
missing values.
Additional models were done in Lisp for AutoClass X, and may be
implemented in C at some later time. These models are:
single_multinomial_ignore - discrete attribute multinomial model,
ignoring missing values.
single_poisson - models low value count (integer) attributes
as Poisson distributions.
multi_multinomial_dense - a dense covariant multinomial model.
multi_multinomial_sparse - a sparse covariant multinomial model.
The C implementation also does not provide single_multinomial model
value translations or canonical model group/attribute ordering.
UPDATE HISTORY:
Version: 1.0 15 Apr 95 initial version of AutoClass C
Version: 1.5 08 May 95 ported to Sun Solaris 2.4; corrected string
overwrite problems; compilation of file
search-control.c is now optimized; & added binary data file input
option. (See "autoclass-c/version-1-5.text")
Version: 2.0 08 Jun 95 ported to SGI IRIX version 5.2; converted
binary i/o from non-standard (open/close/
read/write) to ANSI (fopen/fclose/fread/fwrite); converted from
srand/rand to srand48/lrand48 for random number generation; added
prediction capability which uses a "training" classification to
predict probabilistic class membership for the cases of a "test"
data file; added new ".s-params" parameter "screen_output_p"; added
output of real and discrete attribute statistics when data base is
initially read; corrected garbage output when ".r-params" parameter
"xref_class_report_att_list" contains mixed real and discrete
attributes; corrected the handling of unknown real values in reports
output; and corrected an error in function "output_warning_msgs"
which caused an abort condition. (See "autoclass-c/version-2-0.text")
Version: 2.5 28 Jul 95 Influence values report has been
significantly revised and reformatted;
add SunOS/Solaris C compiler support; correct segmentation fault
which occurs when more than 25 type = real, subtype = scalar
attributes are defined; correct "LOG domain" errors in generation
of influence values for model "single_multinomial"; and added mods
for port to Linux operating system using gcc compiler. (See
"autoclass-c/version-2-5.text")
Version: 2.6 02 Aug 95 Correct segmentation fault which occurs
when more than 50 type = real, subtype =
scalar attributes are defined; add function safe_log to prevent
"log: SING error" error messages; and require user to confirm
search runs using test settings for .s-params file parameters:
start_fn_type and randomize_random_p. (See
"autoclass-c/version-2-6.text")
Version: 2.7 16 Aug 95 Add search parameter to allow AutoClass
to be run as a background task. (See
"autoclass-c/version-2-7.text")
Version: 2.8 03 Sep 96 Add search parameter "read_compact_p",
which directs AutoClass to read the "results" and "checkpoint"
files in either binary format or ascii format; redefine make
files with -I and -L parameters for SunOS 4.1.3; change make
file naming conventions; prevent corruption of discrete data
translation tables when translations are longer than 40
characters; increase from 3000 to 20000 the value of
VERY_LONG_STRING_LENGTH to handle very large datum lines;
increase DATA_ALLOC_INCREMENT from 100 to 1000 for reading very
large datasets; add DATA_ALLOC_INCREMENT logic of READ_DATA
to XREF_GET_DATA -- this will prevent segmentation faults
encountered when reading very large .db2 files into the
reports processing function of AutoClass; in
FORMAT_DISCRETE_ATTRIBUTE, do not process attributes with
warning or error messages -- this prevents segmentation faults;
in XREF_GET_DATA, free database allocated memory after it is
transferred into report data structures -- this reduces the
amount of memory required when generating reports for very
large data bases, and prevents running out of memory; in all
functions calling malloc/realloc for dynamic memory allocation,
checks have been added to notify the user if memory is exhausted;
and port the "make" file for HP-UX operating system using the
bundled "cc" compiler. (See "autoclass-c/version-2-8.text")
Version: 2.9 21 Oct 96 Correct bugs which occur when generating
reports of discrete type data -- these were introduced in version
2.8. Added new parameter for both ".s-params" & ".r-params"
files: break_on_warnings_p. (See "autoclass-c/version-2-9.text")
Version: 3.0 15 Apr 97 New parameter for .r-params files:
report_mode -- "text" (current report output) or "data"
(parsable format for further processing); correct minor bugs;
improve input checking for .hd2 file; correct segmentation
fault which occurred in prediction runs when the size of the
"test" database was larger than that of the "training"
database; and new parameter for .s-params & .r-params files:
free_storage_p. (See "autoclass-c/version-3-0.text")
Version: 3.1 04 Jul 97 New parameters for .r-params files:
comment_data_headers_p, max_num_xref_class_probs,
start_sigma_contours_att, & stop_sigma_contours_att. Allow
checkpoint files to be loaded for reconvergence. Allow
reports to be generated for data sets of 100,000 cases and
more, without causing a segmentation fault. For "-predict"
runs, handle "test" cases which are not predicted to be in
any of the "training" classes. When there is more than one
covariant normal correlation matrix, print all of them.
In the case cross-reference report (report_type = "xref_case")
generated with the data option (report_mode = "data"), other class
probabilities are now printed. In the case and class cross-
reference reports, the print out of probabilities has increased
by one significant digit (0.04 => 0.041), and the minimum value
printed is now 0.001, rather than 0.01. Add capability to
compute sigma class contour values for specified pairs of
real valued attributes. (See "autoclass-c/version-3-1.text")
Version: 3.2 13 Apr 98 Changed the behavior of search
parameter force_new_search_p; amplified some documentation
sections; corrected several segmentation faults in reports
generation; corrected several errors in sigma contours output;
correct problem with cross-reference reports class assignment
when there are more than five marginal probabilities; change
layout of influence values report to print matrices after all
class attributes are listed; warn user when default start_j_list
may not find the correct number of classes in data set; warn
user of search trials which do not converge and print
convergence summary at the end of each run; the multi-normal
model was corrected to prevent oscillation in the expectation
maximization calculations; and allow non-contiguous groups of
attributes to be specified for sigma contours calculations.
(See "autoclass-c/version-3-2.text")
Version: 3.2.1 04 Jun 98 Minor documentation changes. (See
"autoclass-c/version-3-2-1.text")
Version: 3.2.2 02 Jul 98 Minor documentation changes. (See
"autoclass-c/version-3-2-2.text")
Version: 3.3 23 Sep 98 Integrated source port of version
3.2.2 to Windows NT/95. Update sample AutoClass C run files
contained in autoclass-c/sample. (See
"autoclass-c/version-3-3.text")
Version: 3.3.1 30 Nov 98 Correct incompatibility with
.results[-bin] files written by AutoClass C versions prior
to version 3.3. (See "autoclass-c/version-3-3-1.text")
Version: 3.3.2 13 Sep 99 In all situations warning and error
messages are now written to the log file. (See
"autoclass-c/version-3-3-2.text")
COMPATIBILITY & PORTING CONSIDERATIONS:
AutoClass C was written in ANSI C using the GNU gcc compiler
version 2.6.3 running on a SunSparc under SunOS 4.1.3.
It has also been ported to and tested on:
- SunSparc under Solaris 2.4 using SPARCompiler C version 3.00;
- SunSparc under SunOS 4.1.3 using SPARCompiler C version 3.00;
- SGI Indigo under IRIX 5.2 using the bundled cc compiler;
- Linux version 1.2.10, GCC version 2.5.8, libc version
4.6.25;
- HP9000/735 & HP9000/C110 under HPUX 10.10 using the bundled
cc compiler;
- Windows NT/95 using the Microsoft Visual C++ 5.0 compiler.
Considerations for porting to other platforms, operating systems,
and compilers:
- int & float types must be at least 32 bit words
- floating point arithmetic must be IEEE standard
- values.h constant #defines are not consistent with IEEE standard --
used Symbolics Genera 8.3 values in autoclass.h
- globals.c, io-results.c, & search-control-2.c:
G_safe_file_writing_p = TRUE; only supported under Unix,
since it does system calls to mv (rename file) and rm (delete
file).
- utils.c: char_input_test -- which implements the typing of 'q'
and <return> to quit the search -- uses Unix system call fcntl,
and file fcntlcom-ac.h; get_universal_time -- uses Unix system
call time.
- init.c: init -- uses Unix system call getcwd (get current working
directory); sets "normalizer" value for random number generator
library function "srand48".
- search-control.c, search-basic.c, search-control-2.c, & utils.c:
Use C library functions srand48/lrand48 for random number
generation.
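Before porting, the first two considerations above can be checked with a few lines of standard C. This is a hedged sketch, not part of the AutoClass C distribution: the function names are hypothetical, and the <float.h> tests are only a necessary indication of IEEE arithmetic (radix 2 with 24-bit and 53-bit significands), not a proof of full conformance.

```c
#include <assert.h>
#include <float.h>
#include <limits.h>

/* Return 1 if the basic porting assumptions appear to hold, else 0. */
int porting_assumptions_hold(void)
{
    /* int and float must each be at least 32 bits wide. */
    if (sizeof(int) * CHAR_BIT < 32)
        return 0;
    if (sizeof(float) * CHAR_BIT < 32)
        return 0;

    /* IEEE single and double precision use radix 2 with
       24-bit and 53-bit significands respectively. */
    if (FLT_RADIX != 2 || FLT_MANT_DIG != 24 || DBL_MANT_DIG != 53)
        return 0;

    return 1;
}

/* Abort early (in builds without NDEBUG) on an unsuitable platform. */
void require_porting_assumptions(void)
{
    assert(porting_assumptions_hold());
}
```

A port could call require_porting_assumptions() once at startup, so that an unsuitable platform fails immediately instead of producing subtly wrong results.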
LIMITATIONS:
AutoClass C is limited by memory requirements that are roughly in
proportion to the number of data, times the number of attributes (the
data space); plus the number of classes, times number of modeled
attributes (the model space); plus a fixed program space. Thus there
should be no limit on the number of attributes beyond the program
addressable memory, but there are definite tradeoffs with respect to
the model space, and performance degradations as paging requirements
increase.
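The proportionality described above can be written down as a rough estimator. This is only a sketch: the function name is hypothetical, and the three cost parameters (bytes per data value, bytes per class/attribute term, and fixed program space) are caller-supplied guesses, not constants taken from AutoClass C.

```c
#include <assert.h>

/*
 * Rough memory estimate in bytes:
 *   data space  = n_data    * n_attrs         * bytes_per_value
 *   model space = n_classes * n_modeled_attrs * bytes_per_term
 *   plus a fixed program space.
 * Counts are passed as doubles to avoid integer overflow on large data sets.
 */
double estimate_memory_bytes(double n_data, double n_attrs,
                             double n_classes, double n_modeled_attrs,
                             double bytes_per_value, double bytes_per_term,
                             double fixed_bytes)
{
    double data_space, model_space;

    assert(n_data >= 0.0 && n_attrs >= 0.0);
    assert(n_classes >= 0.0 && n_modeled_attrs >= 0.0);

    data_space = n_data * n_attrs * bytes_per_value;
    model_space = n_classes * n_modeled_attrs * bytes_per_term;
    return data_space + model_space + fixed_bytes;
}
```

Doubling the number of cases doubles only the data-space term, which is why the data space usually dominates for large data sets.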
For very large data sets, you may well find that even if you can handle
the data, the processing time is excessive. If that is the case, it may
be worthwhile to try class generation on random subsets of the data set.
This should pick out the major classes, although it will miss small
ones that are only vaguely represented in the random subsets. You can
then switch to prediction mode to classify the entire data set.
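One way to draw such a random subset is sketched below. It is not part of the AutoClass C distribution. It assumes that the data file holds one case per line with no header or comment lines that must be preserved, and the function name sample_cases is hypothetical. It uses the POSIX functions srand48 and drand48, from the same family as the lrand48 call that AutoClass C itself uses.

```c
#define _XOPEN_SOURCE 500   /* expose srand48/drand48 in <stdlib.h> */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy each line of `in` to `out` with probability `fraction`.
 * Lines longer than the buffer are handled in chunks; the keep/skip
 * decision is made once, at the start of each line.
 * Returns the number of lines kept, or -1 on a write error.
 */
long sample_cases(FILE *in, FILE *out, double fraction, long seed)
{
    char buf[4096];
    long kept = 0;
    int at_line_start = 1;
    int keep = 0;

    assert(in != NULL && out != NULL);
    assert(fraction >= 0.0 && fraction <= 1.0);
    srand48(seed);

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);

        if (at_line_start) {
            keep = drand48() < fraction;   /* drand48 returns [0.0, 1.0) */
            if (keep)
                kept++;
        }
        if (keep && fputs(buf, out) == EOF)
            return -1;
        at_line_start = (len > 0 && buf[len - 1] == '\n');
    }
    return kept;
}
```

The subset file can then be used for the search run, and the full data set classified afterwards in prediction mode as described above.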
BUILDING THE AUTOCLASS C SYSTEM -- UNIX PLATFORMS
Assuming that "." is not in $PATH --
% cd ~/autoclass-c # or equivalent
% chmod u+x load-ac # if you have not already done so
% ./load-ac
{ Which compiler, GNU(gcc) or SunOS(acc)? - {gcc|acc}: }
{ Which compiler, GNU(gcc) or Solaris(cc)? - {gcc|cc}: }
{ no prompt if SGI or Linux }
<compiler and linker messages>
% ./autoclass-c/autoclass # show autoclass options
AutoClass Search:
% ./autoclass -search <.db2[-bin] file path> <.hd2 file path>
<.model file path> <.s-params file path>
AutoClass Reports:
% ./autoclass -reports <.results[-bin] file path> <.search file path>
<.r-params file path>
AutoClass Prediction:
% ./autoclass -predict <test.. .db2 file path>
<training.. .results[-bin] file path>
<training.. .search file path> <training.. .r-params file path>
BUILDING THE AUTOCLASS C SYSTEM -- WINDOWS PLATFORMS
Use Microsoft Visual C++ 5.0 Developer Studio to build Autoclass.exe
File->Open Workspace: f:\autoclass-c-win\prog\AutoclassC.dsw
Build->Build Autoclass.exe
f:\autoclass-c-win> copy prog\Debug\Autoclass.exe .
f:\autoclass-c-win> Autoclass.exe # show autoclass options
AutoClass Search:
f:\autoclass-c-win> Autoclass.exe -search <.db2[-bin] file path> <.hd2 file path>
<.model file path> <.s-params file path>
AutoClass Reports:
f:\autoclass-c-win> Autoclass.exe -reports <.results[-bin] file path>
<.search file path> <.r-params file path>
AutoClass Prediction:
f:\autoclass-c-win> Autoclass.exe -predict <test.. .db2 file path>
<training.. .results[-bin] file path>
<training.. .search file path> <training.. .r-params file path>
USE OF THE AUTOCLASS C SYSTEM -- UNIX PLATFORMS
Assuming that "." is not in $PATH --
To use AutoClass, first you need data (your ".db2" file), then you need to
describe it to AutoClass (your ".hd2" & ".model" files), and also tell
AutoClass what parameter values to use for the search (your ".s-params"
file) and for the report generation (your ".r-params" file). Next, you
generate classification results from your data using
% cd ~/autoclass-c
% ./autoclass-c/autoclass -search data/glass/glassc.db2
data/glass/glass-3c.hd2 data/glass/glass-mnc.model
data/glass/glassc.s-params
and you produce reports with
% ./autoclass-c/autoclass -reports data/glass/glassc.results-bin
data/glass/glassc.search data/glass/glassc.r-params
and, optionally, use this classification for prediction of test cases
% ./autoclass-c/autoclass -predict data/glass/glassc-predict.db2
data/glass/glassc.results-bin
data/glass/glassc.search data/glass/glassc.r-params
See autoclass-c/doc/introduction-c.text for detailed documentation of the
AutoClass C system.
A database with sample classification run output is provided in
~/autoclass-c/sample/.
Test databases, with .db2, .hd2, .model, .s-params, and .r-params
files for each of the model term types, are provided in:
~/autoclass-c/data/autos/
~/autoclass-c/data/3-dim/
~/autoclass-c/data/glass/
~/autoclass-c/data/rna/
~/autoclass-c/data/soybean/
Test summary output for these databases is provided in:
~/autoclass-c/data/tests.c
Note that the parameters specified in the .s-params files for the
test data bases specify repeatable, non-random classification
runs. For proper random classifications of your data sets,
remove these "override" parameters in your .s-params files.
USE OF THE AUTOCLASS C SYSTEM -- WINDOWS PLATFORMS
To use AutoClass, first you need data (your ".db2" file), then you need to
describe it to AutoClass (your ".hd2" & ".model" files), and also tell
AutoClass what parameter values to use for the search (your ".s-params"
file) and for the report generation (your ".r-params" file). Next, you
generate classification results from your data using
> cd f:\autoclass-c-win # for example
f:\autoclass-c-win> Autoclass.exe -search data\glass\glassc.db2
data\glass\glass-3c.hd2 data\glass\glass-mnc.model
data\glass\glassc.s-params
and you produce reports with
f:\autoclass-c-win> Autoclass.exe -reports data\glass\glassc.results-bin
data\glass\glassc.search data\glass\glassc.r-params
and, optionally, use this classification for prediction of test cases
f:\autoclass-c-win> Autoclass.exe -predict data\glass\glassc-predict.db2
data\glass\glassc.results-bin
data\glass\glassc.search data\glass\glassc.r-params
See autoclass-c-win\doc\introduction-c.text for detailed documentation of the
AutoClass C system.
A database with sample classification run output is provided in
f:\autoclass-c-win\sample\.
Test databases, with .db2, .hd2, .model, .s-params, and .r-params
files for each of the model term types, are provided in:
f:\autoclass-c-win\data\autos\
f:\autoclass-c-win\data\3-dim\
f:\autoclass-c-win\data\glass\
f:\autoclass-c-win\data\rna\
f:\autoclass-c-win\data\soybean\
Test summary output for these databases is provided in:
f:\autoclass-c-win\data\tests.c
Note that the parameters specified in the .s-params files for the
test data bases specify repeatable, non-random classification
runs. For proper random classifications of your data sets,
remove these "override" parameters in your .s-params files.
THEORETICAL QUESTIONS:
Contact Peter Cheeseman (cheesem@ptolemy.arc.nasa.gov) if you have
questions concerning the theoretical aspects of AutoClass.
TECHNICAL QUESTIONS:
Contact John Stutz (stutz@ptolemy.arc.nasa.gov) if you have questions
concerning the applicability of AutoClass to your data analysis
situation.
IMPLEMENTATION QUESTIONS:
Contact Will Taylor (taylor@ptolemy.arc.nasa.gov) if you have questions
concerning the implementation, installation, and running of AutoClass C,
including "bugs" and features you may add to the existing code.
REFERENCES:
P. Cheeseman, et al. "Autoclass: A Bayesian Classification System",
Proceedings of the Fifth International Conference on Machine Learning,
Ann Arbor, MI. June 12-14 1988. Morgan Kaufmann, San Francisco, 1988,
pp. 54-64.
P. Cheeseman, et al. "Bayesian Classification", Proceedings of the
Seventh National Conference of Artificial Intelligence (AAAI-88),
St. Paul, MN. August 22-26, 1988. Morgan Kaufmann, San Francisco,
1988, pp. 607-611.
J. Goebel, et al. "A Bayesian Classification of the IRAS LRS Atlas",
Astron. Astrophys. 222, L5-L8 (1989).
P. Cheeseman, et al. "Automatic Classification of Spectra from the Infrared
Astronomical Satellite (IRAS)", NASA Reference Publication 1217 (1989).
P. Cheeseman, "On Finding the Most Probable Model", Computational Models
of Discovery and Theory Formation, ed. by Jeff Shrager and Pat Langley.
Morgan Kaufmann, San Francisco, 1990, pp. 73-96.
R. Hanson, J. Stutz, P. Cheeseman, "Bayesian Classification Theory",
Technical Report FIA-90-12-7-01, NASA Ames Research Center, Artificial
Intelligence Branch, May 1991.
R. Hanson, J. Stutz, P. Cheeseman, "Bayesian Classification with
Correlation and Inheritance", Proceedings of 12th International Joint
Conference on Artificial Intelligence, Sydney, Australia. August 24-30,
1991. Morgan Kaufmann Publishers, San Francisco, 1991, pp. 692-698.
B. Kanefsky, J. Stutz, P. Cheeseman, "An Automatic Classification of a
Landsat/TM Image from Kansas (FIFE)", Technical Report FIA-91-26,
NASA Ames Research Center, Artificial Intelligence Branch, September 1991.
B. Kanefsky, J. Stutz, P. Cheeseman, W. Taylor, "An Improved Automatic
Classification of a Landsat/TM Image from Kansas (FIFE)", Technical
Report FIA-94-01, NASA Ames Research Center, Artificial Intelligence
Branch, January 1994.
J. Stutz, P. Cheeseman, "AutoClass - a Bayesian Approach to Classification",
in "Maximum Entropy and Bayesian Methods, Cambridge 1994", John Skilling
& Sibusiso Sibisi Eds. Kluwer Academic Publishers, Dordrecht, 1995.
P. Cheeseman, J. Stutz, "Bayesian Classification (AutoClass): Theory and
Results", in Advances in Knowledge Discovery and Data Mining,
Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, & Ramasamy
Uthurusamy, Eds. The AAAI Press, Menlo Park, expected fall 1995.