#
# classify_details.txt - How CRM114's LEARN and CLASSIFY really work.
#
# Copyright 2009 William S. Yerazunis.
# This file is under GPLv3, as described in COPYING.
How CRM114's LEARN and CLASSIFY really work.
This document describes the internal workings of the CRM114 LEARN
and CLASSIFY functions. You do _not_ need to know this to use CRM114
effectively; this is to satisfy the curiosity of those who really
want deep knowledge of the tools they use.
(NOTE: since CRM114 now has multiple classifiers available, please read
this whole document. Some of the classifiers are interoperable,
and some are not.)
The current distribution builds in this set of classifiers:
1) SBPH Markovian (the default) This is an extension of Bayesian
classification, mapping features in the input text into a Markov
Random Field. This turns each token in the input into 2^(N-1)
features, which gives high accuracy but at high computation
and memory cost. (note- you can get plain old Bayesian by
specifying the flag < unigram > )
2) OSB Markovian - This is a version of the Markovian that uses
an orthogonal sparse bigram (OSB) feature set, instead of
the SBPH features. OSB seems to be neck-and-neck in accuracy
versus SBPH but it's considerably faster and uses less memory
for the same amount of detail. Because OSB Markovian is a subset
of SBPH Markovian, you can "sort of" intermix .css files generated
by SBPH Markovian and OSB Markovian, although there will be some
loss of accuracy. Fidelis Assis contributed the idea of using OSB
instead of full SBPH feature sets and showed that OSB actually
had advantages.
3) OSB Winnow - This classifier uses the same feature set as OSB-Markovian
but doesn't use a probabilistic estimation at all. Instead, it uses
the Winnow algorithm. The data files aren't compatible, although a good
hacker could probably come up with a way to get an approximate
conversion back and forth from the Markovian models. Like Markovian,
you can specify < unigram > to get single-word features instead of
the OSB feature set; this decreases both disk space used and accuracy.
4) Correlator classification - This classifier doesn't do tokenization
at all. Instead, it slides the example and unknown texts across
each other and measures the cross-correlation. The final scores
go with the square of the run-lengths of matching strings. This
matcher is -very- slow... easily 40 to 100x slower than any of the
other classifiers. It _will_ work against binary files, though,
which none of the other classifiers will.
5) Hyperspatial classification - this experimental classifier
tokenizes, but does not use Bayes law at all, nor statistical
"clumping". During learning, each example document generates a
single point in a 4 billion dimensional hyperspace. The
classification algorithm places a light source at each of these
points, and measures the sum of the radiant power from all of the
light sources of each class. The class yielding the highest
radiant power for the unknown document is considered to be the
correct class. By default, this uses the OSB feature set, but you
can use < unigram > to switch to single-word features (decreases disk
usage but costs accuracy).
6) (Not a classifier, but covered at the end:) the format of the .css
and .cow files - some design notes and how microgrooming works.
Here are the details for each classifier:
Classifier 1: Markovian
The general concept is this: break the incoming text into short
phrases of from one to five words each. A phrase can have words in
the middle of the phrase skipped (e.g. "BUY <skip_word> ONLINE NOW!!!"
is always a bad sign.), and more than one phrase can use the same
word. You can't change the order of the words, but you _can_ bridge
across newlines, punctuation, etc. Make all the phrases you can make.
For each phrase you can make, keep track of how many times you
see that phrase in both the spam and nonspam categories. When you
need to classify some text, make the phrases, and count up how many times
all of the phrases appear in the two different categories. The
category with the most phrase matches wins.
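
To make the phrase-building concrete, here is a tiny sketch in Python
(illustrative only - the real CRM114 code works on hashes, not strings).
It enumerates the sixteen skip-phrases that come out of one five-word
window, matching the "the quick brown fox" example later in this document:

    from itertools import combinations

    def window_phrases(window):
        """Yield all skip-phrases for one window of up to 5 words: the first
        word is always kept, each later word is either kept or replaced by
        <skip>, and trailing skips are dropped."""
        rest = window[1:]
        for r in range(len(rest) + 1):
            for kept in combinations(range(len(rest)), r):
                last = kept[-1] if kept else -1
                phrase = [window[0]]
                for j in range(last + 1):
                    phrase.append(rest[j] if j in kept else "<skip>")
                yield " ".join(phrase)

    if __name__ == "__main__":
        for p in window_phrases("the quick brown fox jumped".split()):
            print(p)           # 2^4 = 16 phrases for a full window
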
Note that you never have to cross-reference between the two category
phrase sets. If a phrase appears in both categories an equal number
of times, then both categories get an equal score boost. Since
an equal score boost doesn't change which category will win, there's
no need to cross-reference category phrase counts.
NB: This process is called "sparse binary polynomial hashing" because
it uses a set of polynomials to generate a hash-of-hashes; sparse because not
all words are represented by nonzero terms, binary because the
changing coefficient terms are always either 0 or 1, and a hash
because, well, it's a hash. :)
Instead of simply comparing raw count scores, we do a Bayesian
chain-rule to calculate the probability of "good" versus "evil".
(CRM114 actually has no knowledge of spam and nonspam, just two
sets of user-defined classes that can be whatever you want them to be.
This explanation will use 'spam' and 'nonspam', but internally, it's
just "these statistics files here" and "those statistics files there")
The Bayesian chain rule formula is

                           P(A|S) P(S)
     P(S|A)  =  ---------------------------------
                 P(A|S) P(S)  +  P(A|NS) P(NS)
which (in words) says: "The NEW chance of spam, given some feature A,
is equal to the chance of A given spam times the OLD chance that it's
spam, divided by the sum of the chance of A given spam times the old
chance it's spam plus the chance of A given nonspam, times the old
chance it's nonspam".
We start assuming that the chance of spam is 50/50.
We count up the total number of features in the "good" versus "evil"
feature .css files. We use these counts to normalize the chances of
good versus evil features, so if your training sets are mostly "good",
it doesn't predispose the filter to think that everything is good.
We repeatedly form a feature with the polynomials, check the .css files
to see what the counts of that feature are for spam and nonspam, and
use the counts to calculate P(A|S) and P(A|NS) [remember, we correct
for the fact that we may have different total counts in the spam and
nonspam categories].
We also bound P(A|S) and P(A|NS) to prevent any 0.0 or 1.0
probabilities from saturating the system. If you allow even _one_ 0.0
or 1.0 into the chain rule, there's no way for the system to recover
even in the face of overwhelming evidence to the contrary. The
actual bound in use depends on the total number of counts of the
feature A ever encountered, irrespective of their good/evil nature.
[additional note: versions from 20030630 to 20031200 used a
fairly gentle method to generate the local probabilities from
the relative hit counts. From 20031200 onward, this local probability
was modified by the number and sequence of the terms of the
polynomial. The best model found so far is a set of coefficients that
model a Markov chain; polynomials that have a longer chain length
(and therefore a closer match) get a significantly higher boost.]
Once we have P(A|S) and P(A|NS), we can calculate the new P(S) and
P(NS). Then we get the next feature out of the polynomial hash
pipeline (each extra word makes 16 features) and repeat until we hit
the end of the text. Whichever set has the greater probability wins.
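
Here is a minimal sketch of that chain-rule loop in Python (illustrative
only; the bounding constants follow the text, the demo feature counts are
made up, and the normalization for unequal corpus sizes is left out for
brevity):

    TINY = 1e-300          # stand-in for the host's MINDOUBLE

    def local_probability(spam_count, nonspam_count):
        """P(A|S)-style local estimate for one feature, bounded away from 0 and 1."""
        total = spam_count + nonspam_count
        if total == 0:
            return 0.5                       # an unseen feature says nothing
        floor = 1.0 / (total + 2)
        return min(max(spam_count / total, floor), 1.0 - floor)

    def chain_rule_update(p_spam, p_feature_given_spam):
        """One Bayesian chain-rule step; returns the new P(spam)."""
        num = p_feature_given_spam * p_spam
        den = num + (1.0 - p_feature_given_spam) * (1.0 - p_spam)
        return min(max(num / den, TINY), 1.0 - TINY)

    if __name__ == "__main__":
        p = 0.5                              # start at 50/50
        for spam_hits, nonspam_hits in [(40, 2), (7, 9), (0, 13)]:
            p = chain_rule_update(p, local_probability(spam_hits, nonspam_hits))
            print(round(p, 6))
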
We also take multiple files AS A GROUP, so it's as though we added
the corresponding hash buckets together for everything on the left
of the | and everything on the right.
-----
Now, on to the brutish details for the Markovian classifier:
In terms of the actual implementation, LEARN and CLASSIFY are
pipelined operations. The pipeline has these stages (as of the
2002-10-21 version) :
1) Tokenization. The input text is tokenized with the supplied regex
(usually [[:graph:]]+ ) into a series of disjoint word tokens.
2) Each word token is hashed separately. The hash used is a "fast hash",
not particularly secure, but with reasonably good statistics.
3) Each hash is pushed into the end of a five-stage pipeline. Each
value previously pushed moves down one level in the pipeline.
4) The pipeline stages are tapped to supply values H0 through H4 that
will be multiplied by the particular polynomial's coefficients. (H4
being the newest value).
5) After each value is pushed into the hash pipeline, the full set of
polynomials are evaluated. These polynomials have changed over
various releases, but as of 2002-10-23 the coefficients are:
     poly# \ for:    H4   H3   H2   H1   H0

         1            0    0    0    0    1
         2            0    0    0    3    1
         3            0    0    5    0    1
         4            0    0    5    3    1
         5            0    9    0    0    1
         6            0    9    0    3    1
         7            0    9    5    0    1
         8            0    9    5    3    1
         9           17    0    0    0    1
        10           17    0    0    3    1
        11           17    0    5    0    1
        12           17    0    5    3    1
        13           17    9    0    0    1
        14           17    9    0    3    1
        15           17    9    5    0    1
        16           17    9    5    3    1
(yes, it's like counting in binary, but the low-order bit is always
turned on so that the low-order bits of the polynomial result are always
affected by all nonzero elements of the hash pipeline. "Skipped"
words have a coefficient of zero, which zeroes their effect on the
output of that polynomial, "skipping" the word. A code sketch of this
feature-generation pipeline appears after this list.)
6) These 16 results (call them "superhashes") reflect all phrases up to
length 5 found in the input text. Each is 32 bits long.
7) Each of the .css files is mmapped into virtual memory. The default
size of a .css file is one megabyte plus one byte, and each byte of
a .css file is used as a single 8-bit unsigned integer. Using the
length of the .css file as a modulus, each superhash value maps
into a particular byte of the .css file. Each .css file also has a
"score", initialized to zero.
8) if we're LEARNing, we increment the byte at that superhash index in
the .css file (being careful not to overflow the bucket limit, so
the maximum value is actually something quite a bit smaller than 32 bits)
9) (pre-Nov-2002 versions): if we're CLASSIFYing, we increment the
per-.css-file score of that .css file by the number found in that
superhash-indexed byte.
(post-Oct-2002 versions): if we're CLASSIFYing, instead of just
incrementing the per-.CSS file scores, we (a) normalize the
relative proportions of the .css files with respect to the total
number of features in each .css file, (b) convert the bin values
indexed by the superhash to a probability, (c) "clip" the
probability values to reasonable values (there is no such thing as
"certainty" with a finite sample of an infinite and nonstationary
source such as human language), and (d) update the running
probability using the Bayesian chain rule formula above.
10) repeat the above pipeline steps for each "word" in the text.
11) The .css file with the larger score (or probability) at the end
"wins".
There you have it. Previous polynomial sets (using only H0 through H3 of
the hash pipeline, with prime-number coefficients) have reached over
99.87% accuracy. The best that the 5-stage pipeline
has reached for me is 99.984%, and it averages around 99.95%
accuracy over months and months of use.
n.b. slight error in edge effects - right now, we don't execute the
pipeline polynomial set until the pipeline is full; conversely we stop
executing the polynomial set when we run out of tokens. This means
that we don't give the first and last few tokens of the email the full
treatment; that's a bug that should be rectified. The other side of the
problem is that filling and flushing the pipe gives worse results
by putting too much emphasis on "zero hash" and too much emphasis
on the first and last few words.
n.b.: Arne's Optimization: If the singleton word (H0 alone) doesn't
appear or has a count of 0, then it's useless to check for any further
combinations, as you know they can't appear unless H0 also appeared.
This speedup gives you about 2x speed improvement.
---More details on the post-Nov-2002 release:---
In releases after Nov 1, 2002, instead of just comparing counts, we do
the true Bayesian chain rule to calculate the probabilities of pass
versus fail. The bounding limits first clamp the local probability
to the range [ 1/(featurecount+2) , 1 - 1/(featurecount+2) ],
and then add further uncertainty to that bound by a
factor of 1/(featurecount+1).
We do the chain rule calculation and then we clip the minimum
probability to MINDOUBLE, which is host specific but is a VERY small
number (on the order of 10^-300 for Intel boxes). This further
prevents getting the chain rule stuck in a 0.0 / 1.0 state, from which
there is no recovery.
Lastly, because of underflow issues, we quickly lose significance in
the greater of the two probabilities. For example, 1.0 - (10^-30) is
exactly equal to 1.00000; yet 10^-30 is easily achievable in the
first ten lines of text. Therefore, we calculate the chainrule
probabilities twice, using P(S) and P(NS) separately, and then use the
smaller one to recompute the larger one. Thus, even if there's
arithmetic underflow in computing the larger probability, we still
retain the full information in the smaller probability.
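
In code, the trick looks roughly like this (Python, illustrative; the demo
per-feature probabilities are made up):

    def chain_pair(p_s, p_ns, p_feat_s, p_feat_ns):
        """One chain-rule step carried out for both classes independently."""
        den = p_feat_s * p_s + p_feat_ns * p_ns
        new_s = (p_feat_s * p_s) / den
        new_ns = (p_feat_ns * p_ns) / den
        # 1.0 - 1e-30 rounds to exactly 1.0 in double precision, so trust
        # whichever value is smaller and rebuild the larger one from it.
        if new_s < new_ns:
            return new_s, 1.0 - new_s
        return 1.0 - new_ns, new_ns

    if __name__ == "__main__":
        p_s, p_ns = 0.5, 0.5
        for _ in range(60):
            p_s, p_ns = chain_pair(p_s, p_ns, 0.9, 0.1)
        print(p_s, p_ns)    # p_ns stays meaningful instead of collapsing to 0.0
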
--- Yet More Details - for Post-200310xx Versions ----
During the summer and fall of 2003, I continued experimenting with
improvements to SBPH/BCR as described above. It became clear that
SBPH/BCR was _very_ good, but that it was still operating within the
limits of a linear classifier without hidden layers - i.e., it was
a perceptron (with all of the limitations that perceptron-based
classifiers have).
Luckily, the databases in CRM114 are more than adequate to support
a higher-level model than a simple linear perceptron classifier.
I tested a 5th order Markovian classifier, and found that it was
superior to any other classifier I had tried.
A Markovian classifier operates on the concept that _patterns_ of
words are far more important than individual words.
For example, a Bayesian encountering the phrase "the quick brown fox
jumped" would have five features: "the", "quick", "brown", "fox", and
"jumped".
A Sparse Binary Polynomial Hasher would have sixteen features:
the
the quick
the <skip> brown
the quick brown
the <skip> <skip> fox
the quick <skip> fox
the <skip> brown fox
the quick brown fox
... and so on. But each of these features would receive the same
weighting in the Bayesian chain rule above.
The change to become a Markovian is simple- instead of giving each
Sparse Binary Polynomial Hash (SBPH) feature a weight of 1, give each
feature a weight corresponding to how long a Markov Chain it matches
in either of the archetype texts.
A simple way to do this would be to make the weight equal to the number
of words matched - in this case the weights would be:
the 1
the quick 2
the <skip> brown 2
the quick brown 3
the <skip> <skip> fox 2
the quick <skip> fox 3
the <skip> brown fox 3
the quick brown fox 4
and indeed, this gives some improvement over standard SBPH.
But there is room for further improvement. The filter as stated above
is still a linear filter; it cannot learn (or even express!) anything
of the form:
"A" or "B" but not both
This is a basic limit discovered by Minsky and Papert in 1969 and
published in _Perceptrons_.
In this particular case there is a convenient way to work around this
problem. The solution is to make the weights of the terms
"superincreasing", such that long Markov chain features have so high a
weight that shorter chains are completely overruled.
For example, if we wanted to do "A or B but not both" in such a
superincreasing filter, the weights:
"A" at 1
"B" at 1
"A B" at -4
will give the desired results.
For convenience in calculation, CRM114 uses the superincreasing
weights defined by the series 2^(2n)- that is,
the 1
the quick 4
the <skip> brown 4
the quick brown 16
the <skip> <skip> fox 4
the quick <skip> fox 16
the <skip> brown fox 16
the quick brown fox 64
Note that with these weights, a chain of length N can override
all chains of length N-1, N-2, N-3... and so on.
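
In code, the weight is just a power of four on the number of words actually
matched (Python, illustrative; it reproduces the table above):

    def markov_weight(words_matched):
        return 1 << (2 * (words_matched - 1))      # 1, 4, 16, 64, ...

    if __name__ == "__main__":
        for phrase, n in [("the", 1), ("the quick", 2),
                          ("the <skip> brown fox", 3), ("the quick brown fox", 4)]:
            print(f"{phrase:25s} weight {markov_weight(n)}")
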
This is particularly satisfying, because the standard .css files
already contain all of the information needed to do this more advanced
calculation. The file format is not only compatible, it is _identical_
and so users don't have to re-start their training.
This Markovian matching gives a considerable increase in accuracy
over SBPH matching, and almost a factor of 2 improvement over Bayesian
matching. It is now the default matching system in CRM114 as of
version 200310xx.
--------------------------------------------------------
The OSB Markovian classifier
OSB (Orthogonal Sparse Bigram) is a simplification of SBPH inspired by
Fidelis Assis. The change is to _omit_ all of the word combinations
that don't have exactly two word tokens in them. This method has fewer
features, but is often as good as or even better than Markovian in
accuracy.
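
A sketch of the OSB feature generation, using the same skip-phrase notation
as the SBPH example earlier (Python, illustrative only):

    def osb_features(tokens, window=5):
        """Keep only the combinations with exactly two real words: each word
        pairs with each of the next four words, gap encoded as <skip>s."""
        for i, anchor in enumerate(tokens):
            for gap in range(1, window):
                if i + gap < len(tokens):
                    yield " ".join([anchor] + ["<skip>"] * (gap - 1) + [tokens[i + gap]])

    if __name__ == "__main__":
        for f in osb_features("the quick brown fox jumped".split()):
            print(f)       # 4 sparse bigrams per word instead of 16 SBPH phrases
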
Because it has fewer features, it needs less space in the .css files
for equal accuracy; because it generates fewer features, it also runs
considerably faster than Markovian. Other than that, it's pretty
similar.
It's sufficiently similar that OSB and Markovian can even use each
other's .css files (with some decrease in accuracy). It's not
recommended, but it works.
---------------------------------------------------------------
The Winnow classifier
Winnow is a different way of classifying; it doesn't generate
probabilities but rather weights. The version in CRM114 at this
particular time uses the OSB feature set. Christian Siefkes, Shalendra
Chhabra, and Laird Breyer did the first hacking on this, then with
Fidelis Assis' OSB feature generator it really took off.
Here's a quick synopsis of the algorithm:
1) Every possible feature, from AAAA to ZZZZZZZ, starts with
a weight of 1.000000 (note, we only record weights that _aren't_
1.000000; so if we don't find a feature in our feature list, we
can assume it has a value of 1.0000).
2) To learn, we do these steps in order:
- generate all of the OSB features in the example text
- delete all duplicate features
- if the example is an example "in the class", multiply every
found feature's weight by the "Promotion Constant", which
is empirically set at 1.23
- if the example is a text that is NOT supposed to be "in the class",
we multiply each found feature's weight by the "Demotion
Constant", which is empirically set at .83
(note that no matter how many times a feature appears in
a particular example, it only gets promoted or demoted ONCE).
3) To classify, we do these steps in order:
- generate all of the OSB features in the unknown text
- delete all duplicate features
- add up all of the weights of these features in each
of the statistics files. (don't forget that any
feature that doesn't exist in the stats file gets a
default value of 1.00000 !)
- The score of each file is the total weight accumulated
by the per-features, divided by the total number of features.
(note that since not-seen-before features score 1.0000, a
totally inconclusive score is Nfeatures/Nfeatures = 1.0000)
- The file with the highest score wins.
Winnow works best when you add a "Thickness factor" correction, where
you train not just on error, but rather in this less subtle way:
If the _correct_ class didn't score at least "Thickness" above
the decision threshold (in pR, the decision threshold is 0.0)
then train the _correct_ class with the example text in
correct (promotion) mode.
If the _incorrect_ class didn't score at least "Thickness"
below the decision threshold ( again, in pR units), then
train the incorrect class in error (demotion) mode. This
is done with the < refute > flag.
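
Here is a condensed sketch of the Winnow learn / classify / thick-threshold
steps in Python (illustrative only: feature generation is abstracted away as
a list of strings, and the pR bookkeeping of the real classifier is reduced
to a plain score difference against a made-up thickness):

    PROMOTE, DEMOTE = 1.23, 0.83           # empirical constants from the text

    def winnow_learn(weights, features, in_class):
        factor = PROMOTE if in_class else DEMOTE
        for f in set(features):            # each feature promoted/demoted ONCE
            weights[f] = weights.get(f, 1.0) * factor

    def winnow_score(weights, features):
        feats = set(features)              # unseen features default to 1.0
        return sum(weights.get(f, 1.0) for f in feats) / len(feats)

    def thick_train(w_correct, w_incorrect, features, thickness=0.1):
        """Retrain both classes unless the margin is already comfortable."""
        margin = winnow_score(w_correct, features) - winnow_score(w_incorrect, features)
        if margin < thickness:
            winnow_learn(w_correct, features, in_class=True)     # promote
            winnow_learn(w_incorrect, features, in_class=False)  # demote (refute)

    if __name__ == "__main__":
        spam, good = {}, {}
        thick_train(spam, good, ["free <skip> pills", "click here"])
        print(winnow_score(spam, ["click here"]), winnow_score(good, ["click here"]))
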
Winnow is a well-known classification algorithm in pattern
recognition; the current implementation will probably be upgraded and
debugged in newer releases.
----------------------------------------------------------------
The Correlator classifier
The correlator classifier is different! The correlator classifier
slides the window of the unknown text across the known texts, and
counts places where the same character appears in both... well,
actually, it counts the sum of the squares of the runlengths of the
matching strings, reiterated at each point in the string. If the
letters don't match, nothing is counted.
So, "The quick brown fox jumped over the lazy dog's back 0123456789",
matched against "cat", will get just three points- one for the C in
back matching the c in cat, and two for the a in cat matching the a's
in lazy and back. (note that the T in The doesn't match the t in cat,
because they're different cases). However, "lawn fog" will match
the five-character sequence "wn fo" giving 1 + 4 + 9 + 16 + 25 = 55
points.
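
A literal-minded sketch of that scoring in Python (illustrative;
alignment_score() handles one offset, and correlate() sums it over every
possible offset, which is exactly why this classifier is so slow):

    def alignment_score(known, unknown, offset):
        score, run = 0, 0
        for j, ch in enumerate(unknown):
            k = offset + j
            if 0 <= k < len(known) and known[k] == ch:
                run += 1
                score += run * run            # 1, then 4, 9, 16, 25, ...
            else:
                run = 0
        return score

    def correlate(known, unknown):
        return sum(alignment_score(known, unknown, off)
                   for off in range(-len(unknown) + 1, len(known)))

    if __name__ == "__main__":
        # the five-character "wn fo" run from the example: 1+4+9+16+25 = 55
        print(alignment_score("brown fox", "lawn fog", offset=1))
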
Note that EVERY POSSIBLE substring in the unknown text is compared
against the known texts. This is Markovian with a major vengeance
streak (or death wish, if you don't have lots of CPU and CPU cooling
to spare. > 100x slowdown is entirely possible with this correlation
classifier; consider yourself warned.).
The databases of correlator classifiers are NOT compatible with the
.css files of SBPH, OSB, Markovian, and Winnow classification. Don't
even think of intermixing them.
--------------------------------------------------------------
Update to versions Post June 27, 2005 - the Hyperspace Classifier
The hyperspatial classifier is a new classifier built into CRM114; at
this writing it's more experimental than not. However, it shows
_extremely_ good statistics and uses very little CPU and disk space,
so we're putting it out there for people to play with.
Like the other CRM114 classifiers, the hyperspace classifier is
"activated" by an option flag (obviously, it's <hyperspace>); this
makes it easy to compare hyperspatial results with more conventional
methods without changing your testing framework (you can freely swap
among <unigram>, <osb>, <winnow>, <hyperspace>, etc.; just don't intermix
the storage file types, as most are not interchangeable).
Most statistical classifiers combine statistics of a large number of
example documents into a single class. The hyperspatial classifier
doesn't - it considers each document to be a single data point in a
very-high-dimensional hyperspace. Thus, each document retains its
individual identity; there is no mixing of features between documents.
This is both a strength (hyperspatial classification is basically
immune to word-salad spam) and a weakness (hyperspatial classification
generalizes between document types more weakly than other
classifiers).
The current implementation of hyperspatial classification uses an
intercepted-power decision algorithm- a light source is placed at each
learned data point in the N-dimensional hyperspace. These light
sources illuminate the hyperspace location of the unknown document,
with some total power, called the total radiance. The class with the
greatest total radiance at the unknown document's location in
hyperspace is considered to be the "correct" class.
By analogy - consider each known document to be a galaxy of stars;
when viewed from the hyperspace position of an unknown document, the
class with the brighter galaxies wins.
Note that because radiance drops off with the inverse square of
distance, proximity of two documents in hyperspace (that is,
similarity of the two documents) is a very strong indicator of class
membership.
This is quite different from the linear separators such as Bayesian or
Markovian classification. The hyperspace classifier does not create a
linear border like a Bayesian classifier based on weights, nor like an
SVM, where the data is projected into a higher-dimensional space and
then a linear separation plane is calculated to maximize the
margins. Instead, hyperspace classification uses 1/r^2 as the
weighting function and the dividing surface between classes can become
highly convoluted. The result is similar to a Voronoi diagram except
that points that are distant from the surface still exert a small
amount of control.
The hyperspatial classification method can "learn" complex
classification spaces that simply cannot be learned by linear
classifiers or statistical classifiers, and perhaps not by SVMs. For
example, no linear classifier can "learn" a checkerboard situation (an
XOR is a 2x2 checkerboard). An SVM classifier can only learn such a
situation if the original feature space can be mapped to some higher
dimensional feature space where all of the squares of one color map to
one side of a hyperplane and squares of the other color map to the
other side of the hyperplane.
In contrast, the hyperspatial classifier algorithm will learn a
highly nonlinear problem such as a checkerboard in O(n) examples, where n
is the number of squares on the checkerboard. Convergence is believed
to be assured by the 1/r^2 falloff.
--- Implementation Details ---
The actual location in hyperspace of each document is determined
by the features it contains. More precisely, each of the 2^32
dimensions of a document is determined by the following rule:
"The coordinate of a document in a particular dimension of the
hyperspace is equal to the number of times a feature appears in
that document such that the feature hashes to the index of that
hyperspace dimension."
Because most documents contain far fewer than 2^32 features (the
'features' being obtained via the OSB algorithm), the most compact
representation for a document's hyperspace coordinate is just the list
of nonzero dimensions. Because most nonzero dimensions are of value
1, we don't even include the distance along that dimension. In the
rare circumstance that more than one feature hashes to the same 32-bit
hyperspace dimension index, the distance of 2 along that dimension is
represented by having two copies of that dimension index in the
dimension list; a distance of three being represented by three copies,
and so on.
The hashed feature values of the unknown document are sorted after
generation; the known document feature hash values are stored in pre-sorted
form. This means a single two-index pass, similar to the merge step of a merge sort,
can quickly find all features that existed only in one document, that
existed only in the other document, or existed in both documents.
The distances between known and unknown documents are then calculated
based on the counts of mutually present and mutually disjoint features
found. The actual formula for distance used is:
                                    found_in_both ^ 2
     distance  =  SQRT ( ----------------------------------------------- )
                          found_only_in_known * found_only_in_unknown
The radiance received by the unknown document is then calculated by
the standard inverse-square law for radiant energy versus distance
on a per-document basis (in the current version, each document has the
same source energy emitted, but this is probably suboptimal and is
a topic under active research):
     total radiance of class = SUM ( source_energy[i] / distance_to_source[i] ^ 2 )
                               [ i = over all sources ]
This gives a total radiance for each of the N classes being considered
for classification. The "winning" class is the class with the highest
total radiance.
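
Here is a compact sketch of that scoring in Python (illustrative only, not
the C code; it assumes unit source energy per document, a small guard
constant to avoid division by zero, and a distance that shrinks as two
documents share more features):

    def shared_and_only(known, unknown):
        """Merge two sorted hash lists; return (in_both, only_known, only_unknown)."""
        i = j = both = only_k = only_u = 0
        while i < len(known) and j < len(unknown):
            if known[i] == unknown[j]:
                both += 1; i += 1; j += 1
            elif known[i] < unknown[j]:
                only_k += 1; i += 1
            else:
                only_u += 1; j += 1
        return both, only_k + (len(known) - i), only_u + (len(unknown) - j)

    def class_radiance(class_docs, unknown):
        total = 0.0
        for doc in class_docs:                 # every learned document shines
            both, only_k, only_u = shared_and_only(doc, unknown)
            d_squared = (only_k * only_u + 1e-6) / (both * both + 1e-6)
            total += 1.0 / d_squared           # unit source energy / distance^2
        return total

    if __name__ == "__main__":
        spam = [sorted([11, 22, 33, 44]), sorted([11, 22, 55, 66])]
        good = [sorted([77, 88, 99])]
        msg  = sorted([11, 22, 33, 99])
        print(class_radiance(spam, msg), class_radiance(good, msg))
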
Because every document that has ever been learned must be queried,
this sounds like an expensive computation. The actual reality is that
it's very fast to compute.
The current implementation (nasty, brutish, and straight C code) does
no optimizations beyond sorting the features in each document
mentioned above. With this minimal optimization, training with a
thick threshold of pR of 0.5 the hyperspatial classifier runs more
than four times as fast as OSBF or OSB, and more than ten times faster
than Markovian. (the SpamAssassin test set of 4147 messages is fully
processed in as little as 2 minutes 15 seconds- that's 32 milliseconds
per message, versus Markovian at 25+ minutes per test set, both on a
Transmeta 990 MHz laptop)
The disk space required for hyperspace's data storage is on the order
of 300 Kbytes per class (compared to 12 MEGAbytes for Markovian and
Winnow, and 4 megabytes for OSB or OSBF), and the accuracy is about
twice as good as Markovian or OSB. It seems to be superior in
accuracy to everything except possibly OSBF (with which it's
comparable) but unlike OSBF, the hyperspatial classifier seems to
always converge.
As a further improvement, we can greatly increase the speed of the
current system by changing the storage of known documents from the
simple array currently used (with NULL-bucket markers to indicate the
end of each document) to a hash-based lookup. Each hashbucket
contains the 32-bit hash of the feature and the 32-bit identifier of
the known-class document that contained it; multiple documents
containing the same feature consume multiple buckets. Embedded in the
table are also hashed buckets containing the total feature count of
each known document (we can steal hash codes 0x00000000 through
0x0000ffff for these document-generic data buckets). In this way, we
will make only references into the hash table corresponding to the
actual features found; features not in the unknown document require
zero memory cycles and zero computation. This improvement is not yet
in the current code, and it may in fact not optimize for speed because
the current sequential pass is highly coherent in the CPU cache and
hash-based fetching is highly non-coherent.
Another alternative to be tested is a tree-based lookup to try to keep
greater cache locality.
A third improvement (which is portable to other classifiers doing
feature lookup of any type) is to generate the document's features in
one phase, then sort those features in the next phase in such a way as
to maximize cache coherence (ideally, this means knowing the actual
layout of the backing storage system) and then performing the actual
feature database lookups in a third phase. This may have a significant
impact on the overall system speed.
---- Update - August 2007 - the Bit-Entropy classifier -----
Last year at TREC Andrej Bratko showed off a new classifier technology
based on optimal compression algorithms. The basic concept is that
you train text compressors on each class of the known texts, and then
use those compressors to compress the unknown text. A closer match
between the unknown text and the compressor's pretraining yields a
higher compression rate (and a shorter output); the unknown text is
judged to belong to the class with the shortest output stream.
Bratko's original system is closed source and based on
character-at-a-time compression, and used large amounts of memory -
typically multiple 0.5 gigabyte chunks. The CRM114 implementation is
totally new code and instead works on the "bit at a time" principle;
each bit in the incoming message is treated as a separate symbol for
the purpose of classification. Thus, there is no need for a
tokenization or feature generation step; the features are the
incoming stream of zeroes and ones.
The result is a classifier that is comparable in speed to OSB but
needs no tokenization; the downside is that it has a fairly large
memory footprint (36 megabytes for a million-node statistics
file, versus 6 megabytes for an OSB system).
Additionally, the creation of an optimal incrementally-taught
compression algorithm is nontrivial. Typical optimal compression is
done with a Markov model; however, the reader has to be aware that this
is not the same as the "Markov" classifier described above (which is
better described as a Markov Random Field representing a hidden Markov
model of the text, rather than a bit-serial Markov chain as used for
bit-entropy classification).
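
Here is a toy illustration in Python of the compress-and-compare idea (it
uses a flat next-bit context table rather than the toroidal node lattice the
real classifier uses, so treat it only as a demonstration of the principle):

    from math import log2

    def bits(text):
        for byte in text.encode("utf-8"):
            for i in range(8):
                yield (byte >> (7 - i)) & 1

    def train(model, text, order=12):
        ctx = 0
        for b in bits(text):
            counts = model.setdefault(ctx, [1, 1])   # Laplace-smoothed counts
            counts[b] += 1
            ctx = ((ctx << 1) | b) & ((1 << order) - 1)

    def code_length(model, text, order=12):
        total, ctx = 0.0, 0
        for b in bits(text):
            c0, c1 = model.get(ctx, [1, 1])
            total += -log2((c1 if b else c0) / (c0 + c1))
            ctx = ((ctx << 1) | b) & ((1 << order) - 1)
        return total                                 # predicted output bits

    if __name__ == "__main__":
        spam, good = {}, {}
        train(spam, "buy cheap pills online now " * 20)
        train(good, "meeting agenda attached for review " * 20)
        msg = "cheap pills available online"
        print(code_length(spam, msg), code_length(good, msg))  # shorter = closer
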
There are two methods used to construct the bit-entropy Markov chain:
one where an initial assumption of the chain is made, and one where it
isn't made. The default configuration is to assume that the bit-entropy
Markov chain is representable by a toroid, of some reasonable shape,
with the interconnections done in a reasonable way. The default
in CRM114 as of 20070101 is to form an array of 256 rows and 256 columns
of nodes (65,536 nodes total), and connect them into a toroid by a
"perfect shuffle"- that is, the next state for a zero bit is the next
column, row index
--------------------------------------------------------------------
Data File Formats
The format of an SBPH or OSB Markovian .css file (and, for Winnow, a
.cow file) is a 64-bit hash of a feature (whether the feature is a
single word, a bigram, or a full SBPH phrase does not matter) plus a
32-bit representation of the value. In .css files, the 32-bit value
is an unsigned integer giving the number of occurrences of this
particular feature in the training set; in .cow files it's a 32-bit
floating point weight: greater than 1.000 means "preponderance of
evidence in favor", less than 1.000 means "preponderance of evidence
against", and a value of exactly 1.000 means "no information". (In a
.cow file, as in a .css file, a value of exactly 0.000 with all 64
bits of the hash == 0 means "unused slot".)
For fast access, the first 32 bits of the hash (called h1 in the
code) is used as an index (modulo the length of the .css/.cow file)
and that's the preferred slot location to put this data. If that slot
is already in use, the next slot is used. If that is already in use,
the _next next_ slot is used... and so on.
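
A sketch of that lookup in Python (illustrative; the real files are mmapped
arrays of fixed-size binary buckets, not Python tuples):

    EMPTY = (0, 0, 0)                   # (h1, h2, value) all zero -> unused slot

    def find_slot(table, h1, h2):
        """Return the feature's slot index, or the empty slot that ends its chain."""
        i = h1 % len(table)             # preferred slot
        while table[i] != EMPTY:
            if table[i][0] == h1 and table[i][1] == h2:
                return i                # found the feature
            i = (i + 1) % len(table)    # walk the overflow chain
        return i                        # end of chain: first free slot

    def bump(table, h1, h2):
        i = find_slot(table, h1, h2)
        value = table[i][2] + 1 if table[i] != EMPTY else 1
        table[i] = (h1, h2, value)

    if __name__ == "__main__":
        table = [EMPTY] * 8
        bump(table, 5, 42)
        bump(table, 13, 99)             # 13 % 8 == 5: collides, probes to slot 6
        bump(table, 5, 42)
        print(table)
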
This "next slot not yet used" is an overflow chain. For best
performance, you want to keep the chains short. But that wastes a lot
of space. 90-95 per cent utilization is a good compromise.
Note that the time to find a slot (or to find that it's not there) goes with
the length of the overflow chains- so long chains are _very_ bad for
performance. I usually set a limit of 256 or even 128 on chains.
Once you go past that limit, you need to start expiring old data out
of the chain. You can do that by zapping out low-valued (not very
significant) slots, but that means old, stale, but originally high-valued
slots never expire. Another method would be to use an LRS (Least
Recently Seen) tracking system, but that would use up a lot more
disk space for the .css/.cow files- almost doubling it is the
best estimate I have.
"Microgrooming" adds a random factor. A feature is microgroomed
if its hash is equal to a pseudorandom number - and the microgrooming
is merely a _lessening_ of the significance. If the significance of a
slot drops below "saw it once", the slot is reclaimed for reuse.
Note also we don't groom the whole .css/.cow file. We groom _only_
the chain that we noticed was too long. This minimizes how much
data we lose in a microgroom (face it, database grooming/expiring is
brain surgery with a butter knife... microgrooming is just using
the serrations on the edge to minimize how much we scrape away).
Note that this works pretty well- most of the slots ever used in a
.css/.cow file contain only a single occurrence, so reclaiming a small
fraction of them (currently 1 in 16, scattered randomly) is a good
compromise. It also will eventually expire out even the largest
feature if that feature is not ever retrained.
(And the killer bug? Well, consider how we know we've reached
end-of-chain. We see a zeroed slot. Microgrooming puts in a number
of zeroed slots - each of which is seen as a chain terminator. BUT -
when we microgroom, we need to re-check the locations of each slot's
worth of data, to make sure it's findable - that is, that it isn't
separated from its optimal location by a freshly zeroed slot, which
would falsely indicate end-of-chain.)
This is "repacking" the chain. And the code that did it had a bug
that repacked only the first part of the chain and then stopped.
This meant that the tail of the chain (avg 50% or so) could NOT
be found- the data there was lost!
This bug has now been (hopefully) stomped.
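
Here is a sketch of the microgroom-and-repack idea as it should work
(Python, illustrative; the 1-in-16 random selection follows the text, but
the exact lessening rule and the repack-by-reinsertion are assumptions):

    import random

    EMPTY = (0, 0, 0)

    def microgroom_chain(table, start):
        """Groom the chain that begins at slot `start` and ends at the next
        empty slot, then repack ALL of its survivors."""
        i, chain = start, []
        while table[i] != EMPTY:                  # pull out the whole chain
            chain.append(table[i])
            table[i] = EMPTY
            i = (i + 1) % len(table)
        survivors = []
        for h1, h2, value in chain:
            if random.randrange(16) == 0:         # ~1 in 16 slots get lessened
                value -= 1
            if value > 0:                         # below "saw it once": reclaim
                survivors.append((h1, h2, value))
        # repack: reinsert every survivor starting from its preferred slot so
        # nothing is stranded behind a freshly zeroed slot (the old bug lost
        # the tail of the chain here)
        for h1, h2, value in survivors:
            j = h1 % len(table)
            while table[j] != EMPTY:
                j = (j + 1) % len(table)
            table[j] = (h1, h2, value)
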
-Bill Yerazunis