dbacl NEWS -- history of user-visible changes. From August 2004.
Copyright (C) 2004, 2005 Laird Breyer.
Added the "Can spam filters play chess?" essay to the bundled documentation;
look in the doc/chess directory. Added the TREC2005 options files to the
TREC directory. Fixed some parsing bugs.
There is now a new parser "-e char" which parses single characters. This
isn't useful on its own, but together with the -w switch it allows fast
construction of character n-gram models up to order 7. Note that you could
instead use a series of regular expressions to generate n-grams, but this
way avoids the regex overhead.
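To illustrate what a character n-gram model gets to see, here is a
minimal Python sketch. It is not dbacl's code: the function name
char_ngrams and its max_order parameter are hypothetical, and it assumes
that "up to order 7" means every run of 1 to 7 consecutive characters
is a token.

```python
def char_ngrams(line, max_order=7):
    """Return all character n-grams of order 1..max_order found in line."""
    grams = []
    for start in range(len(line)):
        for order in range(1, max_order + 1):
            # Stop once the n-gram would run past the end of the line.
            if start + order > len(line):
                break
            grams.append(line[start:start + order])
    return grams


print(char_ngrams("spam", max_order=2))
# ['s', 'sp', 'p', 'pa', 'a', 'am', 'm']
```

A plain loop like this does no pattern matching at all, which is the
kind of saving meant by avoiding the regex overhead.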
The signal handling code had been disabled, apparently because of a
typo. It now works as advertised.
The score calculations now do renormalization slightly differently,
and document complexities are now reals rather than integers.
This should be practically unnoticeable for simple models. For
divergences and complexities of n-gram models the change is noticeable,
although the impact is asymptotically minor for large complexity. This
change allows more meaningful direct comparisons between models based on
widely differing tokenization schemes, i.e. in principle it allows
comparing a category which is based solely on alphabetical word tokens
with another category which is based solely on numbers, for
example, even though they don't share similar tokens.
Which is not to say that you should do it. You're safe if you
always learn all your categories with exactly the same set of model
When using the -w switch, complex tokens no longer continue past
the end of a line and onto the next one. This is more consistent
with other switch behaviours, and you can force n-grams to straddle
newlines by using the -S switch.
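The difference between line-bounded and straddling n-grams can be
sketched in Python as follows. This is only an illustration under
simplifying assumptions (whitespace tokenization, n-grams of one fixed
order); the function word_ngrams and its straddle flag are hypothetical
stand-ins for the effect described above, not dbacl's implementation.

```python
def word_ngrams(text, order=2, straddle=False):
    """Return word n-grams of exactly `order` tokens.

    straddle=False builds n-grams within each line separately;
    straddle=True treats the whole text as one token stream.
    """
    chunks = [text] if straddle else text.splitlines()
    grams = []
    for chunk in chunks:
        words = chunk.split()
        for i in range(len(words) - order + 1):
            grams.append(" ".join(words[i:i + order]))
    return grams


text = "buy cheap\npills now"
print(word_ngrams(text))                 # ['buy cheap', 'pills now']
print(word_ngrams(text, straddle=True))  # ['buy cheap', 'cheap pills', 'pills now']
```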
When using the -o and -m switches together, some extra memory mapping
is now performed. This is useful for keeping the mapped pages
invariant for the TREC tests, but doesn't help in speeding up the
In the spamjig run [which performs classify/learn for every input
document], after all pages are locked into place, about 90% of the cpu
time is spent optimizing the weights [by contrast, in ordinary use,
about 70% of the running time is reading and parsing input]. The only
way I can see to improve the cpu bottleneck is to exploit symmetries
and compression techniques. However, this can't be done without
changing the learner hash structure, which must be thought through
carefully [and won't be done soon]. As an added benefit, doing this
correctly should imply much reduced memory requirements during
A new TREC directory contains the necessary scripts and instructions
for running dbacl in the TREC/spam testing framework (spamjig).
The mail body parser was tweaked, so it no longer ignores the preamble
before the first MIME section. This goes against RFC 2046 (p.20)
recommendations, but if a spammer uses it, there's got to be a reason.
So now we also parse the preamble (can be disabled, see
The -0 switch is now on by default. Recall that its purpose is
to prevent weight preloading if the category file already
exists. Weight preloading speeds up the learning operation by starting
with the last known set of weights for the category. It's a nice idea,
but can cause trouble if the old category feature list is much different
from the new feature set to be learned. In particular, if you leave
an old category named "dummy" on your system, and months later you decide
to learn an unrelated category also named "dummy"...
Preloading must now be explicitly enabled with the new -1 switch if
you want to experiment with it.
The -g switch now scans a given regular expression for captures
(parentheses) and, as a convenience, surrounds the expression with a
single capture if none is found. The -g switch is powerful, but hard to
Many unix tools use regular expressions. Such an expression normally
matches a substring in the input, but if it also contains parentheses,
then whatever is inside those parentheses is "captured". So the
expression 'Hello .*' matches the string "Hello Fred", but the
expression 'Hello (.*)' both matches "Hello Fred" and also captures
"Fred". In dbacl, the -g switch lets you construct tokens from
captured expressions, but a corollary is that if you don't supply a
capture expression, then dbacl won't read any tokens at all! As a convenience,
if no parentheses exist, dbacl will now add some. Thus the command line switch
-g 'Hello .*' is converted to -g '(Hello .*)'
but -g 'Hello (.*)' is left untouched.
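The wrapping rule can be mimicked in a few lines of Python. This is a
sketch only: it uses Python's re module rather than the regex library
dbacl is built on, it counts capture groups instead of scanning for
parentheses, and the helper name ensure_capture is hypothetical.

```python
import re


def ensure_capture(pattern):
    """Wrap pattern in a single capture group if it defines none."""
    if re.compile(pattern).groups == 0:
        return "(" + pattern + ")"
    return pattern


print(ensure_capture(r"Hello .*"))    # (Hello .*)
print(ensure_capture(r"Hello (.*)"))  # Hello (.*)

# What ends up captured from the input "Hello Fred" in each case:
print(re.search(ensure_capture(r"Hello .*"), "Hello Fred").group(1))    # Hello Fred
print(re.search(ensure_capture(r"Hello (.*)"), "Hello Fred").group(1))  # Fred
```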
The categories are now "portable" by default, unless the architecture
prevents it. Portable categories are stored in network byte order,
and data is converted on the fly when needed. The switch had been
disabled in version 1.8.1 by mistake.
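As a reminder of what network byte order means in practice, the Python
sketch below stores an integer big-endian and converts it back on
reading. The 32-bit field and the helper names to_portable and
from_portable are assumptions made for illustration; the actual layout
of a dbacl category file is not described here.

```python
import struct


def to_portable(value):
    """Pack an unsigned 32-bit integer in network (big-endian) byte order."""
    return struct.pack("!I", value)


def from_portable(data):
    """Unpack a network byte order 32-bit integer into a native int."""
    return struct.unpack("!I", data)[0]


raw = to_portable(0x01020304)
print(raw.hex())                # 01020304 on every architecture
print(hex(from_portable(raw)))  # 0x1020304
```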
A new command hforge is available. It scans an email header and checks
for signs of forgery.
This is the first version of dbacl which includes a NEWS file.
This was forced at the GNU autotools' insistence, and the author
is not responsible for the contents ;-).
dbacl now makes better use of the autotools, due in no small part
to liberal doses of RTFM. The most important aspect is the new
self-test suite, which can be invoked via make check. Other changes
in dbacl for this release are mainly bugfixes and code cleanup.
The -g and -i switches are now incompatible until further redesign.
The only other user-visible change is a new -m switch, which can speed
up repeated classifications tremendously.