1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199
|
DBACL - digramic Bayesian classifier
PURPOSE
dbacl is a command line program which can be used to categorize
several types of text documents. Each document category is
constructed as a maximum entropy language model, with respect to
a reference measure based on digrams (character pairs).
Before recognition can take place, a number of text corpora must
be "learned". For example, an English category could be based on
a text file containing the collected works of Shakespeare. The
Gutenberg project (http://promo.net/pg/) makes freely available
many public domain works in electronic form.
After learning, any number of text files can be compared, in terms
of Bayesian posterior probabilities, with up to 128 learned categories.
The actual number of categories is limited only by available memory, and
can be changed by editing dbacl.h and recompiling.
Last but not least, dbacl also has special support for email message
formats, which permits its use as an email filter. Explanations and
instructions are in the man pages and bundled tutorials.
dbacl is bundled with a few other utilities:
- bayesol is a postprocessor which takes the dbacl output and computes
an optimal decision based on costs of misclassification. Together with
dbacl, this allows the construction of sophisticated, multilingual,
classification scripts, if you're not afraid of some shell scripting.
- mailcross performs email classification cross validation. It can be used
to assess the performance of custom email classification scripts based on
dbacl and bayesol. It also has a testsuite functionality to compare dbacl
with other open source email filters on your own email collections (this
functionality existed separately in the 1.5 series, but has been merged now).
At time of writing, 11 different mail classifiers can be compared.
- mailtoe performs train on error (TOE) simulation. This command is virtually
identical to mailcross. In particular, it allows TOE comparisons between
dbacl and other email filters.
- mailfoot performs fully ordered online training simulation. It functions
like mailcross and mailtoe.
- mailinspect reads an mbox style mail folder and displays the emails in sorted
order, based on similarity to any given category.
DOCUMENTATION
See the bundled manpages. Generic instructions can be found in the
file INSTALL. A tutorial is to be found in the file tutorial.html,
and an exposition of the algorithms is in dbacl.ps. An alternative
tutorial which is specialized for email classification is in the file
email.html. All these files are in the doc directory, but should be
installed automatically with the programs.
LICENSE
With the exceptions noted below, DBACL is distributed under the terms
of the GNU General Public License (GPL) which can be found in the file
COPYING. Some source files have free public licenses, by Bob Jenkins (hash
functions) and Stephen L. Moshier (cephes functions).
The sample text files are fair use quotes from various sources, and not
covered by the GPL. To prevent statistical bias, the samples cannot be
marked directly with the author's name:
- sample1.txt, sample3 and sampe5.txt are in the public domain, by Mark Twain.
- sample2.txt, sample4.txt are in the public domain, by Aristotle.
- sample6.txt is a forwarded email, copyright unknown.
- japanese.txt is a Japanese translation of the GNU General Public License,
copyright by the Free Software Foundation.
BUILDING
There are several configuration options you can change in the file
dbacl.h, if you want to increase the maximum number of categories or
optimize hash table overhead.
To build and install the program, you can execute the following steps
from within the source DBACL directory:
./configure
make
make check
make install
The last part should be executed with superuser privileges for system
wide installation. Alternatively
./configure --prefix=/home/xyzzy
make
make check
make install
builds and installs in user xyzzy's home directory, without the need for
root privileges. In this case, the following environment variables
should be set permanently (e.g. in the file .profile):
PATH=$PATH:/home/xyzzy/bin
MANPATH=$MANPATH:/home/xyzzy/man
INTERNATIONALIZATION
dbacl uses the current locale for processing. 8-bit clean multibyte
character sets (such as UTF-8) are supported in the default mode, and
arbitrary multibyte character sets require the -i command line
option.
The following remarks only apply to previous versions of dbacl.
In version 1.8.1 and later, regexes are only available in multibyte
code. This is to discard the added complexity of dealing with wide
character regexes - wide regexes will be reintroduced in the future,
once a clean way of doing so is found.
<section obsolete>
If you intend to use the -i option together with regular
expressions, you must build with a wide character POSIX regex library:
ensure that the BOOST library is present on the system and type
./configure WIDE_REGEX=1
make
make install
Warnings:
1) there is a large performance penalty if you build dbacl this way,
which shows up whenever you use regular expressions. Only build this
way if you need correct regular expressions in a multibyte environment
which isn't 8-bit clean.
</section obsolete>
2) On some systems (e.g. BSD family), wide character functions are not
supported. The configure script will detect this and disable full
internationalization in this case. As a side effect, HTML entities
won't be converted into characters, which breaks some regression tests
when rinning make check. This is harmless.
3) The mailinspect command uses the slang library for interactive
mode. If the headers cannot be found in /usr/include or
/usr/include/slang, then compilation will skip interactive mode.
OTHER DEPENDENCIES
The main filter programs dbacl and bayesol have no special
dependencies, and can always be compiled.
mailinspect uses the readline and slang libraries for screen
management in interactive mode. The configure script will check for
these libraries and if it can't find them, mailinspect will be
compiled without interactive support.
mail{cross,toe,foot} are bash shell scripts which call awk and formail
at various points. They will test for the existence of these programs
in your path and refuse to run if it can't find them.
The testsuite functionality is partially implemented by the above
scripts, and partially by wrapper scripts for the various email
classifiers used. It is imperative that the scripts follow the
interface described in the mailcross.testsuite man page.
RUNNING
There is a tutorial which you can read with any web browser, point it to the
file tutorial.html. For command line options and examples of possible use,
type after installation:
man dbacl
man bayesol
man mailcross
man mailinspect
man mailtoe
man mailfoot
You can also find a technical description of the algorithms and statistics
in the postscript files dbacl.ps and costs.ps
UPGRADING FROM PREVIOUS VERSIONS
For simplicity, dbacl's category files are version coded. If you have
categories created by a previous version, simply relearn them with the
current version. Every time a category is relearned, the file is
rewritten from scratch in the correct format.
The mailcross.testsuite command from version 1.5 has been removed, but the
interface is unchanged. Any command which used to be of the form
"mailcross.testsuite xxx yyy" is now of the form "mailcross testsuite xxx yyy".
TUTORIAL SAMPLES
The tutorial.html document comes with several sample text files:
- sample{1,3,5}.txt are extracts from Mark Twain, Huckleberry Finn
- sample{2,4}.txt are extracts from Aristotle's Rhetoric.
AUTHOR
Laird A. Breyer <laird@lbreyer.com>
|