bmf -- Bayesian Mail Filter
About bmf
=========
This is a mail filter which uses the Bayes algorithm as explained in Paul
Graham's article "A Plan for Spam". It aims to be faster, smaller, and more
versatile than similar applications. The implementation is ANSI C and uses
POSIX functions. Supported platforms are (in theory) all POSIX systems.
Support for win32 is undecided.
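As a rough illustration of the combining step described in "A Plan for
Spam", per-token spam probabilities p1..pn can be merged into one score as
prod(p) / (prod(p) + prod(1 - p)). The sketch below is not taken from the
bmf sources: the function name combine_probs is made up for this example,
and bmf's own token selection and thresholds may differ.

```shell
# combine_probs P1 P2 ...
# Hypothetical helper: combine per-token spam probabilities (each strictly
# between 0 and 1) using the product formula from "A Plan for Spam".
combine_probs() {
    printf '%s\n' "$@" | LC_ALL=C awk '
        BEGIN { spam = 1; ham = 1 }
        { spam *= $1; ham *= (1 - $1) }   # running products of p and 1-p
        END { printf "%.4f\n", spam / (spam + ham) }'
}

# Two strongly spammy tokens outweigh one mildly innocent token:
# combine_probs 0.99 0.99 0.20    # prints 0.9996
```

Graham's article applies this to a small number of the most "interesting"
tokens, those whose probabilities are farthest from 0.5.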
This project provides features which are not available in other filters:
(1) Independence from external programs and libraries. Tokens are stored in
memory using simple vectors which require no heavyweight external data
structure libraries. Multiple token database formats are supported,
including flat files, libdb, and mysql. Conversion between formats will
always be possible with the included import/export utility and flat files
will always remain an option.
(2) Efficient processing. Input data is parsed by a handcrafted parser
which weighs in at under 3% of the size of the equivalent code generated by
flex. No portion of the input is ever copied, and all i/o and memory
allocation are done in large chunks. Updated token lists are merged and
written in one step. Hashing is being considered for the next version to
improve lookup speed.
(3) Simple and elegant implementation. No heavyweight, copy-intensive mime
decoding routines are used. Decoding of quoted-printable text for selected
mime types is being considered for the next version.
Note: the core filter function is from esr's bogofilter v0.6 (available at
http://sourceforge.net/projects/bogofilter/) with bugfix updates.
For the most recent version of this software, see:
http://sourceforge.net/projects/bmf/
How to integrate bmf
====================
The following procmail recipes will invoke bmf for each incoming email and
place spam into $MAILDIR/spam. The first sample invokes bmf in its normal
mode of operation and the second invokes bmf as a filter.
### begin sample one ###
# Invoke bmf and use return code to filter spam in one step
:0HB
* ? bmf
| formail -A"X-Spam-Status: Yes, tests=bmf" >>$MAILDIR/spam
### begin sample two ###
# Invoke bmf as a filter
:0 fw
| bmf -p
# Filter spam
:0:
* ^X-Spam-Status: Yes
$MAILDIR/spam
The following maildrop equivalents are suggested by Christian Kurz.
### begin sample one ###
# Invoke bmf and use return code to filter spam in one step
exception {
`bmf`
if ( $RETURNCODE == 0 )
to $MAILDIR/spam
}
### begin sample two ###
# Invoke bmf as a filter
exception {
xfilter "bmf -p"
if (/^X-Spam-Status: Yes/)
to $MAILDIR/spam
}
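The same return-code convention can be tried by hand on a saved message
before touching the delivery setup. The following is a minimal sketch, not
part of bmf: classify_file and the BMF override variable are hypothetical
names, and it assumes only what the recipes above rely on, namely that bmf
reads one message on standard input and exits with status 0 for spam.

```shell
# classify_file FILE
# Runs bmf in its normal mode on one message and prints a verdict.
# Assumption (taken from the recipes above): exit status 0 means spam.
# Any other status is reported as non-spam here, which also covers errors.
# BMF is a hypothetical override for the program name, handy for testing.
classify_file() {
    if "${BMF:-bmf}" < "$1"; then
        echo spam
    else
        echo non-spam
    fi
}

# Example (the path is an assumption):
# classify_file "$HOME/saved-message.txt"
```

Keep in mind that in its normal mode bmf also registers the message, so
running this repeatedly on the same file will skew the word counts.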
If you put bmf in your procmail or maildrop scripts as suggested above, it
will always register an email as either spam or non-spam. To reverse this
registration and train bmf, the following mutt macros may be useful:
macro index \ed "<enter-command>unset wait_key\n<pipe-entry>bmf -S\n<enter-command>set wait_key\n<save-message>=spam\n"
macro index \et "<enter-command>unset wait_key\n<pipe-entry>bmf -t\n<enter-command>set wait_key\n"
macro index \eu "<enter-command>unset wait_key\n<pipe-entry>bmf -N\n<enter-command>set wait_key\n<save-message>=inbox\n"
These macros override any existing bindings for the following keys:
<Esc>d = de-register as non-spam, register as spam, and move to spam folder.
<Esc>t = test for spamicity.
<Esc>u = de-register as spam, register as non-spam, and move to inbox folder.
How to train bmf
================
First, please keep in mind that bmf "learns" how to recognize spam from the
input that you give it. It works best if you give it exactly the email that
you receive, or have received in the recent past.
Here are some good techniques for training bmf:
- If you keep a history of email that you have received, use your current
and/or saved emails. It is fairly easy to create a small shell script
that will pass all of your normal email to "bmf -n" and all of your spam
to "bmf -s". Note that if you do not use the mbox storage format, you
MUST invoke bmf exactly once per email. Using "cat * | bmf -n" will NOT
work properly because bmf sees the entire input as one big email.
- If you already use spamassassin, you can use it to train bmf for a
couple of days or weeks. If spamassassin tags it as spam, run it
through "bmf -s". If not, run it through "bmf -n". This can be
automated with procmail or maildrop recipes.
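For mail stored one message per file, the "small shell script" mentioned
above might look like the following sketch. The function name train_dir,
the BMF override variable, and the directory paths are assumptions made for
illustration; only the "-n" and "-s" options come from this document.

```shell
# train_dir FLAG DIR
# Hypothetical helper: pass every regular file in DIR to bmf, one
# invocation per message, and print how many messages were processed.
# FLAG is -n for normal email or -s for spam.
# bmf's exit status is ignored because this document does not say what it
# means in registration mode.
train_dir() {
    flag="$1"
    dir="$2"
    count=0
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue              # skip subdirectories and empty globs
        "${BMF:-bmf}" "$flag" < "$msg" >/dev/null
        count=$((count + 1))
    done
    echo "$count"
}

# Example usage (paths are assumptions):
# train_dir -n "$HOME/Maildir/cur"
# train_dir -s "$HOME/Maildir/.spam/cur"
```

Each message gets its own bmf process, which avoids the "cat * | bmf -n"
problem described above. A folder in mbox format can instead be fed to bmf
as a single file.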
Here are some things that you should NOT do:
- Get impatient with the training process and repeatedly pass one email
through "bmf -s".
- Manually move words around between lists and/or adjust the word counts.
Final words
===========
Thanks for trying bmf. If you have any problems, comments, or suggestions,
please direct them to the bmf mailing list, bmf-user@lists.sourceforge.net.
Tom Marshall
20 Oct 2002