.TH ANNOYANCE-FILTER 1 "19 FEB 2003"
.UC 4
.SH NAME
annoyance-filter \- automatically detect junk mail
.nh
.SH SYNOPSIS
.B annoyance-filter
[
.I options
]
.SH DESCRIPTION
.B annoyance-filter
uses Bayesian statistics to determine the probability that an
E-mail message is junk based on an analysis of its contents
compared to collections of known junk and legitimate E-mail.
.PP
This program is under active development; new versions are
posted frequently at:
.ce 1
http://www.fourmilab.ch/annoyance-filter/
Please visit this page for news about the program and
to download the latest version.
.PP
The project is hosted on SourceForge, where you will find
the CVS source code repository and release archives:
.ce 1
http://sourceforge.net/projects/annoyancefilter/
.SH USAGE
.B annoyance-filter
has a multitude of options which permit it to be used in
many different ways, but the most common application involves
.I training
the program with collections of legitimate and junk mail
in order to create a
.I dictionary
which indicates the probability that words identify a
message as junk or non-junk (legitimate). Training must
be done before the program is used to classify incoming
mail, and need be repeated only when messages are added
to the training collections. As long as the overall content
of the mail you receive, both junk and legitimate, remains
much the same, there is no need to retrain; retraining
does, however, let the program adapt to evolving message
content, which is particularly characteristic of junk mail.
.PP
Suppose you have a collection of legitimate mail (in other
words, mail you wish to read) in a file named
.I m\-good
and a collection of junk mail (that which you don't wish
to read) in file
.IR m\-junk .
Each of these names may refer to a file in ``Unix mail
folder'' format, which is simply the text of one or more
E-mail messages concatenated together in a single text file, or
to a directory containing files, each of which
may be a single E-mail message or a Unix mail folder. In either
case, if a message file is compressed with
.BR gzip ,
it will be automatically uncompressed on the fly. Directories
of messages may not, however, contain other directories of
messages.
.PP
To train
.B annoyance-filter
with these collections and create a dictionary, use a
command like:
.PP
.ce 1
.BI "annoyance-filter \-\-mail " m\-good " \-\-junk " m\-junk " \-\-prune \-\-write " dict.bin
.PP
where
.I dict.bin
is the name of the dictionary file you wish to create.
.PP
Now that the dictionary has been created, you can use it on subsequent
runs to compute the probability a message is junk and classify it
accordingly. Suppose you have an E-mail message in the file
.IR mail.txt .
To compute its junk probability and display it on standard output,
use the command:
.PP
.ce 1
.BI "annoyance-filter \-\-read " dict.bin " \-\-test " mail.txt
.PP
To integrate
.B annoyance-filter
into a mail processing system such as
.BR procmail ,
you'll usually want to run it as a
.I filter
which reads incoming messages from standard input (piped there
by the mail processing system), classifies them and adds annotations
to the message header indicating the classification, then writes the
message with header annotations to standard output. The mail processing
system may then examine the header annotations and route the
message accordingly. To filter a message, again assuming the
dictionary created by the training run is in the file
.IR dict.bin ,
use the command:
.PP
.ce 1
.BI "annoyance-filter \-\-read " dict.bin " \-\-transcript \- \-\-test \-"
.PP
Here the
.B \-\-transcript
option requests that the input message be copied to an
output file, in this case
standard output, specified by
.RB `` \- ''.
The
.RB `` \- ''
argument to the
.B \-\-test
option causes the message to be read from standard input.
.SH OPTIONS
\"#include "annoyance-filter.w" "Options."
.SH "EXIT STATUS"
The program exits with a
status of 0 when processing is successfully completed,
1 when an error (I/O or file access in most cases)
occurs, and 2 to indicate a command line syntax error.
If the
.B \-\-classify
option is specified, an exit status of
0 identifies the message tested as legitimate mail,
3 marks it as junk, and a status of 4 is returned for
messages which cannot be confidently classified as
either mail or junk.
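.PP
As a sketch of how a calling script might act on these status
codes, the following Bourne shell fragment branches on the result
of a
.B \-\-classify
run. It assumes, by analogy with the
.B \-\-test
example in USAGE, that the dictionary is loaded with
.B \-\-read
before the message is classified; the file names
.I dict.bin
and
.I mail.txt
and the echoed messages are purely illustrative.
.PP
.nf
.RS
# Assumed invocation: dictionary in dict.bin, message in mail.txt
annoyance\-filter \-\-read dict.bin \-\-classify mail.txt
case $? in
    0) echo "legitimate mail" ;;
    3) echo "junk" ;;
    4) echo "could not be classified confidently" ;;
    1) echo "error (I/O or file access in most cases)" >&2 ;;
    2) echo "command line syntax error" >&2 ;;
    *) echo "unexpected exit status" >&2 ;;
esac
.RE
.fi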
.SH FILES
Files are read or written as requested by options on the
command line; all options which read or write files
take a
.I fname
argument which gives the file name. The
.BR \-\-classify ,
.BR \-\-junk ,
.BR \-\-mail ,
.BR \-\-test ,
and
.B \-\-transcript
options interpret an argument of
.RB `` \- ''
as denoting standard input or output.
.PP
On systems which
provide the required services and utilities, arguments
to the
.B \-\-junk
and
.B \-\-mail
options may be compressed files or the name of a directory
containing one or more messages which will be read as if
logically concatenated. Messages in the directory may be
compressed or uncompressed.
.PP
Error messages and diagnostic output generated when
the
.B \-\-verbose
option is specified are written to standard error.
.SH BUGS
Millions, doubtless. This is a program which must cope with
whatever garbage is fed to it from mail folders, trying to
make the best of it. When it messes up, your efforts in
identifying the message which caused the problem and
submitting a verbatim copy of it with your bug report
are much appreciated.
.PP
Please report bugs to
.BR bugs @ fourmilab.ch
and include
.B annoyance-filter
in the Subject line. Thanks in advance.
.ne 10
.SH AUTHOR
.ce 2
John Walker
http://www.fourmilab.ch/
.PP
This software is in the public domain. Permission to use, copy,
modify, and distribute this software and its documentation for
any purpose and without fee is hereby granted, without any
conditions or restrictions. This software is provided ``as
is'' without express or implied warranty.
.SH "SEE ALSO"
.BR gnuplot (1),
.BR gs (1),
.BR gzip (1),
.BR netpbm (1),
.BR procmail (1),
.BR xpdf (1)
.PP
.B annoyance-filter
is written using the
.I "Literate Programming"
(http://www.literateprogramming.com/) methodology; the
user manual, program, and internal documentation
are developed together, closely interlinked.
Whenever the program is modified, the documentation is
automatically updated, reducing the risk of divergence
between what the manual says and what the program does.
.PP
This
.B man
page is intended as a reference for the command line
options and most common applications of the program. For
comprehensive documentation, including details of how to
integrate
.B annoyance-filter
with the
.B procmail
mail processing system, please refer to the complete
documentation published in PDF format, available on the
Web at:
.ce 1
http://www.fourmilab.ch/annoyance-filter/annoyance-filter.pdf
.PP
If you have downloaded the
.B annoyance-filter
source distribution, the corresponding version of
.B \%annoyance-filter.pdf
is included in the archive. You can read PDF files
with Acrobat Reader (a free download from
http://www.adobe.com/acrobat/readstep.html) or
the
.B xpdf
or Ghostscript
.RB ( gs )
utilities.