File: annoyance-filter.manm

.TH ANNOYANCE-FILTER 1 "19 FEB 2003"
.UC 4
.SH NAME
annoyance-filter \- automatically detect junk mail
.nh
.SH SYNOPSIS
.B annoyance-filter
[
.I options
]
.SH DESCRIPTION
.B annoyance-filter
uses Bayesian statistics to determine the probability that an
E-mail message is junk, based on an analysis of its contents
compared to collections of known junk and legitimate E-mail.
.PP
This program is under active development; new versions are
posted frequently at:
.ce 1
http://www.fourmilab.ch/annoyance-filter/
Please visit this page for news about the program and
to download the latest version.
.PP
The project is hosted on SourceForge, where you will find
the CVS source code repository and release archives:
.ce 1
http://sourceforge.net/projects/annoyancefilter/
.SH USAGE
.B annoyance-filter
has a multitude of options which permit it to be used in
many different ways, but the most common application involves
.I training
the program with collections of legitimate and junk mail
in order to create a
.I dictionary
which indicates the probability that words identify a
message as junk or non-junk (legitimate).  Training must
be done before the program is used to classify incoming
mail, but need be done subsequently only when adding
messages to the training collections.  As long as
the overall content of the mail you receive, both junk and
legitimate, remains largely the same, there is no
need to retrain.  Retraining, however, allows the program
to adapt to evolving message content, which is
particularly characteristic of junk mail.
.PP
Suppose you have a collection of legitimate mail (in other
words, mail you wish to read) in a file named
.I m\-good
and a collection of junk mail (that which you don't wish
to read) in file
.IR m\-junk .
Each collection may be a file in ``Unix mail
folder'' format, which is simply the text of one or more
E-mail messages concatenated together in a single text file, or
a directory containing files, each of which
may be a single E-mail message or a Unix mail folder.  In either
case, if a message file is compressed with
.BR gzip ,
it will be automatically uncompressed on the fly.  Directories
of messages may not, however, contain other directories of
messages.
.PP
To train
.B annoyance-filter
with these collections and create a dictionary, use a
command like:
.PP
.ce 1
.BI "annoyance-filter \-\-mail " m-good " \-\-junk " m-junk " \-\-prune \-\-write " dict.bin
.PP
where
.I dict.bin
is the name of the dictionary file you wish to create.
.PP
Now that the dictionary has been created, you can use it on subsequent
runs to compute the probability a message is junk and classify it
accordingly.  Suppose you have an E-mail message in the file
.IR mail.txt .
To compute its junk probability and display it on standard output,
use the command:
.PP
.ce 1
.BI "annoyance-filter \-\-read " dict.bin " \-\-test " mail.txt
.PP
To integrate
.B annoyance-filter
into a mail processing system such as
.BR procmail ,
you'll usually want to run it as a
.I filter
which reads incoming messages from standard input (piped there
by the mail processing system), classifies them and adds annotations
to the message header indicating the classification, then writes the
message with header annotations to standard output.  The mail processing
system may then examine the header annotations and route the
message accordingly.  To filter a message, again assuming the
dictionary created by the training run is in the file
.IR dict.bin ,
use the command:
.PP
.ce 1
.BI "annoyance-filter \-\-read " dict.bin " \-\-transcript \- \-\-test \-"
.PP
Here the
.B \-\-transcript
option requests that the input message be copied to an
output file, in this case
standard output, specified by
.RB `` \- ''.
The message is read from standard input, as indicated by the
.RB `` \- ''
argument to the
.B \-\-test
option.
.SH OPTIONS
\"#include "annoyance-filter.w" "Options."
.SH "EXIT STATUS"
The program exits with a
status of 0 when processing is successfully completed,
1 when an error (I/O or file access in most cases)
occurs, and 2 to indicate a command line syntax error.
If the
.B \-\-classify
option is specified, an exit status of
0 identifies the message tested as legitimate mail,
3 marks it as junk, and a status of 4 is returned for
messages which cannot be confidently classified as
either mail or junk.
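.PP
A wrapper script can branch on these status codes.  The following is
a minimal sketch, not part of the program: the function name
.I status_label
and the labels it prints are illustrative assumptions, and the
commented usage lines assume a dictionary
.I dict.bin
and a message
.I mail.txt
as described under USAGE.
.PP
.nf
```shell
#!/bin/sh
# Map an annoyance-filter exit status, as documented in EXIT STATUS
# for a run with --classify, to a short label.
# The function name and labels are hypothetical, chosen for illustration.
status_label() {
    case "$1" in
        0) echo "mail" ;;     # classified as legitimate mail
        1) echo "error" ;;    # I/O or file access error
        2) echo "usage" ;;    # command line syntax error
        3) echo "junk" ;;     # classified as junk
        4) echo "unsure" ;;   # could not be confidently classified
        *) echo "unknown" ;;  # status not documented in this page
    esac
}

# Hypothetical usage, assuming dict.bin and mail.txt exist:
#   annoyance-filter --read dict.bin --classify mail.txt
#   status_label "$?"
```
.fi
.PP
Keeping the mapping in a single function means the calling script
handles the ``unsure'' case explicitly instead of treating every
non-zero status as junk.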
.SH FILES
Files are read or written as requested by options on the
command line; all options which read or write files
take a
.I fname
argument which gives the file name.  The
.BR \-\-classify ,
.BR \-\-junk ,
.BR \-\-mail ,
.BR \-\-test ,
and
.B \-\-transcript
options interpret an argument of
.RB `` \- ''
as denoting standard input or output.
.PP
On systems which
provide the required services and utilities, arguments
to the
.B \-\-junk
and
.B \-\-mail 
options may be compressed files or the names of directories
containing one or more messages, which are read as if
logically concatenated.  Messages in a directory may be
compressed or uncompressed.
.PP
Error messages and diagnostic output generated when
the
.B \-\-verbose
option is specified are written to standard error.
.SH BUGS
Millions, doubtless.  This is a program which must cope with
whatever garbage is fed to it from mail folders, trying to
make the best of it.  When it messes up, your efforts in
identifying the message which caused the problem and
submitting a verbatim copy of it with your bug report
are much appreciated.
.PP
Please report bugs to
.BR bugs @ fourmilab.ch
and include
.B annoyance-filter
in the Subject line.  Thanks in advance.
.ne 10
.SH AUTHOR
.ce 2
John Walker
http://www.fourmilab.ch/
.PP
This software is in the public domain. Permission to use, copy,
modify, and distribute this software and its documentation for
any purpose and without fee is hereby granted, without any
conditions or restrictions.  This software is provided ``as
is'' without express or implied warranty.
.SH "SEE ALSO"
.BR gnuplot (1),
.BR gs (1),
.BR gzip (1),
.BR netpbm (1),
.BR procmail (1),
.BR xpdf (1)
.PP
.B annoyance-filter
is written using the
.I "Literate Programming"
methodology (http://www.literateprogramming.com/); the
user manual, program, and internal documentation
are developed together, closely interlinked.
Whenever the program is modified, the documentation is
automatically updated, reducing the risk of divergence
between what the manual says and what the program does.
.PP
This
.B man
page is intended as a reference for the command line
options and most common applications of the program.  For
comprehensive documentation, including details of how to
integrate
.B annoyance-filter
with the
.B procmail
mail processing system, please refer to the complete
documentation published in PDF format, available on the
Web at:
.ce 1
http://www.fourmilab.ch/annoyance-filter/annoyance-filter.pdf
.PP
If you have downloaded the
.B annoyance-filter
source distribution, the corresponding version of
.B \%annoyance-filter.pdf
is included in the archive.  You can read PDF files
with Acrobat Reader (a free download from
http://www.adobe.com/acrobat/readstep.html) or
the
.B xpdf
or Ghostscript
.RB ( gs )
utilities.