SylFilter - a message filter
Copyright (C) 2011-2013 Hiroyuki Yamamoto <hiro-y@kcn.ne.jp>
Copyright (C) 2011-2013 Sylpheed Development Team
About This Program
==================
SylFilter is a generic message filter library with command-line tools.
It provides a Bayesian filter, a popular algorithm for spam filtering.
SylFilter is also internationalized and can be applied to any language.

The SylFilter library provides a simple but powerful C API and can be used
from C programs.

The SylFilter command-line tool can be used as a junk filter program, like
other major tools such as bogofilter and bsfilter.

SylFilter is free software distributed under a BSD-like license.
See COPYING for details.
Install
=======
This program requires GLib and a key-value store engine. Install them before
building. Currently SQLite (enabled by default), QDBM and GDBM are supported
as key-value store engines.
$ ./configure
( $ ./configure --disable-sqlite --enable-qdbm (enables QDBM) )
( $ ./configure --disable-sqlite --enable-gdbm (enables GDBM) )
$ make
$ sudo make install
By default, a built-in subset of LibSylph is used for message parsing.
To use a LibSylph installed on your system, specify the --with-libsylph
option.

./configure --with-libsylph=builtin      use built-in LibSylph (default)
./configure --with-libsylph=standalone   use standalone version of LibSylph
./configure --with-libsylph=sylpheed     use Sylpheed's LibSylph

If LibSylph is installed in a non-standard location, also use the
--with-libsylph-dir option.
Usage
=====
SylFilter accepts RFC 822 message files (for example: MH, Maildir, eml).

Learn junk mail:
$ sylfilter -j ~/Mail/junk/*

Learn clean mail:
$ sylfilter -c ~/Mail/clean/*

Classify a message:
$ sylfilter ~/Mail/inbox/1234

Show learning status:
$ sylfilter -s

Show learning status and all learned tokens:
$ sylfilter -s -v

Show the help message:
$ sylfilter -h
$ sylfilter --help
Usage with Sylpheed
===================
In 'Common preferences... - Junk mail - Learning command:', set each command
manually as follows:
Junk : sylfilter -j
Not Junk : sylfilter -c
Classifying command : sylfilter
Other information
=================
Token database files are created under ~/.sylfilter/ .
(On Windows: %APPDATA%\SylFilter\)
Library Design
==============
The filtering of SylFilter consists of a chain of simple filter modules.

  (Learning)                      (Classifying)

  rfc822 message                  rfc822 message
        |                               |
  [ text content filter ]         [ text content filter ]
        |                               |
  [ word separator filter ]       [ blacklist filter ] --> spam
        |                               |
  [ n-gram filter ]               [ word separator filter ]
        |                               |
  [ learning filter ]             [ n-gram filter ]
                                        |
                                  [ bayesian filter ] --> spam
                                        |
                                     non-spam
Library users can combine the provided filters in any order and can also add
their own custom filters.
See the source of src/sylfilter.c for library usage.
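
As a rough illustration of the chained design described in this section, the
sketch below runs a message through a table of function pointers until one
filter reaches a verdict. Every name in it (FilterStatus, FilterFunc,
run_chain and the two toy filters) is hypothetical and is not part of the
SylFilter API; src/sylfilter.c remains the reference for the real interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout: this is NOT the SylFilter API. */

typedef enum {
    FILTER_NEXT,     /* pass the message on to the next filter */
    FILTER_SPAM,     /* stop: classified as spam */
    FILTER_NONSPAM   /* stop: classified as non-spam */
} FilterStatus;

/* A filter inspects (and may rewrite) a NUL-terminated text buffer. */
typedef FilterStatus (*FilterFunc)(char *text);

/* Run the filters in order until one of them reaches a verdict. */
FilterStatus run_chain(const FilterFunc *chain, size_t n_filters, char *text)
{
    assert(chain != NULL && text != NULL);
    for (size_t i = 0; i < n_filters; i++) {
        FilterStatus st = chain[i](text);
        if (st != FILTER_NEXT)
            return st;
    }
    /* No filter objected, so treat the message as non-spam. */
    return FILTER_NONSPAM;
}

/* Toy normalizing filter: lowercases ASCII letters in place. */
FilterStatus lowercase_filter(char *text)
{
    for (char *p = text; *p != '\0'; p++)
        if (*p >= 'A' && *p <= 'Z')
            *p = (char)(*p - 'A' + 'a');
    return FILTER_NEXT;
}

/* Toy blacklist filter: flags text containing a fixed phrase. */
FilterStatus blacklist_filter(char *text)
{
    return strstr(text, "buy now") != NULL ? FILTER_SPAM : FILTER_NEXT;
}
```

A caller would build an array such as { lowercase_filter, blacklist_filter }
and pass it to run_chain. Keeping each stage behind the same function
signature is what allows stages to be reordered or replaced independently.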
Algorithm of Bayesian Filter
============================
SylFilter implements Fisher's method as described by Gary Robinson.
The same method is also implemented by bogofilter and bsfilter.
http://radio-weblogs.com/0101454/stories/2002/09/16/spamDetection.html
http://www.bgl.nu/bogofilter/fisher.html
SylFilter initially implemented a customized version of the algorithm
described by Paul Graham.
http://paulgraham.com/spam.html
http://paulgraham.com/better.html
The Robinson-Fisher method is used by default.

Basically, the algorithm can be described as follows:

1. Count the number of occurrences of each word in spam and non-spam
   messages.
2. For each word in a message, calculate the probability that a message
   containing it is spam.
3. Calculate the combined probability using the important words in the
   message.

See the above Web pages for details.