fuzzyocr for Debian
-------------------
--- config file
The main config file is installed as /etc/spamassassin/FuzzyOcr.cf.real
While the package is installed, there is
a symlink FuzzyOcr.cf -> FuzzyOcr.cf.real
When the package is removed but not purged, the symlink is absent,
so spamassassin does not try to initialize the plugin.
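As a rough illustration of the mechanism above, the following Python
sketch checks whether the symlink is present and resolves to a real
file. The function name plugin_config_active is made up for this
example, and it assumes the symlink lives next to FuzzyOcr.cf.real in
/etc/spamassassin; it is not part of the package.

```python
import os

def plugin_config_active(conf_dir="/etc/spamassassin"):
    """Return True if FuzzyOcr.cf is a symlink that resolves to an existing file."""
    link = os.path.join(conf_dir, "FuzzyOcr.cf")
    # islink() checks the link itself; exists() follows it, so a
    # dangling symlink counts as inactive.
    return os.path.islink(link) and os.path.exists(link)

if __name__ == "__main__":
    state = "active" if plugin_config_active() else "not active"
    print("FuzzyOcr config is " + state)
```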
--- spamc/spamd
In the main config file, the settings focr_logfile and
focr_digest_db do not make sense when a user is running spamc/spamd (as
I do), so both are currently disabled. This way, FuzzyOcr works
out of the box with spamc/spamd.
A user can still enable those features, though;
for example, I added the following to /home/debdev/.spamassassin/user_prefs:
focr_verbose 2
focr_logfile /home/debdev/var/FuzzyOcr.log
focr_enable_image_hashing 1
focr_digest_db /home/debdev/var/FuzzyOcr.hashdb
-- A Mennucc1 <mennucc1@debian.org>, Sun, 28 Sep 2008 09:26:50 +0200
The following is an upstream introduction to FuzzyOcr:
FuzzyOcr is a plugin for SpamAssassin which is aimed at unsolicited
bulk mail (also known as "Spam") containing images as the main content
carrier. Using different methods, it analyzes the content and
properties of images to distinguish between normal mails (Ham) and
spam mails. The main methods are:
* Optical Character Recognition using different engines and settings
* Fuzzy word matching algorithm applied to OCR results
* Image hashing system to learn unique properties of known spam images
* Dimension, size and integrity checking of images
* Content-Type verification for the containing email
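To give an idea of what fuzzy word matching on OCR output means, here
is a minimal Python sketch. It is only an illustration and not the
plugin's own algorithm (FuzzyOcr is written in Perl); the function name
fuzzy_hits, the similarity measure from Python's difflib, and the 0.75
threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher

def fuzzy_hits(ocr_text, wordlist, threshold=0.75):
    """Return the words from wordlist that approximately occur in ocr_text."""
    tokens = ocr_text.lower().split()
    hits = []
    for word in wordlist:
        for token in tokens:
            # ratio() is 1.0 for identical strings and drops as they diverge,
            # so OCR errors such as "v1agra" still score close to "viagra".
            if SequenceMatcher(None, word, token).ratio() >= threshold:
                hits.append(word)
                break  # one approximate occurrence per word is enough
    return hits

if __name__ == "__main__":
    print(fuzzy_hits("Buy che4p v1agra n0w", ["viagra", "cheap", "mortgage"]))
```

A lower threshold tolerates more OCR noise but also raises the risk of
false positives on ordinary words.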
For a brief description of features, resource aspects and scalability,
see the detailed list below:
* Matching and learning techniques
  o Flexible Optical Character Recognition interface
    + Official support for gocr and ocrad
    + Generic support for Tesseract and other upcoming engines
      (planned for 3.5)
  o Fuzzy word matching algorithm applied to OCR results
  o Recognition of duplicate (already processed) or similar images
    using feature vectors (hashing)
    + Efficient MLDBM database
    + MySQL support (planned for 3.5)
  o Dimension, size and integrity checking
  o Content-Type checking of the containing email
* Resource saving techniques
  o Only scan mails which were not yet recognized as Ham or Spam
    by other SpamAssassin rules or plugins (using score thresholds)
  o Optional skipping of the remaining scanning facilities once one
    has already reached a given threshold (planned for 3.5)
  o Mail skipping based on direct feature analysis
    (dimensions and file size) (planned for 3.5)
* Safety measures
  o Configurable timeout against Denial of Service attacks on
    the third-party tools
  o Context-based word sets instead of simple lists to prevent
    false positives (planned for 3.5)
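
The resource saving and safety ideas in the list above (score
thresholds and a timeout around third-party tools) can be sketched in
Python as follows. This is a hedged illustration, not FuzzyOcr's code:
the names should_scan and run_tool and the threshold values are
hypothetical, and the command run in the example is a stand-in for an
OCR engine.

```python
import subprocess
import sys

def should_scan(current_score, ham_below=-5.0, spam_above=10.0):
    """Scan only when other rules have not already settled the verdict.

    The two limits are illustrative values, not package defaults.
    """
    return ham_below < current_score < spam_above

def run_tool(cmd, timeout_s):
    """Run an external tool; return its stdout, or None if it exceeds timeout_s."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so a crafted
        # image cannot keep the tool running indefinitely.
        return None
    return done.stdout

if __name__ == "__main__":
    if should_scan(3.2):
        # Stand-in command; a real setup would call an OCR engine here.
        print(run_tool([sys.executable, "-c", "print('recognized text')"], 10))
```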