File: README.Debian

fuzzyocr for Debian
-------------------

--- config file

The main config file is installed as /etc/spamassassin/FuzzyOcr.cf.real

While the package is installed, there is
a symlink  FuzzyOcr.cf -> FuzzyOcr.cf.real
When the package is removed but not purged, the symlink is
absent, so spamassassin does not try to initialize the plugin.
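
The symlink arrangement above can be checked with a few lines of
Python.  This is only a sketch: it assumes the symlink lives next to
FuzzyOcr.cf.real in /etc/spamassassin, and the function name
plugin_config_active is made up for illustration.

```python
import os


def plugin_config_active(conf_dir="/etc/spamassassin"):
    """Return True if FuzzyOcr.cf is a symlink whose target exists.

    Assumption: the symlink sits in the same directory as
    FuzzyOcr.cf.real (the default path is taken from this README).
    """
    link = os.path.join(conf_dir, "FuzzyOcr.cf")
    # islink() inspects the link itself; exists() follows it to the target.
    return os.path.islink(link) and os.path.exists(link)
```

If this returns False, the package is presumably removed (or purged)
and spamassassin will not find a FuzzyOcr config to load.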

--- spamc/spamd

In the main config file, the settings for focr_logfile and
focr_digest_db do not make sense when a user is using spamc/spamd (as
I do).  Both are therefore currently disabled, so that FuzzyOcr works
out-of-the-box with spamc/spamd.

It is still possible, though, for a user to enable those features;
for example, I added the following to /home/debdev/.spamassassin/user_prefs:
 focr_verbose 2
 focr_logfile /home/debdev/var/FuzzyOcr.log
 focr_enable_image_hashing 1
 focr_digest_db /home/debdev/var/FuzzyOcr.hashdb
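
To see which focr_ options a given user_prefs file sets, a small
parser along these lines may help.  It is a simplified sketch, not
SpamAssassin's own config parser: it assumes one "name value" pair
per line and treats "#" as the start of a comment.  The function name
read_focr_settings is hypothetical.

```python
def read_focr_settings(text):
    """Collect 'focr_*' settings from user_prefs-style text.

    Simplified assumption: one 'name value' pair per line, with '#'
    starting a comment.  Later lines override earlier ones.
    """
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        parts = line.split(None, 1)  # split name from the rest of the line
        if parts[0].startswith("focr_"):
            settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings
```

Feeding it the four lines above would yield a dictionary with the
keys focr_verbose, focr_logfile, focr_enable_image_hashing and
focr_digest_db.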


 -- A Mennucc1 <mennucc1@debian.org>, Sun, 28 Sep 2008 09:26:50 +0200

The following is an upstream introduction to FuzzyOcr:

FuzzyOcr is a plugin for SpamAssassin which is aimed at unsolicited
bulk mail (also known as "Spam") containing images as the main content
carrier. Using different methods, it analyzes the content and
properties of images to distinguish between normal mails (Ham) and
spam mails. The main methods are:

    * Optical Character Recognition using different engines and settings
    * Fuzzy word matching algorithm applied to OCR results
    * Image hashing system to learn unique properties of known spam images
    * Dimension, size and integrity checking of images
    * Content-Type verification for the containing email 
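
To illustrate the idea of fuzzy word matching on OCR output, here is
a minimal sketch in Python.  It is not FuzzyOcr's actual algorithm
(the plugin itself is written in Perl); it uses plain Levenshtein
edit distance as a stand-in, and the names edit_distance and
fuzzy_contains as well as the 0.25 ratio are assumptions chosen for
the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_contains(ocr_text, word, max_ratio=0.25):
    """True if some token of ocr_text is within max_ratio edits per
    character of word.  max_ratio is a hypothetical tolerance."""
    word = word.lower()
    if not word:
        return False
    return any(edit_distance(token, word) / len(word) <= max_ratio
               for token in ocr_text.lower().split())
```

With these definitions, an OCR result such as "buy v1agra now" still
matches the word "viagra", because a single misread character stays
within the tolerance.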

For a brief description of features, resource aspects and scalability,
see the detailed list below:

    * Matching and learning techniques
          o Flexible Optical Character Recognition interface
                + Official support for gocr and ocrad
                + Generic support for Tesseract and other upcoming engines
                    (planned for 3.5) 
          o Fuzzy word matching algorithm applied to OCR results
          o Recognition of duplicate (already processed) or similar images
               using feature vectors (Hashing)
                + Efficient MLDBM database
                + MySQL support (planned for 3.5) 
          o Dimension, size and integrity checking
          o Content-Type checking of containing email 

    * Resource saving techniques
          o Only scan mails which were not yet recognized as Ham or Spam
               by other SpamAssassin rules or plugins (using score thresholds)
          o Optionally skip the remaining scanning facilities once one of
                 them already scores above a given threshold (planned for 3.5)
          o Mail skipping based on direct feature analysis
                 (Dimensions and file size)  (planned for 3.5) 

    * Safety measures
          o Configurable timeout to guard against Denial of Service attacks
              on the third-party tools
          o Context-based word sets instead of simple lists to prevent
              false positives (planned for 3.5)
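
The timeout safety measure can be pictured with the following Python
sketch.  It only shows the general technique of bounding the run time
of an external tool; it is not how the plugin itself implements it,
and the name run_with_timeout and the 10 second default are
assumptions for the example.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=10.0):
    """Run an external command and return its stdout as bytes.

    Returns None if the command does not finish within timeout_s
    seconds; subprocess.run() kills the child in that case.
    """
    try:
        result = subprocess.run(cmd,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.DEVNULL,
                                timeout=timeout_s,
                                check=False)
    except subprocess.TimeoutExpired:
        return None
    return result.stdout
```

A caller would pass the command line of the OCR tool as a list and
treat a None result as "no text recognized", so that a crafted image
which stalls the tool cannot block mail scanning indefinitely.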