File: markov.txt

To implement Markovian weighting, the following pieces must be configured:

1. The storage driver. Be sure to compile with Bill Yerazunis' CRM114
   Sparse Spectra driver (hash_drv). This is presently the only driver
   fast enough to handle the extra data generated by the tokenizer used.

   NOTE: If you plan on doing TEFT- or TUM-type training, you'll need a huge
     database. In dspam.conf, HashRecMax should be set to around 5000000
     with a HashExtentSize of around 1000000. If you run into performance
     issues, consider increasing these values or running csscompress after
     training.

   NOTE: Bill has told me that TOE yields the best results on real-world
     email; for initial training, however, TEFT or a TUNE approach might
     be best.
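
   As a concrete sketch, the storage pieces above might look like this in
   dspam.conf. The driver path is an assumption for a typical install (check
   where your build puts libhash_drv); the sizes are the ones suggested in
   the note above:

```
# Sparse Spectra (hash) storage driver -- path may differ on your system
StorageDriver /usr/lib/dspam/libhash_drv.so

# Large hash database for TEFT/TUM-style training
HashRecMax     5000000
HashExtentSize 1000000
```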

2. The tokenizer. This is set via 'Features' and defaults to 'chained'.
   Comment out chained and use 'sbph' instead. This implements Bill
   Yerazunis' (CRM114) Sparse Binary Polynomial Hashing tokenizer, which is
   used in Markovian weighting.

3. The value computing algorithm. This should be set to 'markov' which uses
   Markovian weighting. Comment out graham.

4. The combination algorithm (Algorithm). This should be set to 'naive' to
   act like CRM114, or you may consider 'burton', which gave me slightly
   better results. Comment out any existing algorithms.
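
   Steps 2-4 together amount to a handful of dspam.conf lines. A sketch,
   assuming the directive spellings used here ('Feature', 'PValue',
   'Algorithm'); verify against the dspam.conf shipped with your build:

```
# 2. Tokenizer: SBPH instead of the default chained tokens
#Feature chained
Feature sbph

# 3. Value computing algorithm: Markovian weighting, not graham
#PValue graham
PValue markov

# 4. Combination algorithm: naive (CRM114-like); 'burton' is worth trying
#Algorithm graham
Algorithm naive
```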

5. Training. Do not corpus train. Rather, consider using something like 
   the included train.pl example script. You can emulate TUNE training if
   you really want to by setting your training mode to TOE and re-running
   the script until no errors are generated.
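
   The retrain-until-clean loop described above can be sketched as a small
   shell wrapper. The train.pl invocation and its output format are
   assumptions (left as a placeholder here); adapt them to the actual
   script's interface:

```shell
#!/bin/sh
# TUNE emulation: with TrainingMode set to TOE, re-run the training
# script until a full pass produces zero misclassifications.
pass=0
while :; do
    pass=$((pass + 1))
    # Replace the next line with the real training run, e.g. something like:
    #   errors=$(perl train.pl spam_corpus/ nonspam_corpus/ | grep -c MISS)
    errors=0
    echo "pass $pass: $errors errors"
    [ "$errors" -eq 0 ] && break
done
```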

This implements the "standard" CRM114ish Markovian type discrimination, but
you could also mix and match different tokenizers and combination algorithms
if you wanted to play around. It's quite possible you may get better results
from a different combo. The only thing that is certain is that the value
computing algorithm should always be 'markov'.