File: quickref.txt

QSF(1)				 User Manuals				QSF(1)

NAME
       qsf - quick spam filter

SYNOPSIS
       Filtering:	qsf [-snrAtav] [-d DB] [-g DB]
			    [-L LVL] [-S SUBJ] [-H MARK] [-Q NUM]
			    [-X NUM]
       Training:	qsf -T SPAM NONSPAM [MAXROUNDS] [-d DB]
       Retraining:	qsf -[m|M] [-d DB] [-w WEIGHT] [-ayN]
       Database:	qsf -[p|D|R|O] [-d DB]
       Database merge:	qsf -E OTHERDB [-d DB]
       Allowlist query: qsf -e EMAIL [-m|-M|-t] [-d DB] [-g DB]
       Denylist query:	qsf -y -e EMAIL [-m -m|-M -M|-t] [-d DB] [-g DB]
       Help:		qsf -[h|V]

DESCRIPTION
       qsf  reads  a single email on standard input, and by default outputs it
       on standard output.  If the email is determined to be  spam,  an	 addi-
       tional header ("X-Spam: YES") will be added, and optionally the subject
       line can have "[SPAM]" prepended to it.

       qsf is intended to be used in a procmail(1) recipe, in a	 ruleset  such
       as this:

	       :0 wf
	       | qsf -ra

	       :0 H:
	       * X-Spam: YES
	       $HOME/mail/spam

       For  more examples, including sample procmail(1) recipes, see the EXAM-
       PLES section below.

TRAINING
       Before qsf can be used properly, it needs to be trained.	 A good way to
       train qsf is to collect a copy of all your email into two folders - one
       for spam, and one for non-spam.	Once you have done this, you  can  use
       the training function, like this:

	       qsf -aT spam-folder non-spam-folder

       This  will generate a database that can be used by qsf to guess whether
       email received in the future is spam or not.  Note  that	 this  initial
       training	 run  may  take a long time, but you should only need to do it
       once.

       To mark a single message as spam, pipe it to qsf with  the  --mark-spam
       or  -m  ("mark as spam") option.	 This will update the database accord-
       ingly and discard the email.

       To mark a single message as non-spam, pipe it to qsf with  the  --mark-
       nonspam	or  -M	("mark as non-spam") option.  Again, this will discard
       the email.
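
       For a backlog of individual message files (for instance the contents
       of a Maildir's cur/ directory), the single-message marking described
       above can be wrapped in a loop.  This is only a minimal sketch: the
       function name mark_dir and the one-message-per-file layout are
       assumptions, and only the -m and -M options come from this manual.

```shell
# mark_dir KIND DIR: pipe every regular file in DIR to qsf, one at a time.
# KIND is "spam" (uses -m) or "nonspam" (uses -M).
# Hypothetical helper; assumes each file holds exactly one email.
mark_dir() {
    case "$1" in
        spam)    flag=-m ;;
        nonspam) flag=-M ;;
        *) echo "usage: mark_dir spam|nonspam DIR" >&2; return 2 ;;
    esac
    for msg in "$2"/*; do
        [ -f "$msg" ] || continue          # skip subdirectories and empty globs
        qsf "$flag" < "$msg" || return 1   # stop at the first failure
    done
}

# Example (not run here):
#   mark_dir spam "$HOME/Maildir/.Junk/cur"
```

       Messages are processed one after another rather than in parallel,
       since each call updates the database.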

       If a message has been mis-tagged, simply send it to qsf as the opposite
       type,  i.e.  if it has been mistakenly tagged as spam, pipe it into qsf
       --mark-nonspam --weight=2 to  add  it  to  the  non-spam	 side  of  the
       database with double the usual weighting.
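
       The correction step above can be wrapped in a small shell function so
       that the weighting is not forgotten.  The name qsf_correct and the
       default weight of 2 are choices made for this sketch; the options it
       passes are the ones documented in this manual.

```shell
# qsf_correct KIND [WEIGHT]: re-file the message on stdin as KIND
# ("spam" or "nonspam") with extra weight; WEIGHT defaults to 2.
# Hypothetical wrapper around the documented --mark-* and --weight options.
qsf_correct() {
    weight=${2:-2}
    case "$1" in
        spam)    qsf --mark-spam --weight="$weight" ;;
        nonspam) qsf --mark-nonspam --weight="$weight" ;;
        *) echo "usage: qsf_correct spam|nonspam [WEIGHT]" >&2; return 2 ;;
    esac
}

# Example (not run here): a message wrongly tagged as spam
#   qsf_correct nonspam < wrongly-tagged.eml
```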

OPTIONS
       The qsf options are listed below.

       -d, --database [TYPE:]FILE
	      Use  FILE	 as the spam/non-spam database.	 The default is to use
	      /var/lib/qsfdb and, if that is not available  or	is  read-only,
	      $HOME/.qsfdb.  This option can also be useful if there is a sys-
	      tem-wide database but you do not want to	use  it	 -  specifying
	      your own here will override the default.

	      If   you	 prefix	  the  filename	 with  a  TYPE,	 of  the  form
	      btree:$HOME/.qsfdb, then this will specify what kind of database
	      FILE is, such as list, btree, gdbm, sqlite and so on.  Check the
	      output of qsf -V to see which database backends  are  available.
	      The default is to auto-detect the type, or, if the file does not
	      already exist, use list.	Note that TYPE is not  case-sensitive.

       -g, --global [TYPE:]FILE
	      Use   FILE   as	the   default	global	database,  instead  of
	      /var/lib/qsfdb.  If you also specify a database  with  -d,  then
	      this  "global"  database	will be used in read-only mode in con-
	      junction with the read-write database specified with -d.	The -g
	      option  can  be  used a second time to specify a third database,
	      which will also be used in read-only mode.  Again, the  filename
	      can  optionally  be  prefixed  with  a  TYPE which specifies the
	      database type.

       -P, --plain-map FILE
	      Maintain a mapping of all database tokens	 to  their  non-hashed
	      counterparts in FILE, one token per line.	 This can be useful if
	      you want to be able to list the contents of your database	 at  a
	      later  date,  for	 instance  to get a list of email addresses in
	      your allow-list.	Note that using this option may slow qsf down,
	      and  only	 entries  written to the database while this option is
	      active will be stored in FILE.

       -s, --subject
	      Rewrite the Subject line of any email that turns out to be spam,
	      adding "[SPAM]" to the start of the line.

       -S, --subject-marker SUBJECT
	      Instead  of  adding "[SPAM]", add SUBJECT to the Subject line of
	      any email that turns out to be spam.  Implies -s.

       -H, --header-marker MARK
	      Instead of setting the X-Spam header to "YES", set it to MARK if
	      email  turns  out	 to be spam.  This can be useful if your email
	      client can only search all headers for a string, rather than one
	      particular  header (so searching for "YES" might match more than
	      just the output of qsf).

       -n, --no-header
	      Do not add an X-Spam header to messages.

       -r, --add-rating
	      Insert an additional header X-Spam-Rating which is a  rating  of
	      the  "spamminess"	 of  a message from 0 to 100; 90 and above are
	      counted as spam, anything under 90 is not considered  spam.   If
	      combined with -t, then the rating (0-100) will be output, on its
	      own, on standard output.

       -A, --asterisk
	      Insert an additional  header  X-Spam-Level  which	 will  contain
	      between 0 and 20 asterisks (*), depending on the spam rating.

       -t, --test
	      Instead  of  passing  the message out on standard output, output
	      nothing, and exit 0 if the message is not spam, or exit 1 if the
	      message is spam.	If combined with -r, then the spam rating will
	      be output on standard output.

       -a, --allowlist
	      Enable the allow-list.  This causes the email addresses given in
	      the  message's  "From:" and "Return-Path:" headers to be checked
	      against a list; if either	 one  matches,	then  the  message  is
	      always  treated  as  non-spam,  regardless  of  what  the	 token
	      database says. When specified with  a  retraining	 flag,	-a  -m
	      (mark  as	 spam) will remove that address from the allow-list as
	      well as marking the message as spam, and -a  -M  (mark  as  non-
	      spam) will add that address to the allow-list as well as marking
	      the message as non-spam.	The idea is that you add all  of  your
	      friends  to the allow-list, and then none of their messages ever
	      get marked as spam.

       -y, --denylist
	      Enable the deny-list.  This causes the email addresses given  in
	      the  message's  "From:" and "Return-Path:" headers to be checked
	      against a second list; if either one matches, then  the  message
	      is  always  treated  as spam.  Training works in the same way as
	      with -a, except that you must specify -m or -M twice  to	modify
	      the  deny-list  instead  of the allow-list, and with the reverse
	      syntax: -y -m -m (mark as spam) will add	that  address  to  the
	      deny-list,  whereas -y -M -M (mark as non-spam) will remove that
	      address from the deny-list.  This	 double	 specification	is  so
	      that  the	 usual retraining process never touches the deny-list;
	      the deny-list should be carefully maintained rather  than	 auto-
	      matically generated.

	      Normally you would not need to use the deny-list.

       -L, --level, --threshold LEVEL
	      Change  the  spam	 scoring threshold level which must be reached
	      before an email is classified as spam.  The default is 90.

       -Q, --min-tokens NUM
	      Only give a score if more than NUM tokens are found in the  mes-
	      sage  -  otherwise the message is assumed to be non-spam, and it
	      is not modified in any way.  The	default	 is  0.	  This	option
	      might  be	 useful if you find that very short messages are being
	      frequently miscategorised.

       -e, --email, --email-only EMAIL
	      Query or update the  allow-list  entry  for  the	email  address
	      EMAIL.   With no other options, this will simply output "YES" if
	      EMAIL is in the allow-list, or "NO" if it is not.	 With  -t,  it
	      will  not output anything, but will exit 0 (success) if EMAIL is
	      in the allow-list, or 1 (failure) if it  is  not.	 With  the  -m
	      (mark-spam) option, any previous allow-list entry for EMAIL will
	      be removed. Finally, with the -M	(mark-nonspam)	option,	 EMAIL
	      will be added to the allow-list if it is not already on it.

	      If  EMAIL is just the word MSG on its own, then an email will be
	      read from standard input, and the email addresses given  in  the
	      "From:" and "Return-Path:" headers will be used.

	      Using -e automatically switches on -a.

	      If  you also specify -y, then the deny-list will be operated on.
	      Remember that -m and -M are reversed with the deny-list.

	      If you specify an email address of the form @domain (nothing
	      before the @), then the whole domain will be allow-listed or
	      deny-listed.

       -v, --verbose
	      Add extra X-QSF-Info headers to any filtered  email,  containing
	      error  messages  and  so on if applicable.  Specify -v more than
	      once to increase verbosity.

       -T, --train SPAM NONSPAM [MAXROUNDS]
	      Train the database using the two mbox folders SPAM and  NONSPAM,
	      by testing each message in each folder and updating the database
	      each time a message is miscategorised.   This  is	 done  several
	      times, and may take a while to run.  Specify the -a (allow-list)
	      flag to add every sender in the NONSPAM folder  to  your	allow-
	      list  as a side-effect of the training process.  If MAXROUNDS is
	      specified, training will end after this number of rounds if  the
	      results  are  still not good enough. The default is a maximum of
	      200 rounds.

       -m, --mark-spam
	      Instead of passing the message out on standard output, mark  its
	      contents	as  spam  and update the database accordingly.	If the
	      allow-list (-a) is enabled, the message's "From:"	 and  "Return-
	      Path:"  addresses are removed from the allow-list.  If the deny-
	      list (-y) is enabled and you specify  -m	twice,	the  message's
	      addresses are added to the deny-list instead.

       -M, --mark-nonspam
	      Instead  of passing the message out on standard output, mark its
	      contents as non-spam and update the  database  accordingly.   If
	      the  allow-list  (-a)  is	 enabled,  the	message's  "From:" and
	      "Return-Path:" addresses are added to the allow-list (see the -a
	      option above).  If the deny-list (-y) is enabled and you specify
	      -M twice, the message's addresses are removed from the deny-list
	      instead.

       -w, --weight WEIGHT
	      When  marking  as	 spam  or non-spam, update the database with a
	      weighting of WEIGHT per token instead of the default of 1.  Use-
	      ful when correcting mistakes, eg a message that has been mistak-
	      enly detected as spam should  be	marked	as  non-spam  using  a
	      weighting	 of  2, i.e. double the usual weighting, to counteract
	      the error.

       -D, --dump [FILE]
	      Dump the contents of the database as a platform-independent text
	      file, suitable for archival, transfer to another machine, and so
	      on.  The data is output on stdout or into the given FILE.

       -R, --restore [FILE]
	      Rebuild the database from scratch from the text file  on	stdin.
	      If  a  FILE  is  given,  data is read from there instead of from
	      stdin.

       -O, --tokens
	      Instead of filtering, output a list of the tokens found  in  the
	      message read from standard input, along with the number of times
	      each token was found.  This is only useful if you	 want  to  use
	      qsf  as a general tokeniser for use with another filtering pack-
	      age.

       -E, --merge OTHERDB
	      Merge the OTHERDB database into the current database.  This can
	      be useful if you want to take one user's database and merge it
	      into the system-wide one, for instance (this would be done by,
	      as root, doing qsf -d /var/lib/qsfdb -E /home/user/.qsfdb and
	      then removing /home/user/.qsfdb).

       -B, --benchmark SPAM NONSPAM [MAXROUNDS]
	      Benchmark the training process using the two mbox	 folders  SPAM
	      and  NONSPAM.  A temporary database is created and trained using
	      the first 75% of the messages  in	 each  folder,	and  then  the
	      entire  contents	of each folder is tested to see how many false
	      positives and false negatives occur. Some timing information  is
	      also displayed.

	      This can be used to decide which backend is best on your system.
	      Use -d to select a backend, eg qsf -B spam  nonspam  -d  GDBM  -
	      this  will  create  a temporary database which is removed after-
	      wards.

	      The exception to this is the MySQL backend, where a full
	      database specification must be given (-d
	      MySQL:database=db;host=localhost;...) and the database table
	      given will not be wiped beforehand or dropped afterwards.

	      As  with	-T,  if MAXROUNDS is specified, training will never be
	      done for more than this number of rounds; the default is 200.

       -h, --help
	      Print a usage message on standard output and exit	 successfully.

       -V, --version
	      Print   version  information,  including	a  list	 of  available
	      database backends, on standard output and exit successfully.
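
       Scripts can build on the exit status of -t and on the rating printed
       by -t combined with -r, both described above.  The following is a
       minimal sketch that relies only on those two documented behaviours;
       the function names is_spam and spam_rating are hypothetical.

```shell
# is_spam FILE: succeed (exit 0) only if qsf -t reports FILE as spam.
# qsf -t exits 1 for spam and 0 for non-spam, so the status is inverted
# here; any other status (for example qsf missing) counts as "not spam".
is_spam() {
    status=0
    qsf -t < "$1" || status=$?
    [ "$status" -eq 1 ]
}

# spam_rating FILE: print the 0-100 rating that qsf -t -r writes to stdout.
spam_rating() {
    qsf -t -r < "$1" || true   # the exit status only encodes spam/non-spam
}

# Example (not run here):
#   if is_spam message.eml; then echo "rating: $(spam_rating message.eml)"; fi
```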

DEPRECATED OPTIONS
       The following options are  only	for  use  with	the  old  binary  tree
       database backend or old databases that haven't been upgraded to the new
       format that came in with version 1.1.0.

       -N, --no-autoprune
	      When marking as spam or nonspam, never automatically  prune  the
	      database.	 Usually the database is pruned after every 500 marks;
	      if you would rather --prune manually, use -N  to	disable	 auto-
	      matic pruning.

       -p, --prune
	      Remove  redundant	 entries  from	the database and clean it up a
	      little.  This is	automatically  done  after  several  calls  to
	      --mark-spam  or --mark-nonspam, and during training with --train
	      if the training takes a large number of  rounds,	so  it	should
	      rarely be necessary to use --prune manually unless you are using
	      -N / --no-autoprune.

       -X, --prune-max NUM
	      When the database is being pruned, no more than NUM entries will
	      be  considered  for  removal.  This is to prevent CPU and memory
	      resources being taken over.  The default is 100,000 but in  some
	      circumstances  (if  you  find  that pruning takes too long) this
	      option may be used to reduce it to a more manageable number.

FILES
       /var/lib/qsfdb
	      The default (system-wide) spam database.	If you wish to install
	      qsf  system-wide,	 this  should  be read-only to everyone; there
	      should be one user with write access who	can  update  the  spam
	      database with qsf --mark-spam and qsf --mark-nonspam when
	      necessary.

       /var/lib/qsfdb2
	      A second, read-only, system-wide database. This  can  be	useful
	      when  installing	qsf  system-wide  and  using  third-party spam
	      databases; the first global database can be updated with system-
	      specific	changes,  and this second database can be periodically
	      updated when the third-party spam database is updated.

       $HOME/.qsfdb
	      The default spam database	 for  per-user	data.	Users  without
	      write  access  to	 the system-wide database will have their data
	      written here, and the two databases will be read together.   The
	      per-user	database  will	be  given a weighting equivalent to 10
	      times the weighting of the global database.
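
       A housekeeping script sometimes needs to know which of these files
       will receive updates.  The sketch below mirrors the rule given under
       -d (use the system-wide file unless it is unavailable or read-only);
       it is an approximation for use in scripts, not qsf's own logic, and
       the function name default_qsfdb is hypothetical.

```shell
# default_qsfdb [SYSTEM_DB]: print the database a plain "qsf" call is
# expected to write to. SYSTEM_DB defaults to /var/lib/qsfdb and can be
# overridden for testing.
default_qsfdb() {
    system_db=${1:-/var/lib/qsfdb}
    if [ -f "$system_db" ] && [ -w "$system_db" ]; then
        printf '%s\n' "$system_db"
    else
        # Missing or read-only system database: fall back to per-user data.
        printf '%s\n' "$HOME/.qsfdb"
    fi
}

# Example (not run here):
#   ls -l "$(default_qsfdb)"
```

       Note that the -w test always succeeds for root, so the result is only
       meaningful for ordinary users.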

NOTES
       Currently, you cannot use qsf to check for spam while the  database  is
       being  updated.	 This  means  that while an update is in progress, all
       email is passed through as non-spam.

       There is an upper size limit of 512KB on incoming email; anything
       larger than this is just passed through as non-spam, to avoid tying up
       machine resources.

       The plaintext  token  mapping  maintained  by  --plain-map  will	 never
       shrink,	only  grow.   It  is intended for use by housekeeping and user
       interface scripts that, for instance, the user  can  use	 to  list  all
       email addresses on their allow-list.  These scripts should take care of
       weeding out entries for tokens that are no longer in the database.   If
       you  have no such scripts, there is probably no point in using --plain-
       map anyway.

       Avoid using the deny-list (-y) in any automated retraining, as it can
       cause the filter to reject mail unnecessarily.  In general the
       deny-list is probably best left unused unless explicitly required by
       your particular setup.

       If  both	 the  allow-list  and  the  deny-list  are enabled, then email
       addresses will first be checked against the deny-list, then the	allow-
       list, then the domain of the email address will be checked for matching
       "@domain" entries in the deny-list and then in the allow-list.

EXAMPLES
       To filter all of your mail through qsf, with the allow-list enabled and
       the  "spam  rating"  header  being  added, add this to your .procmailrc
       file:

	       :0 wf
	       | qsf -ra

       If you want qsf to add "[SPAM]" to the subject line of any messages  it
       thinks are spam, do this instead:

	       :0 wf
	       | qsf -sra

       To  automatically mark any email sent to spambox@yourdomain.com as spam
       (this is the "naive" version):

	       :0 H
	       * ^To:.*spambox@yourdomain.com
	       | qsf -am

       To do the same, but cleverly, so that  only  email  to  spambox@yourdo-
       main.com	 which	qsf  does  NOT already classify as spam gets marked as
       spam in the database (this  stops  the  database	 getting  too  heavily
       weighted):

	       # If sent to spambox@yourdomain.com:
	       :0
	       * ^To:.*spambox@yourdomain.com
	       {
		  :0 wf
		  | qsf -a

		  # The above two lines can be skipped if you've
		  # already piped the message through qsf.

		  # If the qsf database says it's not spam,
		  # mark it as spam!
		  :0 H
		  * ^X-Spam: NO
		  | qsf -am
	       }

       Remove the -a option in the above examples if you don't want to use the
       allow-list.

       A more complicated filtering example - this will only run qsf  on  mes-
       sages  which  don't  have a subject line saying "your <something> is on
       fire" and which don't have a sender address  ending  in	"@foobar.com",
       meaning	that  messages	with  that subject line OR that sender address
       will NEVER be marked as spam, no matter what:

	       :0 wf
	       * ! ^Subject: Your .* is on fire
	       * ! ^From: .*@foobar.com
	       | qsf -ra

       For more on  procmail(1)	 recipes,  see	the  procmailrc(5)  and	 proc-
       mailex(5) manual pages.

       A couple of macros to add to your .muttrc file, if you use mutt(1) as a
       mail user agent:

	       # Press F5 to mark a message as spam and delete it
	       macro index <f5> "<pipe-message>qsf -am\n<delete-message>"
	       macro pager <f5> "<pipe-message>qsf -am\n<delete-message>"

	       # Press F9 to mark a message as non-spam
	       macro index <f9> "<pipe-message>qsf -aM\n"
	       macro pager <f9> "<pipe-message>qsf -aM\n"

       Again, remove the -a option in the above examples if you don't want  to
       use the allow-list.

       Note,  however, that the above macros won't work when operating on mul-
       tiple tagged messages. For that, you'd need something like this:

	       macro index <f5> ":set pipe_split\n<tag-prefix><pipe-message>qsf -am\n<tag-prefix><delete-message>\n:unset pipe_split\n"

       If you use qmail(7), then to get procmail working with it you will need
       to  put	a  line	 containing just DEFAULT=./Maildir/ at the top of your
       ~/.procmailrc file, so that procmail delivers to	 your  Maildir	folder
       instead	of  trying  to	deliver to /var/spool/mail/$USER, and you will
       need to put this in your ~/.qmail file:

	       | preline procmail

       This will cause all your mail to be delivered via procmail  instead  of
       being delivered directly into your mail directory.

       See the qmail(7) documentation for more about mail delivery with qmail.

       If you use postfix(1), you can set up a system-wide mail filter by cre-
       ating a user account for the purpose of filtering mail, populating that
       account's .qsfdb, and then creating a shell  script,  to	 run  as  that
       user, which runs qsf on stdin and passes stdout to sendmail(8).

       Doing  this  requires  some knowledge of postfix configuration and care
       needs to be taken to avoid mail loops.  One qsf user's  full  HOWTO  is
       included in the doc/ directory with this package.
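
       The filter script described above might look roughly like the
       following sketch.  It is not the HOWTO shipped in doc/: the function
       name, the argument order (sender first, then recipients) and the use
       of sendmail -i -f are assumptions modelled on the usual Postfix pipe
       filter convention, and loop prevention is left entirely to the Postfix
       configuration.

```shell
# qsf_postfix_filter SENDER RECIPIENT...: tag the message on stdin with
# qsf, then hand the result back to the mail system for delivery.
# Hypothetical sketch; run it as the dedicated filtering user so that
# qsf picks up that account's .qsfdb.
qsf_postfix_filter() {
    sender=$1
    shift
    # -i: do not treat a lone "." line as end of input
    # -f: keep the original envelope sender
    qsf -ra | sendmail -i -f "$sender" -- "$@"
}

# Example (not run here), as the last line of the filter script:
#   qsf_postfix_filter "$@"
```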

THE ALLOW-LIST
       A  feature called the "allow-list" can be switched on by specifying the
       --allowlist or -a option.  This causes messages' "From:"	 and  "Return-
       Path:"  addresses  to be checked against a list of people you have said
       to allow all messages from, and if  a  message's	 "From:"  or  "Return-
       Path:"  address is in the list, it is never marked as spam.  This means
       you can add all your friends to an "allow-list" and qsf will then never
       mis-file	 their	messages - a quick way to do this is to use -a with -T
       (train); everyone in your non-spam folder who has  sent	you  an	 email
       will be added to the allow-list automatically during training.

       You  can	 manually  add and remove addresses to and from the allow-list
       using the -e (email) option. For instance, to add  foo@bar.com  to  the
       allow-list, do this:

	       qsf -e foo@bar.com -M

       To remove bad@nasty.com from the allow-list, do this:

	       qsf -e bad@nasty.com -m

       And  to	see whether someone@somewhere.com is in the allow-list or not,
       just do this:

	       qsf -e someone@somewhere.com
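
       The manual commands above can be scripted when many addresses need
       adding at once.  A minimal sketch, assuming a plain text file with one
       address (or @domain) per line; the function name allow_add_file and
       the comment syntax of that file are hypothetical.

```shell
# allow_add_file FILE: add every address listed in FILE (one per line)
# to the allow-list. Blank lines and lines starting with # are skipped.
allow_add_file() {
    while IFS= read -r addr; do
        case "$addr" in
            ''|'#'*) continue ;;
        esac
        # </dev/null keeps qsf from consuming the rest of the address list
        qsf -e "$addr" -M < /dev/null || return 1
    done < "$1"
}

# Example (not run here):
#   allow_add_file "$HOME/friends.txt"
```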

       In general, you probably always	want  to  enable  the  allow-list,  so
       always  specify	the -a option when using qsf.  This will automatically
       maintain the allow-list based on what you classify as spam or non-spam.

       The  only  times	 you might want to turn it off are when people on your
       allow-list are prone to getting viruses or if a virus is causing	 email
       to  be sent to you that is pretending to be from someone on your allow-
       list.
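
       When both lists are in use, the lookup order given in NOTES
       (deny-list address, allow-list address, deny-list domain, allow-list
       domain) can be reproduced from a script with the documented query
       forms.  This sketch only illustrates that order; whether a plain
       address query already consults "@domain" entries is not stated here,
       so each lookup is made explicitly, and the name list_verdict is
       hypothetical.

```shell
# list_verdict EMAIL: print "deny", "allow" or "none" for EMAIL, checking
# the lists in the order described in NOTES.
list_verdict() {
    addr=$1
    domain="@${addr#*@}"      # bob@example.org -> @example.org
    if   qsf -y -e "$addr" -t >/dev/null;   then echo deny
    elif qsf -e "$addr" -t >/dev/null;      then echo allow
    elif qsf -y -e "$domain" -t >/dev/null; then echo deny
    elif qsf -e "$domain" -t >/dev/null;    then echo allow
    else echo none
    fi
}

# Example (not run here):
#   list_verdict someone@somewhere.com
```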

BACKUP AND RESTORE
       Because the database format is platform-specific, it is a good idea  to
       periodically  dump the database to a text file using qsf -D so that, if
       necessary, it can be transferred to another machine and	restored  with
       qsf -R later on.

       Also  note  that	 since the actual contents of email messages are never
       stored in the database (see TECHNICAL DETAILS), you  can	 safely	 share
       your  qsf  database with friends - simply dump your database to a file,
       like this:

	       qsf -D > your-database-dump.txt

       Once you have sent your-database-dump.txt to another person,  they  can
       do this:

	       qsf -R < your-database-dump.txt

       They will then have an identical database to yours.
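
       For periodic backups, the dump can be written to a dated file.  A
       minimal sketch, assuming a cron job or similar calls it; the function
       name qsf_backup and the file naming scheme are hypothetical, while
       -d and -D are used as documented above.

```shell
# qsf_backup DB OUTDIR: dump DB to OUTDIR/qsfdb-YYYYMMDD.txt and print
# the path of the finished dump.
qsf_backup() {
    out="$2/qsfdb-$(date +%Y%m%d).txt"
    # Write to a temporary name first so a failed dump never replaces a
    # good one from earlier the same day.
    if qsf -d "$1" -D > "$out.tmp"; then
        mv "$out.tmp" "$out" && printf '%s\n' "$out"
    else
        rm -f "$out.tmp"
        return 1
    fi
}

# Example (not run here):
#   qsf_backup "$HOME/.qsfdb" "$HOME/backups"
```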

TECHNICAL DETAILS
       When  a message is passed to qsf, any attachments are decoded, all HTML
       elements are removed, and the message  text  is	then  broken  up  into
       "tokens",  where	 a  "token"  is	 a  single word or URL.	 Each token is
       hashed using the MD5 algorithm (see below for why), and	that  hash  is
       then used to look up each token in the qsf database.

       For  full  details  of  which parts of an email (headers, body, attach-
       ments, etc) are used to calculate the spam rating, see the TOKENISATION
       section below.

       Within the database, each token has two numbers associated with it: the
       number of times that token has been seen in spam,  and  the  number  of
       times  it has been seen in non-spam.  These two numbers, along with the
       total number of spam and non-spam messages seen, are then used to  give
       a  "spamminess"	value  for  that  particular token.  This "spamminess"
       value ranges from "definitely not spammy" at  one  end  of  the	scale,
       through "neutral" in the middle, up to "definitely spammy" at the other
       end.

       Once a "spamminess" value has been calculated for all of the tokens  in
       the  message, a summary calculation is made to give an overall "is this
       spam?"  probability rating for the message.  If the overall probability
       is 0.9 or above, the message is flagged as spam.

       In addition to the probability test, there is the "allow-list".  If
       enabled (with the -a option), the whole probability check is skipped
       when the sender of the message is listed in the allow-list, and the
       message is not marked as spam.

       When training the database, a  message  is  split  up  into  tokens  as
       described  above,  and  then the numbers in the database for each token
       are simply added to: if you tell qsf that a message is  spam,  it  adds
       one  to	the "number of times seen in spam" counter for each token, and
       if you tell it a message is not spam, it adds one  to  the  "number  of
       times  seen  in	non-spam"  counter  for	 each token.  If you specify a
       weight, with -w, then the number you specify is added instead of one.
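
       The counter update described above can be pictured with a toy
       database kept in a text file.  This is an illustration only: the real
       database stores hashed tokens in one of the backends listed by qsf
       -V, and the file format, the function name update_counts and the
       choice to count a repeated token once per occurrence are all
       assumptions.

```shell
# update_counts KIND WEIGHT FILE TOKEN...
# FILE holds lines of the form "token spam_count nonspam_count".
# Adds WEIGHT to the spam or non-spam counter of every TOKEN given.
update_counts() {
    kind=$1 weight=$2 file=$3
    shift 3
    awk -v kind="$kind" -v w="$weight" -v toks="$*" '
        BEGIN { n = split(toks, t, " "); for (i = 1; i <= n; i++) add[t[i]] += w }
        NF == 3 { spam[$1] = $2; ham[$1] = $3; seen[$1] = 1 }
        END {
            for (k in add) {
                seen[k] = 1
                if (kind == "spam") spam[k] += add[k]; else ham[k] += add[k]
            }
            for (k in seen) printf "%s %d %d\n", k, spam[k], ham[k]
        }
    ' "$file" | sort > "$file.new" && mv "$file.new" "$file"
}

# Example (not run here): the equivalent of marking as spam with -w 2
#   update_counts spam 2 counts.txt viagra cheap pills
```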

       To stop the database growing uncontrollably, the database  keeps	 track
       of  when	 a  token  was	last used.  Underused tokens are automatically
       removed from the database.  (The old method was to  "prune"  every  500
       updates).

       Finally, the reason MD5 hashes are used is privacy.  If the actual
       tokens from the messages, and the actual email addresses in the	allow-
       list,  were  stored,  you could not share a single qsf database between
       multiple users because bits of everyone's  messages  would  be  in  the
       database - things like emailed passwords, keywords relating to personal
       gossip, and so on.  So a hash is stored instead.	 A hash is a "one-way"
       function;  it  is  easy to turn a token into a hash but very hard (some
       might say impossible) to turn a hash back into the token	 that  created
       it.  This means that you end up with a database with no personal infor-
       mation in it.
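
       The one-way property is easy to see from the command line.  The
       sketch below hashes a single token with md5sum(1) from GNU coreutils;
       how qsf normalises a token before hashing it, and how the hash is
       stored, are not described in this manual, so the function token_hash
       is illustrative only.

```shell
# token_hash TOKEN: print the MD5 digest of TOKEN as 32 hex digits.
# printf '%s' avoids hashing a trailing newline along with the token.
token_hash() {
    printf '%s' "$1" | md5sum | cut -d ' ' -f 1
}

# Example (not run here):
#   token_hash hello
```

       Nothing in the digest reveals the original word, yet the same word
       always yields the same digest, which is all a lookup needs.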

TOKENISATION
       When a message is broken up into tokens, various parts of  the  message
       are treated in different ways.

       First,  all header fields are discarded, except for the important ones:
       From, Return-Path, Sender, To, Reply-To, and Subject.

       Next, any MIME-encoded attachments are decoded.	Any attachments	 whose
       MIME type starts with "text/" (i.e. HTML and text) are tokenised, after
       having  any  HTML  tags	stripped.   Any	 non-textual  attachments  are
       replaced	 with their MD5 hash (such that two identical attachments will
       have the same hash), and that hash is then used as a token.

       In addition to single-word tokens from textual message parts, qsf  adds
       doubled-up  tokens  so that word pairs get added to the database.  This
       makes the database a bit bigger (although the automatic	pruning	 tends
       to take care of that) but makes matching more exact.
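
       The doubled-up tokens can be pictured as follows.  This is a rough
       sketch of the idea only: it splits on whitespace and joins each
       adjacent pair with a space, whereas the actual word splitting and
       pair format used by qsf are not specified here.  The function name
       tokens_with_pairs is hypothetical.

```shell
# tokens_with_pairs: read text on stdin, print each word followed by the
# pair it forms with the word before it (if there is one).
tokens_with_pairs() {
    tr -s '[:space:]' '\n' |
    awk 'NF { print; if (prev != "") print prev " " $0; prev = $0 }'
}

# Example (not run here):
#   printf 'buy cheap pills\n' | tokens_with_pairs
```

       For the input "buy cheap pills" this yields the three single words
       plus the pairs "buy cheap" and "cheap pills".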

SPECIAL FILTERS
       As  well as using the textual content of email to detect spam, qsf also
       uses special filters which  create  "pseudo-tokens"  based  on  various
       rules.	This  means that specific patterns, not just individual words,
       can be used to determine whether a message is spam or not.

       For example, if a message contains lots of words with many consonants
       in a row, like "ashjkbnxcsdjh", then each time a word like that is
       seen the special token ".GIBBERISH-CONSONANTS." is added to the list
       of tokens found in the message.  If it turns out that most messages
       with words that trigger this filter rule are spam, then other messages
       with gibberish consonant strings will be more likely to be flagged as
       spam.
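
       A minimal Python sketch of such a rule follows.  The threshold of six
       consecutive consonants and the names CONSONANT_RUN and
       gibberish_tokens are invented for this example; the limit that qsf
       really applies is not stated here.

```python
import re

# Six or more consonants in a row (assumed threshold, not qsf's).
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxyz]{6,}", re.IGNORECASE)


def gibberish_tokens(words):
    """Emit one pseudo-token for every word containing a consonant run."""
    return [".GIBBERISH-CONSONANTS." for word in words
            if CONSONANT_RUN.search(word)]


print(gibberish_tokens(["hello", "ashjkbnxcsdjh", "strengths"]))
```

       The pseudo-token is then weighted like any other token, so the
       filter only matters if it correlates with spam in the training data.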

       Currently the special filters are:

       GTUBE  Flags	 any	  message      containing      the	string
	      XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-
	      EMAIL*C.34X as spam - useful for testing that your qsf installa-
	      tion is working.

       ATTACH-SCR

       ATTACH-PIF

       ATTACH-EXE

       ATTACH-VBS

       ATTACH-VBA

       ATTACH-LNK

       ATTACH-COM

       ATTACH-BAT
	      Adds a token for every attachment whose filename ends in ".scr",
	      ".pif", ".exe",  ".vbs",	".vba",	 ".lnk",  ".com",  and	".bat"
	      respectively (these are often viruses).

       ATTACH-GIF

       ATTACH-JPG

       ATTACH-PNG
	      Adds a token for every attachment whose filename ends in ".gif",
	      ".jpg" or ".jpeg", and ".png" respectively.

       ATTACH-DOC

       ATTACH-XLS

       ATTACH-PDF
	      Adds a token for every attachment whose filename ends in ".doc",
	      ".xls",  or  ".pdf"  respectively (these tend to indicate a non-
	      spam email).

       SINGLE-IMAGE
	      Adds a token if the message contains exactly one attached image.

       MULTIPLE-IMAGES
	      Adds  a  token  if  the  message contains more than one attached
	      image.

       GIBBERISH-CONSONANTS
	      Adds a token for every word found that has  multiple  consonants
	      in  a  row,  as described above.	Spam often contains strings of
	      gibberish.

       GIBBERISH-VOWELS
	      Adds a token for every word found that has multiple vowels in  a
	      row, eg "aeaiaiaeeio".

       GIBBERISH-FROMCONS
	      Like GIBBERISH-CONSONANTS, but only for the "From:" and "Return-
	      Path:" addresses on their own.

       GIBBERISH-FROMVOWL
	      Like GIBBERISH-VOWELS, but only for  the	"From:"	 and  "Return-
	      Path:" addresses on their own.

       GIBBERISH-BADSTART
	      Adds  a  token  for  every word that starts with a bad character
	      such as %.

       GIBBERISH-HYPHENS
	      Adds a token for every word with	more  than  three  hyphens  or
	      underscores in it.

       GIBBERISH-LONGWORDS
	      Adds a token for every word with over 30 characters in it (but
	      fewer than 60).

       HTML-COMMENTS-IN-WORDS
	      Adds a token for every HTML comment found in  the	 middle	 of  a
	      word.   Spam  often  contains  HTML  inside  words,  like	 this:
	      w<!--dsgfhsdgjgh-->ord

       HTML-EXTERNAL-IMG
	      Adds a token for every HTML <img> (image) tag  found  that  con-
	      tains :// (i.e.  it refers to an external image).

       HTML-FONT
	      Adds a token for every HTML <font> tag found.

       HTML-IP-IN-URLS
	      Adds a token for every URL found containing an IP address.

       HTML-INT-IN-URL
	      Adds  a  token  for every URL found containing an integer in its
	      hostname.

       HTML-URLENCODED-URL
	      Adds a token for every URL found containing  a  %	 sign  in  its
	      hostname.

       Normally, filters will just cause a token to be added, and these
       tokens are processed by the normal weighting algorithm.  However, the
       GTUBE filter will immediately flag any matching message as spam,
       bypassing the token matching.
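
       As a sketch of how the attachment filters in the list above might
       map filenames to pseudo-tokens, consider the Python below.  The
       token spelling (filter name wrapped in dots, by analogy with
       ".GIBBERISH-CONSONANTS.") and the case-insensitive comparison are
       assumptions, and attachment_tokens is a hypothetical name.

```python
# Filename endings and the filter each one belongs to, as listed above.
EXTENSION_FILTERS = {
    ".scr": "ATTACH-SCR", ".pif": "ATTACH-PIF", ".exe": "ATTACH-EXE",
    ".vbs": "ATTACH-VBS", ".vba": "ATTACH-VBA", ".lnk": "ATTACH-LNK",
    ".com": "ATTACH-COM", ".bat": "ATTACH-BAT",
    ".gif": "ATTACH-GIF", ".jpg": "ATTACH-JPG", ".jpeg": "ATTACH-JPG",
    ".png": "ATTACH-PNG",
    ".doc": "ATTACH-DOC", ".xls": "ATTACH-XLS", ".pdf": "ATTACH-PDF",
}


def attachment_tokens(filenames):
    """Return one pseudo-token per attachment with a listed ending."""
    tokens = []
    for name in filenames:
        lowered = name.lower()  # case-insensitive match is an assumption
        for ending, filter_name in EXTENSION_FILTERS.items():
            if lowered.endswith(ending):
                tokens.append("." + filter_name + ".")
    return tokens


print(attachment_tokens(["invoice.pdf", "photo.JPEG", "notes.txt"]))
```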

DATABASE BACKENDS
       The inbuilt "list" database backend will not  necessarily  provide  the
       best performance, but is provided because using it requires no external
       libraries.

       If, when qsf was compiled, the correct libraries were  available,  then
       it  will be possible to use qsf with alternative database backends.  To
       find out which backends you have available, run qsf -V (capital V)  and
       read  the  second  line of output.  To see how well a backend performs,
       collect some spam and non-spam and use qsf -d BACKEND -B	 SPAM  NONSPAM
       (see the entry for -B above).

       Some people find that they get the best performance out of the gdbm
       backend; gdbm is a library that is available on many systems.

       To efficiently share a qsf database across multiple machines,  you  may
       find the MySQL backend useful.  However, using it is a little more com-
       plicated.

       To use the MySQL backend you will need to create a table with the
       fields key1, key2, token, value1, value2 and value3.  The token field
       must be VARCHAR(64), and the value1, value2 and value3 fields must
       each be BIGINT or INT.  Indexing on the token field is a good idea.
       The key1 and key2 fields can be of any type, but they must be present.

       For example:

		USE mydatabase;
		CREATE TABLE qsfdb (
		  key1	    BIGINT UNSIGNED NOT NULL,
		  key2	    BIGINT UNSIGNED NOT NULL,
		  token	    VARCHAR(64) DEFAULT '' NOT NULL,
		  value1    INT UNSIGNED NOT NULL,
		  value2    INT UNSIGNED NOT NULL,
		  value3    INT UNSIGNED NOT NULL,
		  PRIMARY KEY (key1,key2,token),
		  KEY (key1),
		  KEY (key2),
		  KEY (token)
		);

       The  key1  and  key2 fields allow you to have multiple qsf databases in
       one table, by specifying different key1 and key2 values on  invocation.

       Instead	of specifying a database file with the --database / -d option,
       you must specify either a specification string as described  below,  or
       the name of a file containing such a string on its first line.

       The specification string is as follows:

		database=DATABASE;host=HOST;port=PORT;
		user=USER;pass=PASS;table=TABLE;
		key1=KEY1;key2=KEY2

       This string must be all on one line, with no spaces.

       DATABASE
	      is the name of the MySQL database.

       HOST   is the hostname of the database server (eg "localhost").

       PORT   is the TCP port to connect on (eg 3306).

       USER   is the username to connect with.

       PASS   is the password to connect with.

       TABLE  is  the  database	 table to use.	If a table with this name does
	      not exist when qsf is called in update or training mode, then it
	      will be created if permissions allow this to be done.

       KEY1   is the value to use for the key1 field.

       KEY2   is the value to use for the key2 field.

       Since  command  lines  can  be seen in the process list, it is probably
       best to specify a filename (eg qsf -d  mysql:qsfdb.spec)	 and  put  the
       specification string inside that file.
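
       One way to generate such a file is sketched below in Python.  The
       helper names build_spec and write_spec_file are hypothetical, and
       restricting the file to its owner (mode 0600) is a suggestion of
       this example rather than a qsf requirement.  Values containing
       semicolons are not handled.

```python
import os


def build_spec(database, host, port, user, password, table, key1, key2):
    """Assemble the one-line specification string in the order shown above."""
    fields = [
        ("database", database), ("host", host), ("port", port),
        ("user", user), ("pass", password), ("table", table),
        ("key1", key1), ("key2", key2),
    ]
    spec = ";".join(f"{name}={value}" for name, value in fields)
    # The string must be on one line with no spaces.
    if any(ch.isspace() for ch in spec):
        raise ValueError("specification string must not contain whitespace")
    return spec


def write_spec_file(path, spec):
    """Write the string as the first line of a file.

    Mode 0600 applies only when the file is newly created.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(spec + "\n")


print(build_spec("mail", "localhost", 3306, "qsf", "secret", "qsfdb", 1, 2))
```

       The resulting file could then be named on the command line in the
       way shown above, for example qsf -d mysql:qsfdb.spec.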

TROUBLESHOOTING
       If  you	have  problems	with qsf, please check the list below; if this
       does not help, go to the qsf home  page	and  investigate  the  mailing
       lists, or email the author.

       Nothing is being marked as spam.
	      First,  use the -r option to switch on the X-Spam-Rating header,
	      and check that this header appears in email passed through  qsf.
	      If  it  does not, then it is likely that qsf is not being run at
	      all - check your configuration of procmail(1) or its equivalent.

	      If  you  are  seeing X-Spam-Rating headers, and different emails
	      have different scores, then you may simply need to retrain  your
	      database a little more.  Take more spam email and pass it to qsf
	      -m.

	      If you are seeing X-Spam-Rating headers but they	all  give  the
	      same spam rating, then the most likely reason is that qsf is not
	      reading any database.  Make sure that whatever is processing the
	      email  has  read permissions on /var/lib/qsfdb and/or ~/.qsfdb -
	      and make sure  that,  if	you  are  using	 ~/.qsfdb,  what  your
	      database	creator thought was ~ ($HOME) is the same as it is for
	      whatever is processing the email.
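
              The three cases above can be told apart mechanically.  The
              Python sketch below reads the X-Spam-Rating header from a
              set of filtered messages and reports which case applies.
              The function names are hypothetical, and the header value is
              treated as an opaque string because its format is not
              described here.

```python
from email import message_from_string


def spam_ratings(raw_messages):
    """Return the X-Spam-Rating value of each message (None if absent)."""
    return [message_from_string(raw).get("X-Spam-Rating")
            for raw in raw_messages]


def diagnose(ratings):
    """Map a list of ratings to one of the cases described above."""
    if any(rating is None for rating in ratings):
        return "header missing: qsf is probably not being run"
    if len(ratings) < 2:
        return "need at least two rated messages to compare"
    if len(set(ratings)) == 1:
        return "identical ratings: qsf is probably not reading a database"
    return "ratings vary: the database may need more training"


# The rating values here are made up for the example.
sample = ["X-Spam-Rating: 10\n\nhello\n", "X-Spam-Rating: 10\n\nworld\n"]
print(diagnose(spam_ratings(sample)))
```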

       Retraining sometimes takes a very long time.
	      With the obtree backend or  2-column  MySQL  or  SQLite  tables,
	      every 500th retrain (-m or -M), the database is pruned.  On some
	      systems this may take  some  time,  and  during  this  time  the
	      database	is locked (except when using the MySQL or SQLite back-
	      ends).  If you constantly do a lot of  retraining	 and  want  to
	      avoid this, then use the -N option to suppress auto-pruning, and
	      then have a cron(8) job or something run a manual prune (qsf -p)
	      every now and again.

       Running qsf from procmail fails with an error.
	      If  you  can run qsf from the command line, but in your procmail
	      log file you get errors about "qsf: cannot execute binary file",
	      then  contact your system administrator for help. It may be that
	      incoming email is handled by a different server to the  one  you
	      normally	shell  into, and either they are of a different archi-
	      tecture or operating system, or the mail server is not permitted
	      to execute user-owned binaries.

ACKNOWLEDGEMENTS
       The  following  people have contributed suggestions, comments, patches,
       and testing:

	      Tom Parker <http://www.bits.bris.ac.uk/palfrey/>
	      Dr Kelly A. Parker
	      Vesselin Mladenov <http://www.antipodes.bg/>
	      Glyn Faulkner
	      Mark Reynolds
	      Sam Roberts
	      Scott Allen
	      Karsten Kankowski
	      M. Kolbl
	      Micha Holzmann
	      Jef Poskanzer <http://www.acme.com/jef/>
	      Clemens Fischer <http://ino-waiting.gmxhome.de/>
	      Nelson A. de Oliveira
	      Michal Vitecek
	      Tommy Pettersson <http://www.lysator.liu.se/~ptp/>

AUTHOR
       The author:

	      Andrew Wood <andrew.wood@ivarch.com>
	      http://www.ivarch.com/

       Project home page:

	      http://www.ivarch.com/programs/qsf/

BUGS
       If you find any bugs, please contact the author, either by email or  by
       using the contact form on the web site.

SEE ALSO
       procmail(1), procmailrc(5), procmailex(5)

       Someone	has  written a guide to using qsf with KMail that can be found
       at:
       http://www.softwaredesign.co.uk/Information.SpamFilters.html

LICENSE
       This is free software, distributed under the ARTISTIC 2.0 license.

Linux				  August 2007				QSF(1)