File: ChangeLog

2006-11-16  Brian Burton  <brian@burton-computer.com>

	* Released as 1.4d

	* configure.ac: Added ability to selectively disable image
	  processing using --without-gif, --without-jpeg, and/or
	  --without-png.

	* src/spamprobe/spamprobe.cc (set_headers): Added ability to
	  selectively ignore individual headers using -H-headername.

	* src/includes/Ptr,Ref,Array.h: Restored missing <cassert> include.

	* src/parser/PngParser.cc (tokenizeImage): added basic tokens from
	  PNG images.

2006-11-16  Brian Burton  <brian@localhost.localdomain>

	* src/parser/PngParser.cc (PngParser): Stub for PNG parsing using
	  libpng.

	* src/parser/JpegParser.cc (tokenizeMarker): Preliminary
	  implementation of jpeg parsing using jpeglib.

	* configure.ac: Auto-detects either libungif or libgif, depending
	  on which one is available.

2007-01-04  Brian Burton  <brian@burton-computer.com>

	* Released as 1.4c

	* spamprobe.1: Modified man page to remove unnecessary information
	  and make it conform more closely to man page conventions.

	* src/spamprobe/spamprobe.cc (process_extended_options): added
	  ignore-body option.

	* src/parser/HeaderPrefixList.cc (HeaderPrefixList::addHeaderPrefix):
	  Forced header prefixes and names to lower case instead of
	  relying on an assert to enforce the restriction.

	* src/database/FrequencyDBImpl_hash.cc (hash::FrequencyDBImpl_hash): 
	  Disabled experimental hash database auto-cleaning.

	* src/includes/Ref.h: Removed cassert include.

	* src/spamprobe/spamprobe.cc (process_extended_options): Added
	  whitelist option to allow use of SP as a bayesian white list in
	  conjunction with other filters.

2006-02-17  Brian Burton  <brian@burton-computer.com>

	* Released as 1.4b

	* src/parser/TraditionalMailMessageParser.cc (parseBody): Fixed
	  bug reported by Jphn Chandler that prevented tokens from being
	  extracted from headers of messages with no body.

	* src/input/SimpleMultiLineStringCharReader.cc (class
	  SimpleMultiLineStringCharReaderPosition): Fixed crash bug
	  reported by David Rosen when a missing base64 encoded body was
	  parsed.

2006-01-30  Brian Burton  <brian@burton-computer.com>

	* Released as 1.4a

2006-01-28  Brian Burton  <brian@burton-computer.com>

	* src/includes/LRUCache.h (LRUCache): Reimplemented using STL list
	  class.  Map uses Node ptr as key to avoid having two copies of
	  the key in memory.  Changed iterators to use STL style syntax.

	* src/parser/PhrasingTokenizer.cc (compactChars): Improved
	  efficiency.
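
	  Illustration for the LRUCache entry above.  This is a minimal
	  sketch of the general technique (an STL list of nodes plus a
	  map keyed by a pointer to the key stored in each node), not
	  the code in LRUCache.h.  The class name LruSketch and all
	  member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch; names do not come from LRUCache.h.
template <typename K, typename V>
class LruSketch {
    typedef std::pair<K, V> Node;
    typedef std::list<Node> NodeList;

    // Orders key pointers by the keys they point at, so the map never
    // needs its own copy of a key.
    struct KeyPtrLess {
        bool operator()(const K *a, const K *b) const { return *a < *b; }
    };
    typedef std::map<const K *, typename NodeList::iterator, KeyPtrLess> Index;

public:
    explicit LruSketch(std::size_t capacity) : m_capacity(capacity) {
        assert(capacity > 0);
    }

    void put(const K &key, const V &value) {
        typename Index::iterator found = m_index.find(&key);
        if (found != m_index.end()) {
            // Existing key: update in place and move to the front.
            found->second->second = value;
            m_nodes.splice(m_nodes.begin(), m_nodes, found->second);
            return;
        }
        if (m_nodes.size() == m_capacity) {
            // Erase the index entry first: it points into the node
            // that is about to be dropped.
            m_index.erase(&m_nodes.back().first);
            m_nodes.pop_back();
        }
        m_nodes.push_front(Node(key, value));
        m_index[&m_nodes.front().first] = m_nodes.begin();
    }

    // Returns a pointer to the cached value, or a null pointer on a miss.
    V *get(const K &key) {
        typename Index::iterator found = m_index.find(&key);
        if (found == m_index.end()) return 0;
        m_nodes.splice(m_nodes.begin(), m_nodes, found->second);
        return &found->second->second;
    }

    std::size_t size() const { return m_nodes.size(); }

private:
    std::size_t m_capacity;
    NodeList m_nodes;  // front = most recently used
    Index m_index;     // pointer to key inside a node -> node position
};
```

	  Splicing within one std::list does not move elements, so the
	  key pointers held by the map stay valid while nodes are
	  reordered.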

2006-01-25  Brian Burton  <brian@burton-computer.com>

	* src/parser/MimeDecoder.cc (next_char64): Fixed potential array
	  bounds overflow bug in base64 decoding.  Thanks to Chris Ross
	  for the bug report.

	* src/spamprobe.cc: (and various other files) Added
	  min_phrase_chars option that causes parser to keep adding tokens
	  for phrases until they are at least min_phrase_chars long
	  instead of stopping at phrase word limit.  This might be useful
	  for catching "v i a g r a" as a single term.  So far though
	  experiments with using this option are not very promising.
	
	* src/spamprobe/Command_auto_train.cc (execute): Added LOG
	  sub-command to log each processed message to stdout along with
	  whether or not it had been scored successfully prior to
	  training.  This is useful for experiments to determine how fast
	  SP can learn.

	* src/spamprobe/Command_receive.cc (createScoreCommand): Fixed
	  documentation bug and improved online help for score command.
	  Thanks to Chris Ross for the bug report.

2006-01-03  Brian Burton  <brian@burton-computer.com>

	* src/parser/MailMessage.cc (MailMessage): Removed redundant
	  bounds check.

	* src/utility/MultiLineString.cc (line): Added bounds check.

	* src/parser/HtmlTokenizer.cc (tokenize): Added TempPtr to ensure
	  reader and receiver are cleared between runs.

	* src/utility/MultiLineSubString.cc (m_target): Fixed boundary
	  conditions if passed in indexes are outside bounds of target.

2005-12-28  Brian Burton  <brian@burton-computer.com>

	* Released as 1.4.

	* src/includes/LRUPtrCache.h (LRUPtrCache): Fixed clear() bug that
	  caused -P command line option to crash on some architectures due
	  to an invalid delete.

	* src/database/FrequencyDBImpl_bdb.cc (openDatabase): Fixed
	  relative path database opening bug in CDB mode.  (Thanks to
	  Nicolas Duboc for report and suggested fix).

	* src/spamprobe/AbstractCommand.cc (openDatabase): All non-read
	  only commands will now create the database directory if it's
	  missing when they open the database for the first time and
	  the -c option was specified on the command line.

	* src/database/HashDataFile.cc (close): Replaced clear() with
	  erase().

	* src/spamprobe/Command_help.cc (printCommandLineOptions): Added
	  comprehensive command line option help.

	* src/input/IstreamCharReader.cc (forward): Now uses rdbuf
	  directly when reading characters and seeking to eliminate extra
	  overhead of istream class.

	* src/spamprobe/spamprobe.cc (main): -V option now returns exit
	  code 0 instead of 1.

2005-12-25  Brian Burton  <brian@burton-computer.com>

	* src/database/FrequencyDBImpl_cache.cc (writeWord): Modified
	  database cache to keep modified/unmodified terms within cache
	  size limit.
	  (flush): number of records written to disk per transaction
	  limited to avoid problem with PBL using excessive amounts of
	  memory to write a large cache to disk if the database is large.

2005-12-23  Brian Burton  <brian@burton-computer.com>

	* src/database/DatabaseConfig.cc (createDatabaseImpl): Restored
	  use of database cache.

2005-12-21  Brian Burton  <brian@burton-computer.com>

	* src/includes/Ref.h (T>): Eliminated RefObject base class.

2005-12-18  Brian Burton  <brian@burton-computer.com>

	* src/utility/util.cc (to_7bits): Removed use of string::+=

	* src/input/LineReader.cc (forward): Removed use of string::+=

	* src/utility/MultiLineString.cc (parseText): Fixed bug that
	  caused each succeeding line in decoded text to contain all prior
	  lines as well.  Thanks to Nico for catching that one.
	  (parseText): More efficient algorithm for breaking string into
	  multiple lines without use of string::+=

	* src/input/SimpleMultiLineStringCharReader.cc (class
	  SimpleMultiLineStringCharReaderPosition): Refactored code into
	  position object.
	  (ensureCharReady): Cached calls to m_target->line() based on
	  profiler results showing they were consuming too much time
	  during parsing.

2005-12-17  Brian Burton  <brian@burton-computer.com>

	* src/parser/HtmlTokenizer.cc (processTagBody): Restored support
	  for -o suspicious-tags option.

	* src/parser/HtmlTokenizer.cc (processTagBody): Added tag specific
	  prefixes when parsing HTML tags.  Left in the old prefix (U_)
	  even though it collides with url terms for backward
	  compatibility with people who used -h option.  The compatibility
	  code should be removed after a few months.

2005-12-16  Brian Burton  <brian@burton-computer.com>

	* src/parser/MaildirMailMessageReader.cc (readMessage): Fixed
	  skipping of hidden files and sorting of files in cur and new.

2005-12-15  Brian Burton  <brian@burton-computer.com>

	* Released as 1.3x3.

	* Added support for maildir directories to all file based commands.

2005-12-13  Brian Burton  <brian@burton-computer.com>

	* src/spamprobe/AbstractMessageCommand.cc (processStream): Improved
	  auto-purge support to work for both token and mime streams and to
	  perform a final purge after processing all messages.

	* src/spamprobe/Command_auto_train.cc (execute): Added support for
	  auto-purge (-P command line option).

2005-12-11  Brian Burton  <brian@burton-computer.com>

	* src/spamprobe/Command_create_config.cc (execute): Added
	  create-config command to write a new config file based on the
	  current configuration.

	* src/spamprobe/Command_create_db.cc (execute): Added create-db
	  command to auto-create a database if none is present.

	* src/spamprobe/spamprobe.cc: Moved code from spamprobe.cc into
	  separate strategy objects for each supported command.

	* Added help command to print a list of all available commands
	  and (optionally) also provide a verbose description of any
	  named command.

	* Config file is not automatically generated if missing since that
	  caused some confusion for users who don't use config files.

2005-12-09  Brian Burton  <brian@burton-computer.com>

	* src/includes/Buffer.h (class Buffer): Added assertions and sanity
	  checks.  Made reset() exception safe.

	* src/spamprobe/SpamFilter.cc (getSortedTokens): Removed use of
	  qsort().  Now sorting with std::sort().

	* Removed unnecessary uses of NewArray<T>.  Now it's only used by
	  Buffer<T>.

	* Removed old RCPtr<T> in favor of new Ref<T>.  This affected lots
	  of classes in all modules.

2005-12-02  Brian Burton  <brian@burton-computer.com>

	* Changes below were actually made over the last few weeks but I'm
	  catching up on previous changes that I hadn't added to
	  ChangeLog.

	* Fixed include <ostream> that didn't work with older gcc versions.

	* Added preliminary gif parser support using libungif.  configure
	  attempts to auto-detect libungif and, if present, uses it to
	  extract terms from information about any gifs in the message.  I
	  started with gifs since those seem to be the most common format
	  used in spams.

	* Added -f command line option.  -d option reloads config file.

	* Moved spamprobe app code into its own directory. Added copyright
	  notices to hdl source.

	* Restored deleted lock file code.

	* Added DatabaseConfig.

	* Added FilterConfig

	* Now generates config file if none present.

	* Spamprobe has a config file!

	* Added HDL code with validation.

	* Refactored source code into multiple directories and
	  non-installed libraries for better code structure and
	  organization.

	* Removed broken (never worked right) BNR code.

	* Removed obsolete data conversion utility left over from version
	  0.6 upgrade.

2005-06-23  Brian Burton  <brian@burton-computer.com>

	* SimpleTokenizer.cc (isLetterChar): Fixed broken -8 command line
	  option that was causing 8 bit characters to be treated as word
	  boundaries.

2005-06-22  Brian Burton  <brian@burton-computer.com>

	* Released version 1.2.

2005-03-29  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (cleanup_database): Added ability to specify
	  multiple counts and ages for the cleanup command.  This allows
	  more efficient use of multiple criteria for cleaning the
	  database.

2005-03-28  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_hash.h (class FrequencyDBImpl_hash): Changed
	  default hash file size to 32 megs.

	* FrequencyDB.cc: SP now defaults to using hash data file format
	  if neither PBL nor Berkeley DB are available.
	  (createDB): Now auto-detects database type based on files
	  in database directory if possible.

	* FrequencyDBImpl_null.h (class FrequencyDBImpl_null): Added
	  "null" database instance to avoid null pointer issues when
	  command line arguments are invalid.

	* spamprobe.cc (quick_close): removed code for closing the
	  database since it created a race condition that could corrupt
	  memory and crash out in ::delete.
	  (main): changed usage/version printing again to avoid crashes
	  when invalid command line used with -V option.

2005-03-26  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_hash.cc (initializeHeaderRecords): Added a
	  header record to hash data files to identify file format and
	  version.

2005-03-24  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc: Applied usage message/version reporting fix
	  supplied by Chris Ross.

	* FrequencyDBImpl_split.cc (open): Removed addition of .hash
	  suffix to hash file name.  The suffix is now added automatically
	  by the FrequencyDBImpl_hash class.

	* FrequencyDBImpl_hash.cc: Lots of improvements such as hash
	  collision detection and mitigation (tries next array element).
	  Factored out hash file code into a new class (HashDataFile).
	  Added code to rehash the file whenever cleanup is run.  Hash
	  data file size is now selectable in 1 MB increments instead of
	  the old use of powers of two.  Actual number of elements in hash
	  table is now based on a prime number that yields as close to the
	  target file size as possible.

	* FrequencyDB.cc: Changed hash: db prefix to use a pure hash file
	  instead of a split file.  Added split: prefix for when that is
	  more desirable.
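
	  Illustration for the FrequencyDBImpl_hash.cc entry above.  A
	  minimal sketch, assuming a fixed record size and reading
	  "tries next array element" as linear probing.  The 16 byte
	  record size used in the note below, the Slot layout, and the
	  function names are hypothetical and are not taken from
	  HashDataFile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Trial-division primality test; adequate for a one-off sizing step.
inline bool isPrime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Largest prime number of slots whose records still fit in sizeMB
// megabytes (recordBytes is a hypothetical fixed record size).
inline std::size_t slotsForFileSize(std::size_t sizeMB,
                                    std::size_t recordBytes) {
    assert(recordBytes > 0);
    std::size_t slots = sizeMB * 1024 * 1024 / recordBytes;
    assert(slots >= 2);
    while (!isPrime(slots)) --slots;
    return slots;
}

// Hypothetical in-memory stand-in for one hash file record.
struct Slot {
    bool used;
    unsigned long hash;
    int count;
    Slot() : used(false), hash(0), count(0) {}
};

// Returns the index holding 'hash', or the first free slot reached by
// stepping to the next array element on each collision.  Returns
// table.size() when the table is full and the hash is absent.
inline std::size_t findSlot(const std::vector<Slot> &table,
                            unsigned long hash) {
    const std::size_t n = table.size();
    assert(n > 0);
    const std::size_t start = hash % n;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t idx = (start + i) % n;
        if (!table[idx].used || table[idx].hash == hash) return idx;
    }
    return n;
}
```

	  With a 1 MB target and 16 byte records this yields 65521
	  slots, the largest prime not exceeding 65536.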

2004-11-19  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc: Added verbose mode as a less overwhelming
	  alternative to existing debug mode.  Using -v once triggers
	  verbose mode.  Twice triggers debug mode.
	  (auto_train): Added auto-train command to improve initial
	  training for new users.

2004-11-13  Brian Burton  <brian@burton-computer.com>

	* FrequencyDB.cc (class InterruptTest): Attempted to make shutdown
	  due to user interrupts cleaner by using a guard object in each
	  method that calls the database.  If an interrupt is requested
	  during a database operation it will be noted and an exception
	  thrown after the call completes. Multiple interrupts will fall
	  back to the default signal handler and shut down the process
	  more forcefully.

2004-11-12  Brian Burton  <brian@burton-computer.com>

	* README.txt: Fixed -s command line option doc.

	* spamprobe.1: Fixed -s command line option doc.

	* spamprobe.cc (process_mime_stream): Added support for -Y option
	  to suppress content-length support in mailboxes.

	* MailMessageReader.cc (readMessage): Added support for MIME's
	  Content-Length header as a way of bypassing embedded From_ lines
	  inside of a message.  Only supported in outermost headers since
	  attachment bodies are already delimited using MIME boundaries.

	* MultiLineString.cc (appendToLastLine): Added ability to append
	  to last line in string.

	* IstreamCharReader.cc (createMark): Added ability to mark
	  and return when underlying stream is seekable.

	* spamprobe.cc (process_mime_stream): Added support for MBX file
	  format.  Added support for ignoring From_ line in mbox files.

2004-11-11  Brian Burton  <brian@burton-computer.com>

	* Replaced uses of string::clear() with string::erase().

2004-11-07  Brian Burton  <brian@burton-computer.com>

	* MimeDecoder.cc (decodeHeaderString): Fixed memory bug.

	* Proximity phraser is history.  It never performed well in
	  experiments anyway.

	* All header terms are now prefixed instead of having some that
	  did not receive a prefix.

	* Fully integrated new parser and removed code for old parser.
	  All headers are now run through the MIME decoder since the RFC
	  says the encoding can apply to more than just Subject.

2004-11-01  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_cache.cc: Cache size is now limited to a maximum
	  number of terms and is automatically flushed when the size is
	  exceeded.  Uses LRUPtrCache instead of just a map so that the
	  most recently used terms can be kept in memory instead of being
	  periodically flushed.

2004-10-31  Brian Burton  <brian@burton-computer.com>

	* Added new email parsing implementation based on the experimental
	  C# implementation.  This parser does less byte twiddling and
	  parses most emails in a single pass over the message.  Many of
	  the old parsing related command line options are not yet enabled
	  but the standard processing of mbox files and scoring with basic
	  parameters is working well.

2004-10-14  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (main): Added exec and exec-shared commands.
	  (import_words): modified import command to allow negative values
	  to be specified in the import file.

	* Applied patches for configure.in and aclocal.m4 contributed by
	  Siggy Brentrup for debian compatibility.

2004-04-24  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_pbl.cc: Invokes new WordData methods to allow
	  storing data in big endian format.

	* WordData.h: Added optional support for storing counts/flags
	  in big endian order for data portability.
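
	  Illustration for the WordData.h entry above.  A minimal
	  sketch of storing a 32 bit count in big endian byte order so
	  that the bytes on disk do not depend on the host CPU.  The
	  function names and the 32 bit width are assumptions, not the
	  actual WordData methods.

```cpp
#include <cassert>
#include <cstdint>

// Writes 'value' most significant byte first, regardless of host order.
inline void putBigEndian32(std::uint32_t value, unsigned char *out) {
    assert(out != 0);
    out[0] = static_cast<unsigned char>((value >> 24) & 0xFF);
    out[1] = static_cast<unsigned char>((value >> 16) & 0xFF);
    out[2] = static_cast<unsigned char>((value >> 8) & 0xFF);
    out[3] = static_cast<unsigned char>(value & 0xFF);
}

// Reads four bytes written by putBigEndian32 back into a host integer.
inline std::uint32_t getBigEndian32(const unsigned char *in) {
    assert(in != 0);
    return (static_cast<std::uint32_t>(in[0]) << 24) |
           (static_cast<std::uint32_t>(in[1]) << 16) |
           (static_cast<std::uint32_t>(in[2]) << 8) |
           static_cast<std::uint32_t>(in[3]);
}
```

	  Shifting instead of copying raw memory is what makes the
	  result identical on little endian and big endian hosts.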

2004-02-05  Brian Burton  <brian@burton-computer.com>

	* MimeLineReader.cc (readMBXFileHeader): UW IMAP MBX file format
	  is now auto detected from the first line of the mailbox file.

	* spamprobe.cc (process_extended_options): Removed -o imap-mbx
	  option.

2004-02-04  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (process_extended_options): Added -o imap-mbx
	  option to process files as WU-IMAP MBX files rather than mbox
	  files.

	* MimeLineReader.cc (readLine): Added support for WU-IMAP MBX file
	  format.

2004-02-02  Brian Burton  <brian@burton-computer.com>

	* Released as 0.9h.

2004-01-26  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (process_stream): Added -o tokenized option
	  to allow people to use an external tokenizer with spamprobe.

2004-01-22  Brian Burton  <brian@burton-computer.com>

	* SpamFilter.cc (scoreToken): Reduced sorting overhead by
	  pre-computing an integer sort value with sorting priorities
	  reflected in the value.  This eliminates several calculations
	  inside of the sort routine.

2004-01-21  Brian Burton  <brian@burton-computer.com>

	* SpamFilter.cc (computeRatio): Capped ratios in calculations to
	  within MIN_PROB and MAX_PROB.  Widened that range.  This avoids
	  problems with div/0 and makes it easier to sort terms.
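
	  Illustration for the computeRatio entry above.  A minimal
	  sketch of capping a per-term spam ratio to a fixed range.
	  The bounds shown and the neutral value for unseen terms are
	  assumptions; the actual MIN_PROB and MAX_PROB values are not
	  stated in this ChangeLog.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical bounds, chosen only for illustration.
const double kMinProb = 0.0001;
const double kMaxProb = 0.9999;

// Spam ratio of a term from its spam and good frequencies, capped so
// that it is never exactly 0 or 1.  Later steps that multiply or divide
// by p or (1 - p) therefore never see a zero.
inline double clampedRatio(double spamFreq, double goodFreq) {
    assert(spamFreq >= 0.0 && goodFreq >= 0.0);
    const double total = spamFreq + goodFreq;
    if (total <= 0.0) return 0.5;  // unseen term: neutral (assumption)
    const double ratio = spamFreq / total;
    return std::min(kMaxProb, std::max(kMinProb, ratio));
}
```

	  Because every ratio lands in a closed range, ratios can also
	  be compared and sorted without special cases for 0 and 1.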

2004-01-20  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (dump_words): dump command can now optionally
	  accept a regular expression as an argument and will only dump
	  terms matching the regular expression.
	  (purge_terms): Added purge-terms command to purge from the
	  database all terms matching a regular expression.

2004-01-17  Brian Burton  <brian@burton-computer.com>

	* Released as 0.9g2.

	* spamprobe.cc (main): Fixed bug in command line processing.
	  Thanks to Jem for bug report.

	* Released as 0.9g.

2004-01-16  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (train_on_message): Code simplified.  Eliminated
	  redundant recalculation of scores.
	  (train_on_message): Timestamps are no longer updated by
	  train-spam and train-good commands.  They are still updated by
	  train command.
	  (main): Fixed assertion if -P option is specified in a read only
	  operation.

2004-01-14  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (main): Added -C command line option to allow users
	  to specify their own min word count.

	* SpamFilter.cc (SpamFilter): Set default minimum word count back
	  to 5 (was 3).

	* spamprobe.cc (process_extended_options): Removed "alt-score"
	  from -o options list because it distributes scores poorly.  New
	  formula achieves the same end with better accuracy.  Added
	  "orig-score" option to allow people to continue using the old
	  formula.  Added "honor-xstatus-header" option for people whose
	  mail server uses X-Status: rather than Status: for the deleted
	  flag.
	  (main): Added -l command line option to allow people to set
	  their own spam threshold if they don't like the default value.

	* SpamFilter.cc (scoreMessage): Added a new scoring formula based
	  on Paul's but taking the nth root of spam and good probabilities
	  to produce more evenly distributed scores.  Lowered the spam
	  threshold to 0.6 to keep accuracy about the same as the original
	  formula.  Highest score seen for a ham so far in tests is 0.44
	  so 0.6 seems safe.  Made the new formula the default instead of
	  Paul's.
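
	  Illustration for the scoreMessage entry above.  This is one
	  possible reading of "taking the nth root of spam and good
	  probabilities", sketched next to Paul Graham's original
	  combination for comparison.  It is not the code in
	  SpamFilter.cc, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Combines per-term spam probabilities, each strictly between 0 and 1.
// With useRoot == false this is Graham's combination:
//     S / (S + G), S = product of p, G = product of (1 - p).
// With useRoot == true, S and G are each replaced by their nth root
// (n = number of terms) before combining, which is the assumed reading
// of the ChangeLog entry.
inline double combineScores(const std::vector<double> &probs, bool useRoot) {
    assert(!probs.empty());
    double logSpam = 0.0;
    double logGood = 0.0;
    for (std::size_t i = 0; i < probs.size(); ++i) {
        assert(probs[i] > 0.0 && probs[i] < 1.0);
        logSpam += std::log(probs[i]);        // accumulate in log space
        logGood += std::log(1.0 - probs[i]);  // to avoid underflow
    }
    if (useRoot) {
        const double n = static_cast<double>(probs.size());
        logSpam /= n;  // nth root of the spam product
        logGood /= n;  // nth root of the good product
    }
    const double s = std::exp(logSpam);
    const double g = std::exp(logGood);
    return s / (s + g);
}
```

	  For three terms that each have probability 0.9, the rooted
	  form gives about 0.9 while the original form gives about
	  0.9986, which is the kind of less extreme, more evenly spread
	  score the entry describes.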

2004-01-12  Brian Burton  <brian@burton-computer.com>

	* Released as 0.9f
	
	* spamprobe.cc (set_headers): Added -H+name command line option to
	  allow users to specifically add individual headers to the list
	  of headers to process.
	  (process_extended_options): Added -o option with graham and
	  honor-status-header options.

2004-01-09  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (edit_term): Removed validity check from edit term
	  command since it made it impossible to edit terms from headers.
	  (dump_message_words): Added "tokenize" command to allow a user
	  to see all of the terms in a message and their scores.

	* What follows is a collection of changes not added here as they
	  were made:

	* util.h (num_to_string3): Added function to produce a three digit
	  zero padded number.

	* spamprobe.cc (train_on_message): Added option to have train mode
	  try to keep the spam/good counts balanced to minimize skewing
	  results towards whichever type we've seen the most.

	* SpamFilter.cc (SpamFilter): Improved "extended top terms array"
	  logic to make the minimum distance from mean for the array
	  settable by caller of SpamFilter.  Added ability to set a
	  minimum size for the top terms array.

	* RegularExpression.cc (removeMatch): Added method for removing a
	  matched substring from the text.
	  (replaceMatch): Added method for replacing a matched substring
	  in the text.

	* PhraseBuilder.h (class PhraseBuilder): Added ability to limit
	  the maximum length of a phrase so that the filter can use more
	  words per phrase without filling the database (i.e. min 2 and
	  max 8 words per phrase but limit phrases to max of 20
	  characters).

	* MessageFactory.cc (addIPAddressTerm): Added a new logical term
	  for IP addresses found in a message.
	  (isSuspiciousTag): Added support for processing just
	  "suspicious" HTML tags (suggested by Paul Graham).
	  (processUrls): Added special prefix for terms found in URLs.
	  (addHeadersToMessage): Added support for processing arbitrary
	  headers.

	* Message.cc (getAllTokensCount): Added AllTokensCount property
	  (total within document count of all terms).

	* FrequencyDB.h (class FrequencyDB): Added MessageCount property.

2003-09-10  Brian Burton  <brian@burton-computer.com>

	* Released as 0.9e.
	
	* spamprobe.cc (print_terms): Changed -T output to include overall
	  good/spam database counts of each term.

	* SpamFilter.cc (token_qsort_criterion): Modified token sorting
	  algorithm to improve selection of top terms for scoring.
	  Changes appear to reduce the chances of false positives.  The
	  new criteria are: higher distance from mean to 5 decimal places,
	  higher within document frequency div 3 (to make less selective),
	  less spammy score, higher count in database, and (final tie
	  break) alphabetical.  The wdf div helps to make a small
	  difference in wdf to be less significant.

	* MessageFactory.h (class MessageFactory): Added
	  useProximityPhraser().

	* ProximityPhraseBuilder.h: Added "proximity" phrase builder that
	  stores distances between words instead of phrases themselves.
	  Not nearly as effective as phrases so far.

	* AbstractPhraseBuilder.h: Added abstract super class for
	  PhraseBuilder to allow plugging in different kinds of phrasers.

2003-09-04  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_pbl.cc (sweepOutOldTerms): Changed to commit
	  based on number of records deleted instead of number of records
	  scanned.
	  (getWord): Changed to handle retrieval of current record
	  properly.

2003-09-03  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_pbl.h (class FrequencyDBImpl_pbl): Peter Graf
	  contributed a patch to switch over to using PBL's key files
	  instead of ISAM.  This change cuts disk space usage by a factor
	  of 2 and seems to provide a comparable speed improvement as
	  well.

2003-09-01  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_pbl.cc (beginTransaction): Fixed some broken
	  assertions.

	* spamprobe.cc (train_test): Added train-test command to
	  facilitate testing train mode.  Reads a line at a time from
	  stdin.  Each line contains a message type (spam/good) and a file
	  name.  SP then reads the file and does a train-spam or
	  train-good on the message.  Great for quickly building a
	  database from a lot of known emails using train mode.

	* Released as 0.9d.

	* Fixed configure to remove default -Wno-deprecated.

2003-08-30  Brian Burton  <brian@burton-computer.com>

	* LockFD.h (class LockFD): Changed SHARED to SHARED_LOCK to fix
	  compile problems on solaris 2.6.  Thanks to Cornell Binder for
	  bug report.

	* Released as 0.9c.

	* README.txt: Updated for release 0.9c.

2003-08-29  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_split.cc (open): Modified to be compatible with
	  PBL in place of BDB for btree portion of database.

	* FrequencyDBImpl_cache.cc (flush): Performs all writes to its
	  impl db using a transaction for safety.  Note that the cache
	  itself does not support transactions but only utilizes its
	  impl's support for them (bug?).

	* FrequencyDB.h (class FrequencyDB): Added beginTransaction() and
	  endTransaction() methods for impls that support transaction
	  semantics (currently only PBL). Also added createDB() static
	  method to allow other classes to create impl dbs without knowing
	  what type they are creating.

	* FrequencyDBImpl.h (class FrequencyDBImpl): Added
	  beginTransaction() and endTransaction() empty default
	  implementations.

	* FrequencyDBImpl_pbl.h (class FrequencyDBImpl_pbl): Added support
	  for Peter Graf's PBL (The Program Base Library) ISAM database as
	  an optional replacement for Berkeley DB.  PBL offers transaction
	  semantics without all of the complicated background processing
	  of BDB but none of the locking.  Since SP does its own locking
	  that should be fine.  PBL files appear to be larger than BDB
	  files by a significant margin.  PBL can be downloaded here:
	  http://mission.base.com/peter/source/

	* FrequencyDBImpl_bdb.cc (writeWord): Any word with zero counts
	  can now be deleted on write.  Previously __* terms were kept but
	  that's not really necessary and this will clear out redundant
	  empty digests.

	* spamprobe.cc (quick_close): Fixed potential infinite loop when
	  processing signals.

	* FrequencyDBImpl_bdb.cc: Improved error checking and reporting.
	  Made use of environment a compile time option controlled by
	  --enable-cdb passed to configure at build time.
	  (writeWord): Removed load/compare of existing record to speed
	  up writes to database except when in debug mode.
	  (flush): Added call to db->sync() during flush().

2003-08-23  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (process_test_cases): Added some more test cases.
	  Changed AUTO_PURGE_JUNK_COUNT to 2 instead of 4.

	* SpamFilter.cc (token_qsort_criterion): When selecting top terms
	  now assigns terms to "bands" of roughly 0.005% rather than
	  sorting on raw probability.  This helps to prevent almost
	  equally significant good terms from being overshadowed and
	  excluded by only slightly more significant spam terms and should
	  reduce number of false positives.

	* PhraseBuilder.h (class PhraseBuilder): Dynamically resizes
	  buffer now rather than using a fixed size buffer.  Supports min
	  as well as max number of words in phrases.

	* MimeMessageReader.cc (unquoteText): Now converts _ to space
	  in quoted headers (thanks Junior for bug report!).

	* MimeHeader.cc (getFieldName): Added accessor for field names.

	* MessageFactory.cc (setMinPhraseLength): Phrases can now have a
	  minimum length as well as a maximum length.
	  (addHeadersToMessage): Improved header processing uses prefixes
	  for all headers, not just a subset of them.  Better recognition
	  of ignored headers.
	  (getHeaderPrefix): Creates a prefix for any header with escaping
	  of non alphanumeric characters.
	  (addHeaderToMessage): terms from headers are only stored with
	  prefixes now instead of both prefixed and unprefixed.

2003-08-13  Brian Burton  <brian@burton-computer.com>

	* MimeMessageReader.cc (unquoteText): Added RFC 1522 support for _
	  as space in headers.  Thanks jxz.
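
	  For illustration, a sketch of decoding the "Q" encoded text of
	  an RFC 1522 encoded word, where _ stands for a space and =XX for
	  a hex encoded byte.  Function names are hypothetical; this is
	  not the unquoteText() implementation.

```cpp
#include <string>

// Returns the value of a hex digit, or -1 if c is not a hex digit.
static int q_hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decodes the encoded-text portion of a "Q" encoded word
// (the part between "?Q?" and "?=").  Malformed =XX sequences
// are copied through unchanged.
std::string decode_q_text(const std::string &text) {
    std::string out;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (c == '_') {
            out += ' ';  // underscore represents a space in headers
        } else if (c == '=' && i + 2 < text.size() &&
                   q_hex_value(text[i + 1]) >= 0 &&
                   q_hex_value(text[i + 2]) >= 0) {
            out += static_cast<char>(q_hex_value(text[i + 1]) * 16 +
                                     q_hex_value(text[i + 2]));
            i += 2;  // skip the two hex digits just consumed
        } else {
            out += c;
        }
    }
    return out;
}
```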

2003-08-07  Brian Burton  <brian@burton-computer.com>

	* Released as 0.9b.

	* MessageFactory.cc (addHeadersToMessage): Modified header
	  processing to decode RFC2047 encoded headers.  Thanks to Junior
	  for the suggestion!

	* MimeMessageReader.cc (decodeHeader): Added method for decoding
	  mime encoded headers.

	* FrequencyDBImpl_bdb.cc (open): If berkeley db environment files
	  cannot be opened but the database is running in read only mode
	  we carry on without any environment.  This allows shared
	  database directories to be kept purely read only for users.

	* SpamFilter.cc (lock): locking now removes colon prefixes from
	  database filenames when creating filename for lock file.  This
	  is done by nuking up to the last : so it will break windows
	  paths that include a drive letter.
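
	  A sketch of the prefix stripping described above, showing why a
	  drive letter is lost.  The function name and the example prefix
	  in the usage note are hypothetical.

```cpp
#include <string>

// Removes everything up to and including the last ':' in name.
// Names without a colon are returned unchanged.
std::string strip_colon_prefix(const std::string &name) {
    const std::string::size_type pos = name.rfind(':');
    return pos == std::string::npos ? name : name.substr(pos + 1);
}
```

	  With a hypothetical name such as "hash:/home/u/sp_words" this
	  yields "/home/u/sp_words", while "C:\sp\db" is reduced to
	  "\sp\db", which is the Windows breakage noted in the entry.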

2003-08-02  Brian Burton  <brian@burton-computer.com>

	* Released as 0.9a.

2003-08-01  Brian Burton  <brian@burton-computer.com>

	* Modified FrequencyDBImpl to accept file mode as an argument and
	  use that mode when creating database related files.  This allows
	  shared and private dbs to have different modes.

2003-07-29  Brian Burton  <brian@burton-computer.com>

	* Added rebuilddb to contrib directory.  This script from David
	  A. Lee automatically rebuilds your .spamprobe directory to
	  reclaim any space left unused by berkeley db.

2003-07-28  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_hash.cc (open): Removed obsolete reference to
	  MAP_FILE because it broke compilation on solaris 9 systems.

2003-07-27  Brian Burton  <brian@burton-computer.com>

	* Released as 0.9-dev-6.

2003-07-26  Brian Burton  <brian@burton-computer.com>

	* SpamFilter.cc (lock): Global lock file only used for commands
	  that write to the database.  Using berkeley environment allows
	  reads to coexist safely with writes.

2003-07-25  Brian Burton  <brian@burton-computer.com>

	* FrequencyDB.cc (addWord): addWord() preserves flags if word
	  already in database or sets them to specified value if the word
	  is new.

	* FrequencyDBImpl_cache.cc (close): no longer flushes
	  automatically.  This allows SpamFilter to be closed
	  quickly if necessary.  Caller must now specifically
	  flush() before closing.

	* SpamFilter.cc (close): SpamFilter now can be closed in flush
	  mode or "abandon writes mode" so that cleanup code can avoid
	  writes if the user interrupted the program with ^C or kill.

	* spamprobe.cc (close_on_exit): Added code to close database on
	  exit to ensure that berkeley db gets a chance to remove its
	  locks.  Without this using ^C on one process could cause the
	  next SP process to hang when it tried to write because the
	  killed proc's locks were still in the environment (db_recover
	  could be used to clear them but that's a pain).
	  (import_words): import/export now include flags as well as
	  counts for each word so that timestamps can be preserved.
	  (train_on_message): Increased min message count for training
	  from 500 to 1500 to help ensure sufficient number of messages
	  for people using train from the beginning.

	* SpamFilter.cc (lock): Added locking code to SpamFilter to ensure
	  locks are performed uniformly no matter what database is used.
	  Databases can still perform their own locking if needed.  This
	  solved the weakness of berkeley db's concurrent data store
	  locking when performing read-update-write of terms (lack of
	  write locks while record being updated could cause counts to be
	  incorrect even though database was not corrupted).

	* spamprobe.cc (main): Added -R option to return 0 if message was
	  spam and 1 otherwise.  Based on patch from jxz@uol.com.br.

	* FrequencyDBImpl_dbm.h: Removed locking code.  Locks now
	  at spamprobe.cc level.

2003-07-24  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_bdb.cc (open): removed lock file code and
	  replaced it with use of a berkeley db environment and the
	  berkeley db concurrent data store to provide more concurrency
	  and better compatibility with other berkeley db routines.

	* RegularExpression.cc (class RegularExpressionImpl): Fixed (yet
	  another) regexec() crash bug.  Have to convert 8 bit chars to 7
	  bit before calling regexec() or it might crash on certain
	  sequences of 8 bit characters.
	  
	* Released as 0.9-dev-5 unstable package.

	* spamprobe.cc (main): Temporarily disabled shared (read only)
	  locks in commands that used them as experiment to see if it
	  eliminates database corruption in berkeley db databases.

	* MimeLineReader.cc: Using safe_char() to auto convert non-space
	  control chars to spaces.

2003-06-29  Brian Burton  <brian@burton-computer.com>

	* Added spamprobe-howto.html to contrib directory.  Thanks to
	  Herman Oosthuysen.

	* MessageFactory.cc (assignDigestToMessage): Added getMD5Digest()
	  call to top of function to fix an assertion thrown when messages
	  had digests in their headers.

	* spamprobe.cc (main): Added setlocale() call (thanks to Junior
	  (don't know his name) for the suggestion) to fix tolower()
	  problems with accented characters in eight bit mode.

	* FrequencyDBImpl_bdb.cc (writeWord): optimized writes to berkeley
	  db databases.  Deletes records when their counts were going to
	  be written as zero to make purge 0 unnecessary.
	  (sweepOutOldTerms): Added code to remove MD5 records if they
	  have a count of zero.  Removed code that wrote every record back
	  to the database (left over from mark and sweep days) for better
	  performance.

2003-05-20  Brian Burton  <brian@galileo.burton-computer.com>

	* spamprobe.cc (main): Cleaned up version printing (-V).
	  (main): summarize, find-good, and find-spam now print filename if
	  processing a file instead of stdin.

	* FrequencyDBImpl_bdb.cc (open): Switched back to using a separate
	  lock file for berkeley db databases to avoid a possible
	  race condition.

2003-03-14  Brian Burton  <brian@galileo.burton-computer.com>

	* FrequencyDBImpl_cache.cc: Added feedback about whether a term is
	  from shared db to CacheEntry so that migration can be avoided if
	  counts don't change.  This prevents terms from moving into the
	  private database if their time stamp changed but their counts
	  did not as might happen when running in training mode.

	* FrequencyDBImpl_dual.cc (readWord): Added readWord()
	  implementation that gives a hint about whether or not the counts
	  came from the shared database.

2003-03-11  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_hash.cc (readWord): Fixed return value to allow
	  proper operation with shared database.

2003-03-09  Brian Burton  <brian@burton-computer.com>

	* Message.cc (addToken): Removed duplication when adding prefixed terms.
	  Previously term was added both with and without prefix.  Now only
	  prefixed form is added.

	* spamprobe.cc (main): Fixed -V option.
	  (train_on_message): Added current score test when training.
	  train-spam and train-good now use existing digest if any.

2003-03-08  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_hash.cc (setSize): Changed from using mod prime to
	  a bit mask for computing array indexes.  Hash size can be specified
	  as number of bits in the range 33-63.  Half that number of bits will
	  be used as a mask.  The doubling allows file size to increase by 
	  smaller increments than doublings.  The most reasonable hash values
	  will be in the range 38 (4 MB) - 44 (32 MB).  File sizes in this
	  range are roughly:

	        size    megabytes   terms
	         38        4         512k
	         39        6         768k
	         40        8        1024k
	         41       12        1536k   (default)
	         42       16        2048k
	         43       24        3072k
	         44       32        4096k
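
	  The table's arithmetic can be reproduced with the sketch below.
	  It assumes that size / 2 bits give the base slot count, that an
	  odd size adds half again, and that each slot takes 8 bytes (a
	  figure inferred from the table, not from the source).  Function
	  names are hypothetical and the actual indexing code may differ.

```cpp
#include <cstdint>

// Number of array slots ("terms") implied by a hash size value.
// Assumption: 2^(size/2) slots, times 1.5 when size is odd.
std::uint64_t hash_slots_for_size(unsigned size) {
    std::uint64_t slots = std::uint64_t(1) << (size / 2);
    if (size % 2 == 1) slots += slots / 2;
    return slots;
}

// Approximate file size in megabytes, assuming 8 bytes per slot
// (inferred from 4 MB / 512k terms in the table).
std::uint64_t hash_file_megabytes(unsigned size) {
    const std::uint64_t bytes_per_slot = 8;
    return hash_slots_for_size(size) * bytes_per_slot / (1024 * 1024);
}
```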

	* spamprobe.cc (process_stream): Train mode now updates timestamps
	  of terms in messages that don't need to be classified.  This
	  prevents terms that are actually being used to score messages
	  from expiring.

	* FrequencyDB.cc (touchMessage): Added new method to update timestamp
	  of terms in a message so that train commands can keep terms from
	  expiring.

2003-03-03  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (process_stream): Added train, train-good, and
	  train-spam commands for building database with a minimum number
	  of emails for better performance and less disk usage.

2003-02-28  Brian Burton  <brian@burton-computer.com>

	* MimeMessageReader.cc (readText): Fixed bug that caused rfc822
	  attachments to be treated as text rather than parsed into their
	  own mime parts.  As a result base64 encoded attachments in
	  embedded messages wound up being tokenized rather than decoded
	  or ignored based on their mimetype.

	* spamprobe.cc (main): Added support for combining multiple test
	  cases on the command line.  Added counts command to print out
	  total message counts.

	* SpamFilter.cc: Cleaned up code for computing score to share more
	  code.  Fixed bug that ignored terms with wdf of 1.  Added
	  support for wdf to alt1 scoring method.  Changed spam threshold of
	  alt1 method.

2003-02-26  Brian Burton  <brian@burton-computer.com>

	* MessageFactory.cc (removeHTMLFromText): Added test for each tag
	  to determine if it should add a space in its place.  Previously
	  text like: j<br>u<br>n would be treated as "jun" instead of "j u
	  n".  Prevents words from being combined if only space tags
	  separated them.
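
	  A rough sketch of the idea, not the removeHTMLFromText() code
	  (which uses regular expressions elsewhere in this log).  The
	  list of tags that produce a space is an assumption chosen for
	  illustration.

```cpp
#include <cctype>
#include <set>
#include <string>

// Strips HTML tags.  Tags in space_tags are replaced by one space,
// all other tags are removed without leaving anything behind.
std::string remove_tags(const std::string &html) {
    // Hypothetical set of tags that separate words.
    static const std::set<std::string> space_tags =
        {"br", "p", "div", "li", "td", "tr"};

    std::string out;
    std::string::size_type i = 0;
    while (i < html.size()) {
        if (html[i] != '<') {
            out += html[i++];
            continue;
        }
        const std::string::size_type close = html.find('>', i);
        if (close == std::string::npos) {
            out.append(html, i, std::string::npos);  // unterminated '<'
            break;
        }
        // Extract the lower cased tag name, skipping a leading '/'.
        std::string::size_type j = i + 1;
        if (j < close && html[j] == '/') ++j;
        std::string name;
        while (j < close && std::isalnum(static_cast<unsigned char>(html[j]))) {
            name += static_cast<char>(
                std::tolower(static_cast<unsigned char>(html[j])));
            ++j;
        }
        if (space_tags.count(name)) out += ' ';
        i = close + 1;
    }
    return out;
}
```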

	* configure.in: Added test for mmap.

	* MessageFactory.cc (addWordToMessage): Fixed bug that did not
	  prefix word parts in prefixed headers.

	* spamprobe.cc (main): Added Received header to list of headers
	  stored with a prefix.  First Received header and subsequent ones
	  stored with different prefix to maybe detect falsified received
	  headers.  First Received header should generally be more
	  trustworthy since it comes from your own mail server rather than
	  from the sending mail server or relay.
	  (main): Added From header to list of headers stored with prefix
	  since some spammers seem to use the same from line repeatedly - I
	  guess they think we trust their "brand".

	* MessageFactory.cc (MessageFactory): Changed minimum word length
	  to 1.

	* README.txt: Updated readme for -P option.

	* spamprobe.1 (Content-Length): updated manpage for -P option.

	* spamprobe.cc (main): Added -P command line option to automatically
	  purge terms with total count <= 4 after specified number of messages.

2003-02-07  Brian Burton  <brian@burton-computer.com>

	* Added new FrequencyDBImpl_hash class and made assorted changes
	  to FrequencyDB and other impl classes to support it.  The hash
	  impl uses a fixed size array and Bob Jenkins' hash function to
	  provide an efficient though somewhat inaccurate database.  Based
	  on the database structure in CRM114's mailfilter program.  The
	  impl supports all the semantics of the other impls including
	  cleanup, dump, import, export, etc.

2003-02-06  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_bdb.cc (open): Modified berkeley db
	  implementation to lock the actual database file instead of
	  creating a separate lock file.  This should work much more
	  smoothly with shared databases than the lock file did.  Chose
	  not to use BDB's own locking environment because it seemed hard
	  to get right and prone to lock ups.

	* LockFD.h (class LockFD): Added LockFD class to handle locking
	  an arbitrary file descriptor.

	* LockFile.cc (lock): Changed LockFile to use a LockFD object
	  instead of calling fcntl() directly.

2003-01-30  Brian Burton  <brian@burton-computer.com>

	* Updated version of spamprobe.el from Dave Pearson's web site.

	* Added README-mta-mda-mua.txt graciously contributed by Anto Veldre.

	* MessageFactory.cc (removeHTMLFromText): replaces all whitespace
	  with space characters to avoid weird crash in regex routines on
	  RedHat 8 systems.

2003-01-28  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (main): Added -M option to force a single message
	  per file (ignores content-length and From).

	* Fixed manpage.

	* Changed lock file mode to 0666 instead of 0600 so that shared
	  locks will work better.  TODO: need to eliminate the need for
	  the lock file altogether.

2002-12-29  Brian Burton  <brian@burton-computer.com>

	* MessageFactory.cc (expandCharsInURL): Added decoding of %xx encoded
	  characters in URLs.  Using this to prevent spammers from slipping
	  URLs through unchallenged by encoding them completely as hex.
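
	  A minimal sketch of %xx decoding as described above.  The names
	  are hypothetical and this is not the expandCharsInURL() code;
	  invalid escapes are simply copied through.

```cpp
#include <string>

// Returns the value of a hex digit, or -1 if c is not a hex digit.
static int url_hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replaces each valid %xx sequence in url with the byte it encodes.
std::string expand_percent_escapes(const std::string &url) {
    std::string out;
    for (std::string::size_type i = 0; i < url.size(); ++i) {
        if (url[i] == '%' && i + 2 < url.size() &&
            url_hex_value(url[i + 1]) >= 0 &&
            url_hex_value(url[i + 2]) >= 0) {
            out += static_cast<char>(url_hex_value(url[i + 1]) * 16 +
                                     url_hex_value(url[i + 2]));
            i += 2;  // skip the two hex digits just consumed
        } else {
            out += url[i];
        }
    }
    return out;
}
```

	  After decoding, a URL written entirely as hex escapes yields
	  the same tokens as its plain form.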

2002-12-26  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (main): SpamProbe now stores words and phrases in
	  the to, cc, and subject headers both normally and with a special
	  prefix to improve accuracy since some words are spammier in the
	  subject than in the message body.
	  (main): Added -p option to limit number of words per phrase.

	* Released version 0.8

2002-11-12  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (classify_message): spam and good commands now
	  count the words from messages multiple times if necessary to
	  ensure that they are recalled correctly.  receive command does
	  not do this since its decisions are not as reliable as manual
	  ones.  This is intended to improve overall accuracy by
	  maximizing recall and making it harder for "spams of the future"
	  to slip through the cracks because of their low word counts.

2002-10-28  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (import_words): Fixed broken import command.

2002-10-27  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (process_stream): Added summarize command to print
	  find-good style output for every message whether good or spam.

2002-10-26  Brian Burton  <brian@burton-computer.com>

	* MimeMessageReader.cc (readNextHeader): Uses inexact content
	  length in case the mbox has incorrect content-length values.

	* spamprobe.cc (process_stream): score and receive now print
	  message digest along with the score.
	  (main): all commands except receive look for digest in
	  X-SpamProbe header

	* MessageFactory.cc (assignDigestToMessage): message digest now
	  taken from header if available.

2002-10-24  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (dump_words): flags now show all 8 digits in dump
	  (process_stream): receive mode supports -T option

2002-10-22  Brian Burton  <brian@burton-computer.com>

	* WordData.h (class WordData): Modified database to store a
	  16 bit time stamp (days since August 12, 2002) instead of
	  using sweep count for database cleanup.
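
	  A sketch of computing such a stamp from a calendar date.  The
	  function names and the clamping at both ends of the 16 bit range
	  are assumptions; WordData may handle the epoch differently.

```cpp
#include <cstdint>

// Days since 1970-01-01 for a Gregorian calendar date
// (standard civil-date arithmetic; m is 1..12, d is 1..31).
long days_from_civil(int y, int m, int d) {
    if (m <= 2) --y;  // treat Jan/Feb as months 11/12 of the prior year
    const long era = (y >= 0 ? y : y - 399) / 400;
    const long yoe = y - era * 400;                 // year of era [0, 399]
    const long mp = m > 2 ? m - 3 : m + 9;          // March = 0
    const long doy = (153 * mp + 2) / 5 + d - 1;    // day of year [0, 365]
    const long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

// 16 bit stamp: days since August 12, 2002.  Dates outside the
// representable range are clamped (an assumption for this sketch).
std::uint16_t day_stamp(int y, int m, int d) {
    long days = days_from_civil(y, m, d) - days_from_civil(2002, 8, 12);
    if (days < 0) days = 0;
    if (days > 65535) days = 65535;
    return static_cast<std::uint16_t>(days);
}
```

	  Sixteen bits cover about 179 years of days, so the stamp stays
	  well within range for any realistic database lifetime.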

2002-10-20  Brian Burton  <brian@burton-computer.com>

	* MessageFactory.cc (addTextToMessage): Removed the to_lower() to
	  avoid unnecessary string copying.  Made regexes case insensitive
	  so that it is not needed.

	* spamprobe.cc (import_words): Uses regular expression to parse
	  import lines instead of hard coded logic.

	* configure.in: Added test to verify existence of regex.h on
	  target system.

	* MimeHeader.cc (isFromLine): Uses regular expression to detect
	  From lines instead of the hard coded scans.

	* RegularExpression.h (class RegularExpression): Added
	  RegularExpression class as a front-end for POSIX regular
	  expression library.

	* MessageFactory.cc: Uses regular expressions instead of hardcoded
	  logic to detect html tags in pages and find urls inside of tags.

	* README.txt (including): Added --enable-assert to configure script
	  so that assertions are off by default but can still be enabled
	  for debugging purposes.

2002-10-16  Brian Burton  <brian@burton-computer.com>

	* spamprobe.1: Changed version to just 0.7 so I don't have to keep
	  it up to date constantly.

	* contrib/spamprobe.el: Updated to latest version of spamprobe.el
	  Thanks Dave!

	* spamprobe.cc (main): Added --enable-8bit option to configure
	  script.

2002-10-15  Brian Burton  <brian@burton-computer.com>

	* configure.in (have_database): Moved berkeleydb tests into a
	  common spot.  Added -ldb3 to the list of libraries to check.

	* Switched to autoconf generated Makefile instead of the manual
	  one.  The original makefile is now named Makefile.orig.

	* Moved md5 files out of thirdparty and into the top level
	  directory.

2002-10-14  Brian Burton  <brian@burton-computer.com>

	* countscores.rb (goods): Changed to accommodate change to score
	  output.

	* spamprobe.cc (process_stream): score command prints in same
	  format as receive command to simplify using it in procmailrc.
	  (find_message): Prints subject of message to make output more
	  human understandable.

	* SpamFilter.cc (normalScoreMessage): Fixed NAN bug if inner and
	  outer both approx. 0.  Returns 0.5 in that case to be safe.

	* spamprobe.cc (main): Added command name validation and support
	  for shared locks for read only commands.  Moved database lock
	  acquisition into the FrequencyDBImpl classes.

2002-10-11  Brian Burton  <brian@burton-computer.com>

	* FrequencyDBImpl_dual.h (class FrequencyDBImpl_dual): Added new
	  database impl class that uses a shared read only database and a
	  private read-write one.  Also added -D option to program to
	  allow user to specify the shared db dir and made numerous
	  changes to other classes to put this new option into effect.

	* spamprobe.cc (main): Modified to process multiple mboxes much
	  faster by opening and closing the database only once instead of
	  once per file.

	* Released SpamProbe-0.7d

	* spamprobe.cc (main): Added purge and edit-term commands.

	* FrequencyDBImpl_bdb.cc (sweepOutJunk): Added purge mode.

2002-10-08  Brian Burton  <brian@burton-computer.com>

	* FrequencyDB.h (class FrequencyDB, class FrequencyDBImpl*): added
	  sweepOutJunk method for use by cleanup function.

	* spamprobe.cc (cleanup_database): Added cleanup command to do a
	  mark and sweep database cleanup.

	* WordData.h (class WordData): Promoted WordData to its own
	  class so that the cleanup function could be implemented.

2002-10-06  Brian Burton  <brian@burton-computer.com>

	* MimeMessageReader.cc (getMD5Digest): Replaced sprintf call with
	  hex_digit() util.cc function call.

	* spamprobe.cc (import_words): import and export now use the
	  encode_string and decode_string functions from util.cc to
	  properly handle non-printable characters.
	  (main): Added -X option to rely almost exclusively on terms with
	  distance from mean >= 0.4 and allow word repeats of 5
	  (equivalent to -w 5 -r 5 -x)

2002-10-02  Brian Burton  <brian@burton-computer.com>

	* SpamFilter.cc (computeRatio): Fixed bug that returned word score
	  of 0.5 for messages which had 0 in one count.  Only happened if
	  corresponding message count was also zero.

	* FrequencyDB.cc: Removed uses of message id as database key.

	* spamprobe.cc (dump_words): spamprobe dump now prints word
	  probabilities in addition to counts.
	  (main): Added -x command line option to allow top terms array to
	  extend past size limit if there are more significant terms than
	  can fit.

2002-09-20  Brian Burton  <brian@burton-computer.com>

	* SpamFilter.cc (scoreMessage): Relaxed the maximum top terms
	  array size limit to allow more terms to be used if their
	  distance from the mean is at least 0.4.  This way emails with
	  many good and spam words get a more accurate evaluation since
	  the good terms don't squeeze all of the spammy words out.  Seems
	  to yield a slight improvement on new, difficult spams without
	  increasing false positives.

2002-09-19  Brian Burton  <brian@burton-computer.com>

	* NewPtr.h (class NewPtr): Added NewPtr class to use in place of
	  auto_ptr.  I'd rather follow the standard but some older
	  versions of g++ came with a broken auto_ptr.

	* FrequencyDBImpl_bdb.cc (open): Added #if condition to handle the
	  gratuitous api change made by SleepyCat to the open() function.

2002-09-17  Brian Burton  <brian@burton-computer.com>

	* Released 0.7b with better mbox support, domain name break down,
	  and md5 digests for message identification.

	* spamprobe.cc (import_words): Changed constructor arguments as
	  suggested by Xavier Nodet to work around problem with MSVC.
	  (set_headers): Added -H none command line option to ignore
	  headers when scoring a message.

	* FrequencyDB.cc and lots of other files: Added MD5 digest as
	  unique identifier for emails instead of using message-id.  For
	  now message-id is still used if digest not found but eventually
	  will remove it since digest is better identifier anyway.

2002-09-16  Brian Burton  <brian@burton-computer.com>

	* SpamFilter.cc (scoreMessage): Added code to put top tokens into
	  the Message object while scoring.

	* Message.h (class Message): Added code to store and retrieve top
	  tokens.

	* spamprobe.cc (main): Added -T command line option to print
	  top terms and their score and message count.

	* Tokenizer.cc (is_special_char): Removed ' from special chars
	  since it seemed to hurt accuracy to include it.

2002-09-15  Brian Burton  <brian@burton-computer.com>

	* MimeMessageReader.cc (readText): Content type is now returned
	  with each text block so that it can be added to the token list.

2002-09-13  Brian Burton  <brian@burton-computer.com>

	* MimeHeader.cc (isFromLine): Improved mbox reading logic by
	  incorporating the From line format specification as defined in
	  the qmail mbox man page.  Tried to be a little flexible for
	  flawed variations but still strict enough to not think a
	  sentence starting with From is a new message.

	* MessageFactory.cc (addWordPartsToMessage): Now breaks tokens
	  containing non-alnums into pieces adding each sub-word plus each
	  suffix.  This breaks host names down into their host and domain
	  names.  This seems to improve accuracy.

	* MimeHeader.cc (read): Added extra argument to control whether or
	  not to allow the header to begin with a From_ line.  This fixes
	  a bug causing SP to miss some emails in mboxes if the preceding
	  email was multipart and did not have a terminator.

	* Released 0.7a with receive mode bug fix, solaris ctype functions
	  bug fix, and better tokenizer.

2002-09-12  Brian Burton  <brian@burton-computer.com>

	* Tokenizer.h (class Tokenizer): Changed tokenizing of text to
	  involve less copying.

	* util.h: Added ctype front-end functions to work around problems
	  on solaris with non-ascii chars.

	* MimeLineReader.cc (readLine): Rewrote loop to make it handle
	  lines terminated by only CR as well as LF or CRLF.

2002-09-11  Brian Burton  <brian@burton-computer.com>

	* MessageFactory.cc (addStringToMessage): Fixed bug that dropped 8
	  bit characters when m_replaceNonAsciiChars was false.

	* MimeLineReader.cc (readLine): Converts null bytes into spaces.

	* spamprobe.cc (import_words): Added import command to import
	  terms previously saved using export command.

2002-09-10  Brian Burton  <brian@burton-computer.com>

	* util.cc (is_all_digits): Added is_all_digits.

	* MessageFactory.cc (addWordToMessage): Fixed all digits token
	  removal so that IP addresses are added as tokens.

	* MimeMessageReader.cc (readToBoundary): When reading messages
	  from mboxes now honor content-length fields in the headers
	  unless -Y option was specified.

	* FrequencyDBImpl_cache.h (class FrequencyDBImpl_cache): Added
	  is_dirty flag to cache entries so that values that haven't
	  changed don't get written to the database.

	* spamprobe.cc (main): Added -S command line option to allow
	  messages per cache flush to be controlled from command line.
	  Some other code cleanup as well.

	* FrequencyDBImpl_cache.h (class FrequencyDBImpl_cache): Added a
	  caching proxy frequency db impl class that uses an STL map to
	  cache term counts to reduce disk i/o at the expense of more cpu
	  time and memory usage.

	* FrequencyDBImpl.h (class FrequencyDBImpl): Added an abstract
	  base class for frequency db impls so that I could have a caching
	  proxy.

	* SpamFilter.cc (token_qsort_criterion): Fixed incorrect sort
	  order that put spammy words ahead of good words in the tie
	  breaker.  Also imposed a limit on the term count when sorting
	  since counts above a certain number become basically identical.

	* FrequencyDB.h (class FrequencyDB): Modified to use an
	  implementation class for all database access.  This will make it
	  easier to plug in new ones later.

	* FrequencyDBImpl_bdb.h (class FrequencyDBImpl): Added berkeley db
	  based implementation class to isolate the rest of the code from
	  the choice of database.  This version uses btree files instead
	  of hash for better performance, smaller file sizes, and sorted
	  output during traversals.

	* FrequencyDBImpl_dbm.h (class FrequencyDBImpl): Added dbm based
	  implementation class to isolate the rest of the code from the
	  choice of database.

	* MessageFactory.cc (addWordToMessage): Fixed bug that allowed all
	  digit tokens to slip in.

2002-09-07  Brian Burton  <brian@burton-computer.com>

	* contrib/README-maildrop.txt: Added Matthias Andree's maildrop
	  howto to the contrib directory.

	* MimeLineReader.cc (readLine): Fixed to properly handle null
	  bytes in lines.  Not that those are valid but buggy mailers
	  sometimes embed them.

2002-09-06  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc: added two new commands: dump and export.

	* SpamFilter.cc: Fixed a memory leak in scoreToken().  Converted
	  to use only a single FrequencyDB now.  Added an accessor to
	  allow clients to get access to the db.  Changed comparisons to
	  zero to allow for inexact floating point differences.

	* FrequencyDB.cc: FrequencyDB modified to store both spam and good
	  word counts for each word in a single dbm file.  Added a pair of
	  traversal functions for the export command.

	* util.h: Moved iostream inclusion into util.h.  Also added cctype
	  include there at Matthias Andree's suggestion for better gcc 3
	  compatibility.

2002-09-05  Brian Burton  <brian@burton-computer.com>

	* FrequencyDB.h: Added hooks for switching to berkeley db in ndbm
	  compatibility mode.  GDBM does not scale well for large
	  databases.  Will continue to use GDBM until 0.7 but will switch
	  over at that release.

	* util.h: Added using namespace std to avoid problems on modern
	  C++ compilers.  Thanks to Matthias Andree for bug report.

	* README.txt: Put in fix to procmail recipe.  Thanks to Steven
	  Grimm for bug report.

	* FrequencyDB.cc (removeMessage): Will not attempt to remove a
	  message which has no message id.
	  (addMessage): Will not attempt to add a message which has no
	  message id.

	* MessageFactory.cc (initMessage): Fixed bug 604808 which caused
	  messages with no message id to not have their bodies read.
	  Thanks to Steven Grimm for bug report.

	* MimeMessageReader.cc (readText): Fixed potential bug which could
	  incorrectly skip part of message body for non-multipart
	  messages.

	* MessageFactory.cc (addWordToMessage): Allows leading $ in
	  tokens.  Ignores tokens consisting entirely of digits.

2002-09-03  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (set_headers): Removed obsolete test cases.
	  (process_stream): Added find-spam and find-good commands.

2002-09-02  Brian Burton  <brian@burton-computer.com>

	* Changed the uses of string::find() == 0 to use a new inline
	  function that uses strncmp() for greater efficiency.

	* SpamFilter.cc (SpamFilter): Changed default scoring params to
	  use top 27 words and max of 2 repeats.  Found this to be a good
	  option based on test runs with sample corpus.

	* spamprobe.cc (set_headers): Added -H command line option to
	  control which headers are parsed to find tokens.

2002-08-30  Brian Burton  <brian@burton-computer.com>

	* spamprobe.cc (main): Added -h command line option to retain html
	  tags when generating tokens.

	* MessageFactory.cc (expandEntitiesInHtml): When not removing html
	  we still expand any entities in the html.

	* SpamFilter.cc (token_qsort_criterion): Modified token sort
	  criteria to favor good words over spammy ones if their distance
	  from mean and counts are equal.

	* spamprobe.cc (process_test_case): Removed tune_1 test case.

	* MessageFactory.cc (MessageFactory): Changed default settings for
	  better spam detection.

	* SpamFilter.cc (SpamFilter): Changed default settings for better
	  spam detection.

	* MessageFactory.cc (initMessage): Scoring additional headers.
	  Scoring subject header twice for extra emphasis.

	* MessageFactory.h: Made the scoring parameters setting member
	  variables.

	* SpamFilter.h: Made the scoring parameters setting member variables.

	* MimeMessageReader.cc (readText): skips junk at end of message in
	  mime multipart messages.  Previously became confused by the
	  extra junk and generated spurious scores for some messages which
	  threw off accuracy.

	* Misc: Added scripts for testing accuracy of results.