Copyright (C) 2002-2009 Python Software Foundation; All Rights Reserved
The Python Software Foundation (PSF) holds copyright on all material
in this project. You may use it under the terms of the PSF license;
see LICENSE.txt.
Assorted clues.
What's Here?
============
Lots of mondo cool partially documented code. What else could there be <wink>?
The focus of this project so far has not been to produce the fastest or
smallest filters, but to set up a flexible pure-Python implementation
for doing algorithm research. Lots of people are making fast/small
implementations, and it takes an entirely different kind of effort to
make genuine algorithm improvements. I think we've done quite well at
that so far. The focus of this codebase may change to small/fast
later -- as is, the false positive rate has gotten too small to measure
reliably across test sets with 4000 hams + 2750 spams, and the f-n rate
has also gotten too small to measure reliably across that much training data.
The code in this project requires Python 2.2 (or later).
You should definitely check out the FAQ:
http://spambayes.org/faq.html
Getting Source Code
===================
The SpamBayes project source code is hosted at SourceForge
(http://spambayes.sourceforge.net/). Access is via Subversion.
Running Unit Tests
==================
SpamBayes has a currently incomplete set of unit tests, not all of which
pass, due, in part, to bit rot. We are working on getting the unit tests to
run using the `nose <http://somethingaboutorange.com/mrl/projects/nose/>`_
package. After downloading and installing nose, you can run the current
unit tests on Unix-like systems like so from the SpamBayes top-level
directory::
TMPDIR=/tmp BAYESCUSTOMIZE= nosetests -v . 2>&1 \
| sed -e "s:$(pwd)/::" \
-e "s:$(python -c 'import sys ; print sys.exec_prefix')/::" \
| tee failing-unit-tests.txt
The file failing-unit-tests.txt, generated with Python from Subversion
(currently 2.7a0), is checked into the Subversion repository at the top
level. You can
look at it for any failing unit tests and work to get them passing, or write
new tests.
Primary Core Files
==================
Options.py
Uses ConfigParser to allow fiddling various aspects of the classifier,
tokenizer, and test drivers. Create a file named bayescustomize.ini to
alter the defaults. Modules wishing to control aspects of their
operation merely do
from Options import options
near the start, and consult attributes of options. To see what options
are available, import Options.py and do
print Options.options.display_full()
This will print out a detailed description of each option, the allowed
values, and so on. (You can pass in a section or section and option
name to display_full if you don't want the whole list).
As an alternative to bayescustomize.ini, you can set the environment
variable BAYESCUSTOMIZE to a list of one or more .ini files; these will
be read in, in order, and applied to the options. This allows you to
tweak individual runs by combining fragments of .ini files. The
character used to separate different .ini files is platform-dependent.
On Unix, Linux and Mac OS X systems it is ':'. On Windows it is ';'.
On Mac OS 9 and earlier systems it is a NL character.
classifier.py
The classifier, which is the soul of the method.
tokenizer.py
An implementation of tokenize() that Tim can't seem to help but keep
working on <wink>. Generates a token stream from a message, which
the classifier trains on or predicts against.
chi2.py
A collection of statistics functions.
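The BAYESCUSTOMIZE handling described under Options.py can be sketched roughly as follows. This is not the code in Options.py; the helper names split_customize_paths and load_overrides are made up for illustration, and the sketch is written for current Python 3 (where the module is spelled configparser). It only shows the idea of splitting on the platform's path separator and letting later files override earlier ones.

```python
import configparser
import os


def split_customize_paths(value, sep=os.pathsep):
    """Split a BAYESCUSTOMIZE-style value into a list of .ini paths.

    os.pathsep is ':' on Unix-like systems and ';' on Windows.
    Empty entries are dropped.
    """
    return [part for part in value.split(sep) if part]


def load_overrides(paths):
    """Read the .ini files in order; later files override earlier ones."""
    parser = configparser.ConfigParser()
    for path in paths:
        # read() silently skips files that do not exist.
        parser.read(path)
    return parser
```

Something like load_overrides(split_customize_paths(os.environ.get("BAYESCUSTOMIZE", ""))) would then give a parser holding the combined fragments.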
Apps
====
sb_filter.py
A simpler hammie front-end that doesn't print anything. Useful for
procmail filtering and scoring from your MUA.
sb_mboxtrain.py
Trainer for Maildir, MH, or mbox mailboxes. Remembers which
messages it saw the last time you ran it, and will only train on new
messages or messages which should be retrained.
The idea is to run this automatically every night on your Inbox and
Spam folders, and then sort misclassified messages by hand. This
will work with any IMAP4 mail client, or any client running on the
server.
sb_server.py
A spam-classifying POP3 proxy. It adds a spam-judgment header to
each mail as it's retrieved, so you can use your email client's
filters to deal with them without needing to fiddle with your email
delivery system.
Also acts as a web server providing a user interface that allows you
to train the classifier, classify messages interactively, and query
the token database. This piece may at some point be split out into
a separate module.
If the appropriate options are set, also serves a message training
SMTP proxy. It sits between your email client and your SMTP server
and intercepts mail to set ham and spam addresses.
All other mail is simply passed through to the SMTP server.
sb_mailsort.py
A delivery agent that uses a CDB of word probabilities and delivers
a message to one of two Maildir message folders, depending on the
classifier score. Note that both Maildirs must be on the same
device.
sb_xmlrpcserver.py
A stab at making hammie into a client/server model, using XML-RPC.
sb_client.py
A client for sb_xmlrpcserver.py.
sb_imapfilter.py
A spam-classifying and training application for use with IMAP servers.
You can specify folders that contain mail to train as ham/spam, and
folders that contain mail to classify, and the filter will do so.
Test Driver Core
================
Tester.py
A test-driver class that feeds streams of msgs to a classifier
instance, and keeps track of right/wrong percentages and lists
of false positives and false negatives.
TestDriver.py
A flexible higher layer of test helpers, building on Tester above.
For example, it's usable for building simple test drivers, NxN test
grids, and N-fold cross-validation drivers. See also rates.py,
cmp.py, and table.py below.
msgs.py
Some simple classes to wrap raw msgs, and to produce streams of
msgs. The test drivers use these.
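To make the bookkeeping that Tester.py is described as doing a little more concrete, here is a minimal sketch of a tally of right/wrong percentages plus lists of false positives and false negatives. The class and method names (TallySketch, record, fp_percent, fn_percent) are hypothetical and are not Tester.py's API.

```python
class TallySketch:
    """Counts predictions and remembers which messages were misjudged."""

    def __init__(self):
        self.nham_tested = 0
        self.nspam_tested = 0
        self.false_positives = []  # hams that were called spam
        self.false_negatives = []  # spams that were called ham

    def record(self, msg_id, is_spam, predicted_spam):
        """Record one prediction for the message identified by msg_id."""
        if is_spam:
            self.nspam_tested += 1
            if not predicted_spam:
                self.false_negatives.append(msg_id)
        else:
            self.nham_tested += 1
            if predicted_spam:
                self.false_positives.append(msg_id)

    def fp_percent(self):
        """False positives as a percentage of the hams tested."""
        if not self.nham_tested:
            return 0.0
        return 100.0 * len(self.false_positives) / self.nham_tested

    def fn_percent(self):
        """False negatives as a percentage of the spams tested."""
        if not self.nspam_tested:
            return 0.0
        return 100.0 * len(self.false_negatives) / self.nspam_tested
```

With one false positive among 457 hams, fp_percent() rounds to 0.219, the same shape of figure that appears in the cmp.py example later in this file.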
Concrete Test Drivers
=====================
mboxtest.py
A concrete test driver like timtest.py, but working with a pair of
mailbox files rather than the specialized timtest setup.
timcv.py
An N-fold cross-validating test driver. Assumes "a standard" data
directory setup (see below) rather than the specialized mboxtest
setup.
N classifiers are built.
1 run is done with each classifier.
Each classifier is trained on N-1 sets, and predicts against the sole
remaining set (the set not used to train the classifier).
mboxtest does the same.
This (or mboxtest) is the preferred way to test when possible: it
makes best use of limited data, and interpreting results is
straightforward.
timtest.py
A concrete test driver like mboxtest.py, but working with "a standard"
test data setup (see below). This runs an NxN test grid, skipping
the diagonal.
N classifiers are built.
N-1 runs are done with each classifier.
Each classifier is trained on 1 set, and predicts against each of
the N-1 remaining sets (those not used to train the classifier).
This is a much harder test than timcv, because it trains on N-1 times
less data, and makes each classifier predict against N-1 times
more data than it's been taught about.
It's harder to interpret the results of timtest (than timcv) correctly,
because each msg is predicted against N-1 times overall. So, e.g.,
one terribly difficult spam or ham can count against you N-1 times.
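The difference between the timcv and timtest schemes is easiest to see by listing which sets are trained on and which are predicted against in each run. The sketch below does only that; it is not taken from either driver, and the function names cross_validation_plan and grid_plan are invented here.

```python
def cross_validation_plan(n):
    """timcv-style plan: n runs, each training on n-1 sets and
    predicting against the one set left out."""
    sets = range(1, n + 1)
    return [([s for s in sets if s != held_out], [held_out])
            for held_out in sets]


def grid_plan(n):
    """timtest-style plan: an n x n grid without the diagonal. Each
    classifier trains on one set and predicts against each other set."""
    sets = range(1, n + 1)
    return [([train], [test])
            for train in sets
            for test in sets
            if test != train]
```

For n = 10, cross_validation_plan gives 10 (train, test) pairs and grid_plan gives 90, and in the grid every set shows up on the test side n-1 = 9 times, which is why one nasty message can count against you 9 times.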
Test Utilities
==============
rates.py
Scans the output (so far) produced by TestDriver.Drive(), and captures
summary statistics.
cmp.py
Given two summary files produced by rates.py, displays an account
of all the f-p and f-n rates side-by-side, along with who won which
(etc), the change in total # of unique false positives and negatives,
and the change in average f-p and f-n rates.
table.py
Summarizes the high-order bits from any number of summary files,
in a compact table.
fpfn.py
Given one or more TestDriver output files, prints a list of false
positive and false negative filenames, one per line.
Test Data Utilities
===================
cleanarch.py
A script to repair mbox archives by finding "Unix From" lines that
should have been escaped, and escaping them.
unheader.py
A script to remove unwanted headers from an mbox file. This is mostly
useful to delete headers which incorrectly might bias the results.
In default mode, this is similar to 'spamassassin -d', but much, much
faster.
loosecksum.py
A script to calculate a "loose" checksum for a message. See the text of
the script for an operational definition of "loose".
rebal.py
Evens out the number of messages in "standard" test data folders (see
below). Needs generalization (e.g., Ham and 4000 are hardcoded now).
mboxcount.py
Count the number of messages (both parseable and unparseable) in
mbox archives.
split.py
splitn.py
Split an mbox into random pieces in various ways. Tim recommends
using "the standard" test data set up instead (see below).
splitndirs.py
Like splitn.py (above), but splits an mbox into one message per file in
"the standard" directory structure (see below). This does an
approximate split; rebal.py (above) can be used afterwards to even out
the number of messages per folder.
runtest.sh
A Bourne shell script (for Unix) which will run some test or other.
I (Neale) will try to keep this updated to test whatever Tim is
currently asking for. The idea is, if you have a standard directory
structure (below), you can run this thing, go have some tea while it
works, then paste the output to the SpamBayes list for good karma.
Standard Test Data Setup
========================
Barry gave Tim mboxes, but the spam corpus he got off the web had one spam
per file, and it only took two days of extreme pain to realize that one msg
per file is enormously easier to work with when testing: you want to split
these at random into random collections, you may need to replace some at
random when testing reveals spam mistakenly called ham (and vice versa),
etc -- even pasting examples into email is much easier when it's one msg
per file (and the test drivers make it easy to print a msg's file path).
The directory structure under my spambayes directory looks like so:
    Data/
        Spam/
            Set1/ (contains 1375 spam .txt files)
            Set2/            ""
            Set3/            ""
            Set4/            ""
            Set5/            ""
            Set6/            ""
            Set7/            ""
            Set8/            ""
            Set9/            ""
            Set10/           ""
            reservoir/ (contains "backup spam")
        Ham/
            Set1/ (contains 2000 ham .txt files)
            Set2/            ""
            Set3/            ""
            Set4/            ""
            Set5/            ""
            Set6/            ""
            Set7/            ""
            Set8/            ""
            Set9/            ""
            Set10/           ""
            reservoir/ (contains "backup ham")
Every file at the deepest level is used (not just files with .txt
extensions). The files don't need to have a "Unix From"
header before the RFC-822 message (i.e. a line of the form "From
<address> <date>").
If you use the same names and structure, huge mounds of the tedious testing
code will work as-is. The more Set directories the merrier, although you
want at least a few hundred messages in each one. The "reservoir"
directories contain a few thousand other random hams and spams. When a ham
is found that's really spam, move it into a spam directory, then use the
rebal.py utility to rebalance the Set directories moving random message(s)
into and/or out of the reservoir directories. The reverse works as well
(finding ham in your spam directories).
The hams are 20,000 msgs selected at random from a python-list archive.
The spams are essentially all of Bruce Guenter's 2002 spam archive:
<http://www.em.ca/~bruceg/spam/>
The sets are grouped into pairs in the obvious way: Spam/Set1 with
Ham/Set1, and so on. For each such pair, timtest trains a classifier on
that pair, then runs predictions on each of the other pairs. In effect,
it's an NxN test grid, skipping the diagonal. There's no particular reason
to avoid predicting against the same set trained on, except that it
takes more time and seems the least interesting thing to try.
Later, support for N-fold cross validation testing was added, which allows
more accurate measurement of error rates with smaller amounts of training
data. That's recommended now. timcv.py is to cross-validation testing
as the older timtest.py is to grid testing. timcv.py has grown additional
arguments to allow using only a random subset of messages in each Set.
CAUTION: The partitioning of your corpora across directories should
be random. If it isn't, bias creeps in to the test results. This is
usually screamingly obvious under the NxN grid method (rates vary by a
factor of 10 or more across training sets, and even within runs against
a single training set), but harder to spot using N-fold c-v.
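One simple way to get a random partition is to shuffle the file names and then deal them out round-robin. The sketch below shows just that idea; it is not what splitndirs.py or rebal.py actually do, and assign_to_sets is a hypothetical name.

```python
import random


def assign_to_sets(filenames, nsets, seed=None):
    """Randomly deal filenames into sets numbered 1..nsets.

    Shuffling first removes any ordering in the input (by date, by
    sender, ...); dealing round-robin keeps set sizes within one of
    each other.
    """
    rng = random.Random(seed)
    shuffled = list(filenames)
    rng.shuffle(shuffled)
    sets = {number: [] for number in range(1, nsets + 1)}
    for index, name in enumerate(shuffled):
        sets[index % nsets + 1].append(name)
    return sets
```

Each key would correspond to one SetN directory; passing a seed makes the split repeatable.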
Testing a change and posting the results
========================================
(Adapted from clues Tim posted on the spambayes and spambayes-dev lists)
First, set up your data as above; it's really not worth the hassle to
come up with a different scheme. If you use the Outlook plug-in, the
export.py script in the Outlook2000 directory will export all the spam
and ham in your 'training' folders for you into this format (or close
enough).
Basically the idea is that you should have 10 sets of data, each with
200 to 500 messages in them. Obviously if you're testing something to
do with the size of a corpus, you'll want to change that. You then want
to run
timcv.py -n 10 > std.txt
(call std.txt whatever you like), and then
rates.py std.txt
You end up with two files, std.txt, which has the raw results, and stds.txt,
which has more of a summary of the results.
Now make the change to the code or options, and repeat the process,
giving the files different names (note that rates.py will automatically
choose the name for the output file, based on the input one).
You've now got the data you need, but you have to interpret it. The
simplest way of all is just to post it to spambayes-dev@python.org and let
someone else do it for you <wink>. The data you should post is the output of
cmp.py stds.txt alts.txt
along with the output of
table.py stds.txt alts.txt
(note that these just print to stdout).
Other information you can find in the 'raw' output (std.txt, above) includes
histograms of the ham/spam spread, and a copy of the options settings.
Interpreting cmp.py output
--------------------------
(Using an example from Tim on spambayes-dev)
> cv_octs.txt -> cv_oct_subjs.txt
> -> <stat> tested 488 hams & 897 spams against 1824 hams & 3501 spams
> -> <stat> tested 462 hams & 863 spams against 1850 hams & 3535 spams
> -> <stat> tested 475 hams & 863 spams against 1837 hams & 3535 spams
> -> <stat> tested 430 hams & 887 spams against 1882 hams & 3511 spams
> -> <stat> tested 457 hams & 888 spams against 1855 hams & 3510 spams
> -> <stat> tested 488 hams & 897 spams against 1824 hams & 3501 spams
> -> <stat> tested 462 hams & 863 spams against 1850 hams & 3535 spams
> -> <stat> tested 475 hams & 863 spams against 1837 hams & 3535 spams
> -> <stat> tested 430 hams & 887 spams against 1882 hams & 3511 spams
> -> <stat> tested 457 hams & 888 spams against 1855 hams & 3510 spams
>
> false positive percentages
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.219 0.219 tied
>
> won 0 times
> tied 5 times
> lost 0 times
So all 5 runs tied on FP. That tells us much more than that the *net*
effect across 5 runs was nil on FP: it tells us that there are no hidden
glitches hiding behind that "net nothing" -- it was no change across the board.
> total unique fp went from 1 to 1 tied
> mean fp % went from 0.0437636761488 to 0.0437636761488 tied
>
> false negative percentages
> 2.007 2.007 tied
> 1.390 1.390 tied
> 1.622 1.622 tied
> 2.029 1.917 won -5.52%
> 2.703 2.477 won -8.36%
>
> won 2 times
> tied 3 times
> lost 0 times
When evaluating a small change, I'm heartened to see that in no run did it lose.
At worst it tied, and twice it helped a little. That's encouraging.
What the histograms would tell us that we can't tell from this is whether you
could have done just as well without the change by raising your ham cutoff a little.
That would also tie on FP, and *may* also get rid of the same number (or even
more) of FN.
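The percentages and the won/tied/lost tally in the false negative table above can be reproduced with a few lines of arithmetic. This is a sketch of the calculation, not cmp.py's code; percent_change and verdict are names made up here, and "won" is taken to mean the error rate went down.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    if before == 0:
        return 0.0 if after == 0 else float("inf")
    return 100.0 * (after - before) / before


def verdict(before, after):
    """For error rates, lower is better."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


# (before, after) false negative percentages from the example above.
fn_rates = [(2.007, 2.007), (1.390, 1.390), (1.622, 1.622),
            (2.029, 1.917), (2.703, 2.477)]

tally = {"won": 0, "tied": 0, "lost": 0}
for before, after in fn_rates:
    tally[verdict(before, after)] += 1

print(tally)                                   # {'won': 2, 'tied': 3, 'lost': 0}
print(round(percent_change(2.029, 1.917), 2))  # -5.52
print(round(percent_change(2.703, 2.477), 2))  # -8.36
```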
> total unique fn went from 86 to 83 won -3.49%
> mean fn % went from 1.95029003772 to 1.88269707836 won -3.47%
>
> ham mean ham sdev
> 0.57 0.58 +1.75% 4.63 4.77 +3.02%
> 0.08 0.07 -12.50% 1.20 1.01 -15.83%
> 0.36 0.29 -19.44% 3.61 3.23 -10.53%
> 0.08 0.11 +37.50% 0.89 1.18 +32.58%
> 0.72 0.76 +5.56% 6.80 7.06 +3.82%
>
> ham mean and sdev for all runs
> 0.37 0.37 +0.00% 4.10 4.16 +1.46%
That's a good example of grand averages hiding the truth: the averaged change
in the mean ham score was 0 across all 5 runs, but *within* the 5 runs it slobbered
around wildly, from decreasing 20% to increasing 40%(!).
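The same point can be checked by hand from the five ham mean pairs above. The sketch below uses a plain unweighted mean over the runs, which is an assumption made for simplicity and need not match exactly how the "for all runs" line is computed.

```python
# (before, after) ham means for the five runs shown above.
ham_means = [(0.57, 0.58), (0.08, 0.07), (0.36, 0.29),
             (0.08, 0.11), (0.72, 0.76)]


def mean(values):
    """Plain unweighted arithmetic mean."""
    return sum(values) / len(values)


before_avg = mean([before for before, _after in ham_means])
after_avg = mean([after for _before, after in ham_means])

# Per-run relative change in percent.
per_run = [100.0 * (after - before) / before for before, after in ham_means]

print(round(before_avg, 3), round(after_avg, 3))         # 0.362 0.362
print(round(min(per_run), 2), round(max(per_run), 2))    # -19.44 37.5
```

The averages agree to three places while the individual runs move by anything from about -19% to +37%.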
> spam mean spam sdev
> 96.43 96.44 +0.01% 15.89 15.89 +0.00%
> 97.01 97.07 +0.06% 13.79 13.70 -0.65%
> 97.14 97.16 +0.02% 14.05 14.02 -0.21%
> 96.52 96.56 +0.04% 15.65 15.52 -0.83%
> 95.53 95.63 +0.10% 17.47 17.31 -0.92%
>
> spam mean and sdev for all runs
> 96.52 96.57 +0.05% 15.46 15.37 -0.58%
That's good to see: it's a consistent win for spam scores across runs,
although an almost imperceptible one. It's good when the mean spam score rises,
and it's good when sdev (for ham or spam) decreases.
> ham/spam mean difference: 96.15 96.20 +0.05
This is a slight win for the change, although seeing the details gives cause
to worry some about the effect on ham: the ham sdev increased overall, and
the effects on ham mean and ham sdev varied wildly across runs. OTOH, the
"before" numbers for ham mean and ham sdev varied wildly across runs already.
That gives cause to worry some about the data <wink>.
Making a source release
=======================
Source releases are built with distutils. Here's how I (Richie) have been
building them. I do this on a Windows box, partly so that the zip release
can have Windows line endings without needing to run a conversion script.
I don't think that's actually necessary, because everything would work on
Windows even with Unix line endings, but you couldn't load the files into
Notepad and sometimes it's convenient to do so. End users might not even
have any other text editor, so it makes things like the README unREADable.
8-)
Anthony would rather eat live worms than try to get a sane environment
on Windows, so his approach to building the zip file is at the end.
o If any new file types have been added since last time (eg. 1.0a5 went
out without the Windows .rc and .h files) then add them to MANIFEST.in.
If there are any new scripts or packages, add them to setup.py. Test
these changes (by building source packages according to the instructions
below) then commit your edits.
o Check out the 'spambayes' module twice, once with Windows line endings
and once with Unix line endings (I use WinCVS for this, using "Admin /
Preferences / Globals / Checkout text files with the Unix LF". If you
use TortoiseCVS, like Tony, then the option is on the Options tab in
the checkout dialog).
o Change spambayes/__init__.py to contain the new version number but don't
commit it yet, just in case something goes wrong.
o Note that if you cheated above, and used an existing checkout, you need
to ensure that you don't have extra files in there. For example, if you
have a few thousand email messages in testtools/Data, setup.py will take
a *very* long time.
o In the Windows checkout, run "python setup.py sdist --formats zip"
o In the Unix checkout, run "python setup.py sdist --formats gztar"
o Take the resulting spambayes-1.0a5.zip and spambayes-1.0a5.tar.gz, and
test the former on Windows (ideally in a freshly-installed Python
environment; I keep a VMWare snapshot of a clean Windows installation
for this, but that's probably overkill 8-) and test the latter on Unix
(a Debian VMWare box in my case).
o If you can, rename these with "rc" at the end, and make them available
to the spambayes-dev crowd as release candidates. If all is OK, then
fix the names (or redo this) and keep going.
o Dance the SourceForge release dance:
http://sourceforge.net/docman/display_doc.php?docid=6445&group_id=1#filereleasesteps
When it comes to the "what's new" and the ChangeLog, I cut'n'paste the
relevant pieces of WHAT_IS_NEW.txt and CHANGELOG.txt into the form, and
check the "Keep my preformatted text" checkbox.
o Now commit spambayes/__init__.py and tag the whole checkout - see the
existing tag names for the tag name format.
o In either checkout, run "python setup.py register" to register the new
version with PyPI.
o Update download.ht with checksums, links, and sizes for the files.
From release 1.1 doing a "setup.py sdist" will generate checksums
and sizes for you, and print out the results to stdout.
o Create OpenPGP/PGP signatures for the files. Using GnuPG:
% gpg -sab spambayes-1.0.1.zip
% gpg -sab spambayes-1.0.1.tar.gz
% gpg -sab spambayes-1.0.1.exe
Put the created *.asc files in the "sigs" directory of the website.
(Note that when you update the website, you will need to manually ssh
to shell1.sourceforge.net and chmod these files so that people can
access them.)
o If your public key isn't already linked to on the Download page, put
it there.
o Update the website News, Download and Windows sections.
o Update reply.txt in the website repository as needed (it specifies the
latest version). Then let Tim, Barry, Tony, or Skip know that they need
to update the autoresponder.
o Run "make install version" in the website directory to push the new
version file, so that "Check for new version" works.
o Add '+' to the end of spambayes/__init__.py's __version__, to
differentiate CVS users, and check this change in. After a number of
changes have been checked in, this can be incremented and have "a0"
added to the end. For example, with a 1.1 release:
[before the release process] '1.1rc1'
[during the release process] '1.1'
[after the release process] '1.1+'
[later] '1.2a0'
Then announce the release on the mailing lists and watch the bug reports
roll in. 8-)
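For the download.ht step above, checksums and sizes for a release file can be gathered with the standard library alone, which is handy for releases older than 1.1. This is only a sketch and not what setup.py does; describe_release_file is a hypothetical helper, and since this file does not say which checksum algorithm the download page uses, the sha256 default is an assumption you should change to match the page.

```python
import hashlib
import os


def describe_release_file(path, algorithm="sha256"):
    """Return the name, size in bytes, and hex checksum of one file.

    The algorithm name is passed to hashlib.new(); "sha256" is an
    assumed default, not something this project prescribes.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large installers need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return {
        "name": os.path.basename(path),
        "size": os.path.getsize(path),
        "checksum": digest.hexdigest(),
    }
```

Calling it once each for the .zip, .tar.gz and .exe files gives the values to paste into the page.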
Anthony's Alternate Approach to Building the Zipfile
o Unpack the tarball somewhere, making a spambayes-1.0a7 directory
(version number will obviously change in future releases)
o Run the following two commands:
find spambayes-1.0a7 -type f -name '*.txt' | xargs zip -l sb107.zip
find spambayes-1.0a7 -type f \! -name '*.txt' | xargs zip sb107.zip
o This makes a zip file where the .txt files are mangled (zip -l converts
LF to CR LF), but everything else is left alone.
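If neither a Windows box nor the zip command is handy, the same effect can be approximated in Python. The sketch below is an assumption-laden alternative, not part of the release process described here: build_zip_with_crlf_text is a made-up name, it treats every file ending in .txt as text, and it does none of the binary-file detection that zip itself does.

```python
import os
import zipfile


def build_zip_with_crlf_text(source_dir, zip_path, text_suffix=".txt"):
    """Zip source_dir, giving files that end in text_suffix CR LF endings.

    zip_path should live outside source_dir so the archive does not
    try to include itself.
    """
    parent = os.path.dirname(os.path.abspath(source_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for dirpath, _dirnames, filenames in os.walk(source_dir):
            for filename in sorted(filenames):
                full_path = os.path.join(dirpath, filename)
                # Store paths relative to the parent, so the top-level
                # directory name is kept inside the archive.
                arcname = os.path.relpath(full_path, parent)
                with open(full_path, "rb") as handle:
                    data = handle.read()
                if filename.endswith(text_suffix):
                    # Normalize first so existing CR LF pairs are not doubled.
                    data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
                archive.writestr(arcname, data)
```

For example, build_zip_with_crlf_text("spambayes-1.0a7", "sb107.zip") run from the directory that holds the unpacked tarball.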
Making a binary release
=======================
The binary release includes both sb_server and the Outlook plug-in and
is an installer for Windows (98 and above) systems. In order to have
COM typelibs that work with Outlook 2000, 2002 and 2003, you need to
build the installer on a system that has Outlook 2000 (not a more recent
version). You also need to have InnoSetup, pywin32, resourcepackage and
py2exe installed.
o Get hold of a fresh copy of the source (Windows line endings,
presumably).
o Run the setup.py file in the spambayes/Outlook2000/docs directory
to generate the dynamic documentation.
o Run sb_server and open the web interface. This gets resourcepackage
to generate the needed files.
o Replace the __init__.py file in spambayes/spambayes/resources with
a blank file to disable resourcepackage.
o Ensure that the version numbers in spambayes/spambayes/__init__.py
and spambayes/spambayes/Version.py are up-to-date.
o Ensure that you don't have any other copies of spambayes in your
PYTHONPATH, or py2exe will pick these up! If in doubt, run
setup.py install.
o Run the "setup_all.py" script in the spambayes/windows/py2exe/
directory. This uses py2exe to create the files that Inno will install.
o Open (in InnoSetup) the spambayes.iss file in the spambayes/windows/
directory. Change the version number in the AppVerName and
OutputBaseFilename lines to the new number.
o Compile the spambayes.iss script to get the executable.
o You can now follow the steps in the source release description above,
from the testing step.
Making a translation
====================
Note that it is, in general, best to translate against a stable version,
so that you avoid having to repeatedly re-translate text as the
code changes. A stable version is one that has been released via the
SourceForge system and does not have a letter code in its version number
(e.g. 1.0.1, 1.1.2, but not 1.0a1, 1.1b1, or 2.1rc2). If you do want to
translate a more recent version, be sure to discuss your plans first on
spambayes-dev so that you can be warned about any planned changes.
Translation is only feasible for 1.1 and above. No translation effort
is planned for the 1.0.x series of releases.
To translate, you will need:
o A suitable version of Python (2.2 or greater) installed.
See http://python.org/download
o A copy of the SpamBayes source that you wish to translate.
o Resourcepackage installed.
See http://resourcepackage.sourceforge.net
Optional tools that may make translation easier include:
o A copy of VC++, Visual Studio, or some other GUI tool that allows
editing of VC++ dialog resource files.
o A GUI HTML editor.
o A GUI gettext editor, such as poEdit.
http://poedit.sourceforge.net
Setup
-----
You will need to create a directory structure as follows:
    spambayes/                   # spambayes package directory
                                 # containing classifier.py, tokenizer.py, etc
        languages/               # root languages directory,
                                 # possibly already containing
                                 # other translations
            {lang_code}/         # directory for the specific
                                 # translation - {lang_code} is
                                 # described below
                DIALOGS/         # directory for Outlook plug-in
                                 # dialog resources, which should contain an
                                 # empty __init__.py file, so that py2exe can
                                 # include the directory
                LC_MESSAGES/     # directory for gettext managed
                                 # strings, which should also contain an
                                 # empty __init__.py file
                __init__.py      # Copy of spambayes/spambayes/resources/__init__.py
Translation Tasks
-----------------
There are four translation tasks:
o Documentation. This is the least exciting, but the most important.
If the documentation is appropriately translated, then even if elements
of the interface are not translated, users should be able to manage.
A method of managing translated documents has yet to be created. If you
are interested in translating documentation, please contact
spambayes-dev@python.org.
o Outlook dialogs. The majority of the Outlook plug-in interface is
handled by a VC++/Visual Studio dialog resource file pair (dialogs.h
and dialogs.rc). The plug-in code then manipulates this to create the
actual dialog.
The easiest method of translating these dialogs is to use a tool like
VC++ or Visual Studio. Simply open the
'Outlook2000\dialogs\resources\dialogs.rc' file, translate the dialog,
and save the file as
'spambayes\languages\{lang_code}\DIALOGS\dialogs.rc', where {lang_code}
is the appropriate language code for the language you have translated
into (e.g. 'en_UK', 'es', 'de_DE'). If you do not have a GUI tool to
edit the dialogs, simply open the dialogs.rc file in a text editor,
manually change the appropriate strings, and save the file as above.
Once the dialogs are translated, you need to use the rc2py.py utility
to create the i18n_dialogs.py file. For example, in the
'Outlook2000\dialogs\resources' directory:
> rc2py.py {base}\spambayes\languages\de_DE\DIALOGS\dialogs.rc
{base}\spambayes\languages\de_DE\DIALOGS\i18n_dialogs.py 1
Where {base} is the directory that contains the spambayes package directory.
This should create a 'i18n_dialogs.py' in the same directory as your
translated dialogs.rc file - this is the file that the Outlook plug-in
uses.
o Web interface template file. The majority of the web interface is
created by dynamic use of a HTML template file.
The easiest method of translating this file is to use a GUI HTML editor.
Simply open the 'spambayes/resources/ui.html' file, translate
it as described within, and save the file as
'spambayes/languages/{lang_code}/i18n.ui.html', where {lang_code} is
the appropriate language code as described above. If you do not have
a GUI HTML editor, or are happy editing HTML by hand, simply use your
favorite HTML editor to do this task.
Once the template file is created, resourcepackage will automatically
create the required ui_html.py file when SpamBayes is run with that
language selected.
o Gettext managed strings. The remainder of both the Outlook plug-in
and the web interface are contained within the various Python files
that make up SpamBayes. The Python gettext module (very similar to
the GNU gettext system) is used to manage translation of these strings.
To translate these strings, use the translation template
'spambayes/languages/messages.pot'. You can regenerate that file, if
necessary, by running this command in the spambayes package directory:
> {python dir}\tools\i18n\pygettext.py -o languages\messages.pot
..\contrib\*.py ..\Outlook2000\*.py ..\scripts\*.py *.py
..\testtools\*.py ..\utilities\*.py ..\windows\*.py
You may wish to use a GUI system to create the required messages.po file,
such as poEdit, but you can also do this manually with a text editor.
If your utility does not do it for you, you will also need to
compile the .po file to a .mo file. The utility msgfmt.py will do
this for you - it should be located in '{python dir}\tools\i18n'.
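For the gettext managed strings, a quick way to check that a compiled catalog can be found is to load it with Python's gettext module directly. This sketch assumes the catalog is named messages.mo (guessed from messages.pot) and sits in languages/{lang_code}/LC_MESSAGES/; load_translation is a hypothetical helper, and whether SpamBayes itself loads catalogs exactly this way is not something this file states.

```python
import gettext


def load_translation(languages_dir, lang_code, domain="messages"):
    """Load <languages_dir>/<lang_code>/LC_MESSAGES/<domain>.mo.

    With fallback=True a missing catalog gives a NullTranslations
    object, which hands strings back untranslated instead of raising.
    """
    return gettext.translation(
        domain,
        localedir=languages_dir,
        languages=[lang_code],
        fallback=True,
    )
```

For example, load_translation("spambayes/languages", "de_DE").gettext("Spam") returns the translated string if the catalog exists and has an entry for it, and "Spam" unchanged otherwise.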
Testing the translation
-----------------------
There are two ways to set the language that SpamBayes will use:
o If you are using Windows, change the preferred Windows language using
the Control Panel.
o Set the '[globals] language' SpamBayes option to a list of the
preferred language(s).
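As a sketch of the second method, the snippet below generates a fragment that could go into bayescustomize.ini to select a single language. The helper name language_override is made up, and because this file does not spell out how several languages are separated in the option value, only the single-language case is shown.

```python
import configparser
import io


def language_override(lang_code):
    """Return .ini text that sets the '[globals] language' option."""
    parser = configparser.ConfigParser()
    parser["globals"] = {"language": lang_code}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(language_override("de_DE"))
# [globals]
# language = de_DE
```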