File: sb_notesfilter.py

package info
spambayes 1.1a6-1
  • area: main
  • in suites: wheezy
  • size: 4,712 kB
  • sloc: python: 48,776; ansic: 535; sh: 87; lisp: 83; makefile: 46
file content (428 lines) | stat: -rw-r--r-- 15,134 bytes
#! /usr/bin/env python

'''sb_notesfilter.py - Lotus Notes SpamBayes interface.

    This module uses SpamBayes as a filter against a Lotus Notes mail
    database.  The Notes client must be running when this process is
    executed.

    It requires a Notes folder, named as a parameter, with four
    subfolders:
        Spam
        Ham
        Train as Spam
        Train as Ham

    Depending on the execution parameters, it will do any or all of the
    following steps, in the order given.

    1. Train Spam from the Train as Spam folder (-t option)
    2. Train Ham from the Train as Ham folder (-t option)
    3. Replicate (-r option)
    4. Classify the inbox (-c option)

    Mail that is to be trained as spam should be manually moved to
    the Train as Spam folder by the user; likewise, mail that is to be
    trained as ham should be moved to the Train as Ham folder.  After
    training, spam is moved to the Spam folder and ham is moved to the
    Ham folder.

    Replication takes place if a remote server has been specified.
    This step may take a long time, depending on replication
    parameters and how much information there is to download, as well
    as line speed and server load.  Please be patient if you run with
    replication.  There is currently no progress indicator, but
    replication is running and will complete eventually.  If
    replication fails, the only notification is a "Could not
    replicate" message; no harm is done, and the program continues
    execution.

    Mail that is classified as Spam is moved from the inbox to the
    Train as Spam folder.  You should occasionally review your Spam
    folder for Ham that has mistakenly been classified as Spam.  If
    there is any there, move it to the Train as Ham folder, so
    SpamBayes will be less likely to make this mistake again.

    Mail that is classified as Ham or Unsure is left in the inbox.
    There is currently no means of telling if a mail was classified as
    Ham or Unsure.

    You should occasionally select some Ham and move it to the Train
    as Ham folder, so SpamBayes can tell the difference between Spam
    and Ham. The goal is to maintain an approximate balance between the
    number of Spam and the number of Ham that have been trained into
    the database. These numbers are reported every time this program
    executes.  However, if the amount of Spam you receive far exceeds
    the amount of Ham you receive, it may be very difficult to
    maintain this balance.  This is not a matter of great concern.
    SpamBayes will still make very few mistakes in this circumstance.
    But, if this is the case, you should review your Spam folder for
    falsely classified Ham, and retrain those that you find, on a
    regular basis.  This will prevent statistical error accumulation,
    which if allowed to continue, would cause SpamBayes to tend to
    classify everything as Spam.

    Because there is no programmatic way to determine if a particular
    mail has been previously processed by this classification program,
    it keeps a pickled dictionary of Notes mail IDs, so that once a
    mail has been classified, it will not be classified again.  The
    non-existence of this index file, named <local database>.sbindex,
    indicates to the system that this is an initialization execution.
    Rather than classify the inbox in this case, the contents of the
    inbox are placed in the index to note the 'starting point' of the
    system.  After that, any new messages in the inbox are eligible
    for classification.

Usage:
    sb_notesfilter [options]

        note: option values with spaces in them must be enclosed
              in double quotes

        options:
            -p  dbname  : pickled training database filename
            -d  dbname  : dbm training database filename
            -l  dbname  : database filename of local mail replica
                            e.g. localmail.nsf
            -r  server  : server address of the server mail database
                            e.g. d27ml602/27/M/IBM
                          if specified, will initiate a replication
            -f  folder  : Name of SpamBayes folder
                            must have subfolders: Spam
                                                  Ham
                                                  Train as Spam
                                                  Train as Ham
            -t          : train contents of Train as Spam and Train as Ham
            -c          : classify inbox
            -h          : help
            -P          : prompt "Press Enter to end" before ending
                          This is useful for automated executions where the
                          statistics output would otherwise be lost when the
                          window closes.
            -i filename : index file name
                            (default: <local database>.sbindex)
            -W password : Notes password
            -L dbname   : log to database (template alog4.ntf)
            -o section:option:value :
                          set [section, option] in the options database
                          to value

Examples:

    Replicate and classify inbox
        sb_notesfilter -c -d notesbayes -r mynoteserv -l mail.nsf -f Spambayes

    Train Spam and Ham, then classify inbox
        sb_notesfilter -t -c -d notesbayes -l mail.nsf -f Spambayes

    Replicate, then classify inbox
        sb_notesfilter -c -d test7 -l mail.nsf -r mynoteserv -f Spambayes

To Do:
    o Dump/purge notesindex file
    o Create correct folders if they do not exist
    o Options for some of this stuff?
    o sb_server style training/configuration interface?
    o parameter to retrain?
    o Use spambayes.message MessageInfo db's rather than own database.
    o Suggestions?
'''

# This module is part of the spambayes project, which is Copyright 2002-2007
# The Python Software Foundation and is covered by the Python Software
# Foundation license.

from __future__ import generators

__author__ = "Tim Stone <tim@fourstonesExpressions.com>"
__credits__ = "Mark Hammond, for his remarkable win32 modules."

import sys
import errno
import getopt

import win32com.client
import pywintypes

from spambayes import tokenizer, storage
from spambayes.Options import options
from spambayes.safepickle import pickle_read, pickle_write

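def classifyInbox(v, vmoveto, bayes, ldbname, notesindex, log):
    # Score every not-yet-indexed document in view v and move those
    # scored as spam to the vmoveto folder.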

    # The notesindex dictionary ensures that a message is looked at only
    # once.  An empty index means this is an initialization run: the
    # inbox contents are recorded in the index but not classified.

    firsttime = not notesindex

    docstomove = []
    numham = 0
    numspam = 0
    numuns = 0
    numdocs = 0

    doc = v.GetFirstDocument()
    while doc:
        nid = doc.NOTEID
        if firsttime:
            notesindex[nid] = 'never classified'
        else:
            if nid not in notesindex:
                numdocs += 1

                # Notes returns strings in unicode, and the Python
                # decoder has trouble with these strings when
                # you try to print them.  So don't...

                message = getMessage(doc)

                # generate_long_skips = True blows up on occasion,
                # probably due to this unicode problem.
                options["Tokenizer", "generate_long_skips"] = False
                tokens = tokenizer.tokenize(message)
                prob = bayes.spamprob(tokens)

                if prob < options["Categorization", "ham_cutoff"]:
                    numham += 1
                elif prob > options["Categorization", "spam_cutoff"]:
                    docstomove.append(doc)
                    numspam += 1
                else:
                    numuns += 1

                notesindex[nid] = 'classified'
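                # message is a plain string here, so the subject has to
                # be read from the Notes document itself.
                try:
                    subj = doc.GetItemValue('Subject')[0]
                except:
                    subj = 'No Subject'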
                try:
                    print "%s spamprob is %s" % (subj[:30], prob)
                    if log:
                        log.LogAction("%s spamprob is %s" % (subj[:30],
                                                             prob))
                except UnicodeError:
                    print "<subject not printed> spamprob is %s" % (prob,)
                    if log:
                        log.LogAction("<subject not printed> spamprob "
                                      "is %s" % (prob,))

                item = doc.ReplaceItemValue("Spam", prob)
                item.IsSummary = True
                doc.save(False, True, False)

        doc = v.GetNextDocument(doc)

    # docstomove list is built because moving documents in the middle of
    # the classification loop loses the iterator position
    for doc in docstomove:
        doc.RemoveFromFolder(v.Name)
        doc.PutInFolder(vmoveto.Name)

    print "%s documents processed" % (numdocs,)
    print "   %s classified as spam" % (numspam,)
    print "   %s classified as ham" % (numham,)
    print "   %s classified as unsure" % (numuns,)
    if log:
        log.LogAction("%s documents processed" % (numdocs,))
        log.LogAction("   %s classified as spam" % (numspam,))
        log.LogAction("   %s classified as ham" % (numham,))
        log.LogAction("   %s classified as unsure" % (numuns,))
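
# The three-way decision made in classifyInbox can be sketched in
# isolation as below.  This is only an illustration: the names
# categorize and tally are hypothetical, and the cutoff values 0.2 and
# 0.9 are assumed example values, not read from the options database.

```python
def categorize(prob, ham_cutoff=0.2, spam_cutoff=0.9):
    """Return 'ham', 'spam' or 'unsure' for a spam probability.

    Mirrors the strict comparisons in classifyInbox: a score exactly
    equal to either cutoff counts as unsure.
    """
    if prob < ham_cutoff:
        return "ham"
    elif prob > spam_cutoff:
        return "spam"
    return "unsure"


def tally(probs, ham_cutoff=0.2, spam_cutoff=0.9):
    """Count how many scores fall into each category."""
    counts = {"ham": 0, "spam": 0, "unsure": 0}
    for prob in probs:
        counts[categorize(prob, ham_cutoff, spam_cutoff)] += 1
    return counts
```

# For example, tally([0.05, 0.5, 0.95]) puts one score in each category.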

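def getMessage(doc):
    # Build an RFC 822-style message string (selected headers, subject
    # and body) from a Notes document so that it can be tokenized.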
    try:
        subj = doc.GetItemValue('Subject')[0]
    except:
        subj = 'No Subject'

    try:
        body  = doc.GetItemValue('Body')[0]
    except:
        body = 'No Body'

    hdrs = ''
    for item in doc.Items:
        if item.Name in ("From", "Sender", "Received", "ReplyTo"):
            try:
                hdrs = hdrs + ( "%s: %s\r\n" % (item.Name, item.Text) )
            except:
                hdrs = ''

    message = "%sSubject: %s\r\n\r\n%s" % (hdrs, subj, body)
    return message

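def processAndTrain(v, vmoveto, bayes, is_spam, notesindex, log):
    # Train every document in view v as spam or ham, undoing any
    # earlier opposite training, then move the documents to vmoveto.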
    if is_spam:
        header_str = options["Headers", "header_spam_string"]
    else:
        header_str = options["Headers", "header_ham_string"]

    print "Training %s" % (header_str,)

    docstomove = []
    doc = v.GetFirstDocument()
    while doc:
        message = getMessage(doc)

        options["Tokenizer", "generate_long_skips"] = False
        tokens = tokenizer.tokenize(message)

        nid = doc.NOTEID
        if nid in notesindex:
            trainedas = notesindex[nid]
            if trainedas == options["Headers", "header_spam_string"] and \
               not is_spam:
                # msg is trained as spam, is to be retrained as ham
                bayes.unlearn(tokens, True)
            elif trainedas == options["Headers", "header_ham_string"] and \
                 is_spam:
                # msg is trained as ham, is to be retrained as spam
                bayes.unlearn(tokens, False)

        bayes.learn(tokens, is_spam)

        notesindex[nid] = header_str
        docstomove.append(doc)
        doc = v.GetNextDocument(doc)

    for doc in docstomove:
        doc.RemoveFromFolder(v.Name)
        doc.PutInFolder(vmoveto.Name)

    print "%s documents trained" % (len(docstomove),)
    if log:
        log.LogAction("%s documents trained" % (len(docstomove),))
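
# The retraining rule in processAndTrain (unlearn the old label when a
# message switches sides, then learn the new one) can be tried without
# Notes using the sketch below.  ToyClassifier, train_one and the
# "spam"/"ham" labels are hypothetical stand-ins; the real code uses the
# SpamBayes classifier and the header strings from the options database.

```python
class ToyClassifier(object):
    """Hypothetical stand-in that only counts trained messages."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam):
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


SPAM, HAM = "spam", "ham"


def train_one(bayes, index, nid, tokens, is_spam):
    """Train one message, first undoing an earlier opposite training."""
    label = SPAM if is_spam else HAM
    previous = index.get(nid)
    if previous in (SPAM, HAM) and previous != label:
        # The message was trained the other way before: remove that
        # training so its tokens are not counted on both sides.
        bayes.unlearn(tokens, previous == SPAM)
    bayes.learn(tokens, is_spam)
    index[nid] = label
```

# As in processAndTrain, a message trained twice with the same label is
# simply learned again; only a change of label triggers unlearn.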


def run(bdbname, useDBM, ldbname, rdbname, foldname, doTrain, doClassify,
        pwd, idxname, logname):
    bayes = storage.open_storage(bdbname, useDBM)

    try:
        notesindex = pickle_read(idxname)
    except IOError, e:
        if e.errno != errno.ENOENT:
            raise
        notesindex = {}
        print "%s file not found, this is a first time run" % (idxname,)
        print "No classification will be performed"

    need_replicate = False

    sess = win32com.client.Dispatch("Lotus.NotesSession")
    try:
        if pwd:
            sess.initialize(pwd)
        else:
            sess.initialize()
    except pywintypes.com_error:
        print "Session aborted"
        sys.exit(1)
    try:
        db = sess.GetDatabase(rdbname, ldbname)
    except pywintypes.com_error:
        if rdbname:
            print "Could not open database remotely, trying locally"
            try:
                db = sess.GetDatabase("", ldbname)
                need_replicate = True
            except pywintypes.com_error:
                print "Could not open database"
                sys.exit(1)
        else:
            raise

    log = sess.CreateLog("SpambayesAgentLog")
    try:
        log.OpenNotesLog("", logname)
    except pywintypes.com_error:
        print "Could not open log"
        log = None

    if log:
        log.LogAction("Running spambayes")

    vinbox = db.getView('($Inbox)')
    vspam = db.getView("%s\\Spam" % (foldname,))
    vham = db.getView("%s\\Ham" % (foldname,))
    vtrainspam = db.getView("%s\\Train as Spam" % (foldname,))
    vtrainham = db.getView("%s\\Train as Ham" % (foldname,))

    if doTrain:
        processAndTrain(vtrainspam, vspam, bayes, True, notesindex, log)
        # for some reason, using inbox as a target here loses the mail
        processAndTrain(vtrainham, vham, bayes, False, notesindex, log)

    if need_replicate:
        try:
            print "Replicating..."
            db.Replicate(rdbname)
            print "Done"
        except pywintypes.com_error:
            print "Could not replicate"

    if doClassify:
        classifyInbox(vinbox, vtrainspam, bayes, ldbname, notesindex, log)

    print "The Spambayes database currently has %s Spam and %s Ham" \
          % (bayes.nspam, bayes.nham)

    bayes.store()

    pickle_write(idxname, notesindex)

    if log:
        log.LogAction("Finished running spambayes")
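
# The index handling in run (a missing index file means a first-time
# run; any other I/O error is re-raised) can be sketched with the
# standard library alone.  This is an assumption-laden illustration:
# load_index and save_index are hypothetical names, and plain pickle is
# used here in place of spambayes.safepickle.

```python
import errno
import pickle


def load_index(path):
    """Return (index, first_run) for the index file at path."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f), False
    except IOError as e:
        # Only "file not found" signals a first run; anything else
        # (permissions, disk errors) is a real failure.
        if e.errno != errno.ENOENT:
            raise
        return {}, True


def save_index(path, index):
    """Write the index dictionary back to disk."""
    with open(path, "wb") as f:
        pickle.dump(index, f)
```

# On a first run, load_index returns an empty dictionary and True, which
# corresponds to the "No classification will be performed" case above.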


if __name__ == '__main__':
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'htcPd:p:l:r:f:o:i:W:L:')
    except getopt.error, msg:
        print >> sys.stderr, str(msg) + '\n\n' + __doc__
        sys.exit(1)

    ldbname = None  # local notes database name
    rdbname = None  # remote notes database location
    sbfname = None  # spambayes folder name
    idxname = None  # index file name
    logname = None  # log database name
    pwd = None # password
    doTrain = False
    doClassify = False
    doPrompt = False

    for opt, arg in opts:
        if opt == '-h':
            print >> sys.stderr, __doc__
            sys.exit()
        elif opt == '-l':
            ldbname = arg
        elif opt == '-r':
            rdbname = arg
        elif opt == '-f':
            sbfname = arg
        elif opt == '-t':
            doTrain = True
        elif opt == '-c':
            doClassify = True
        elif opt == '-P':
            doPrompt = True
        elif opt == '-i':
            idxname = arg
        elif opt == '-L':
            logname = arg
        elif opt == '-W':
            pwd = arg
        elif opt == '-o':
            options.set_from_cmdline(arg, sys.stderr)
    bdbname, useDBM = storage.database_type(opts)

    if not idxname:
        idxname = "%s.sbindex" % (ldbname,)

    if bdbname and ldbname and sbfname and (doTrain or doClassify):
        run(bdbname, useDBM, ldbname, rdbname,
            sbfname, doTrain, doClassify, pwd, idxname, logname)

        if doPrompt:
            raw_input("Press Enter to end ")
    else:
        print >> sys.stderr, __doc__