File: TREC2005.txt

TREC 2005 / spam filtering track

This is a description of the four dbacl filters that were submitted
for the pilot run. Each filter is based on dbacl 1.11 and packaged as
a standalone tar.gz archive containing the dbacl-1.11.TREC.sfx.sh script,
this README, and an OPTIONS.default text file.

To use a filter, unpack its archive and run the self-extracting
script in the directory containing the OPTIONS.default file. This
unpacks the initialize and classify scripts. You can then run the
spamjig scripts to perform all the remaining work.
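
The unpacking steps can be sketched as a small shell function. This is
only an illustration: the function name unpack_filter and the working
directory argument are hypothetical, and it assumes the archive places
OPTIONS.default and the dbacl-1.11.TREC.sfx.sh script in the same
directory. Running the spamjig scripts afterwards is left out.

```shell
#!/bin/sh
# Sketch only: unpack one filter archive and run its self-extracting script.
# Assumption: OPTIONS.default and dbacl-1.11.TREC.sfx.sh end up in the
# same directory once the archive is unpacked.
unpack_filter() {
    archive=$1    # e.g. breyerSPAMp2adphu.tar.gz
    target=$2     # working directory for this filter (hypothetical name)

    mkdir -p "$target" || return 1
    tar -xzf "$archive" -C "$target" || return 1

    # Locate the directory that holds OPTIONS.default.
    optfile=$(find "$target" -name OPTIONS.default | head -n 1)
    if [ -z "$optfile" ]; then
        echo "OPTIONS.default not found in $target" >&2
        return 1
    fi

    # Run the self-extracting script there; according to this README it
    # unpacks the initialize and classify scripts.
    ( cd "$(dirname "$optfile")" && sh ./dbacl-1.11.TREC.sfx.sh )
}
```

For example, "unpack_filter breyerSPAMp2adphu.tar.gz work-p2" would
prepare the second pilot filter in the directory work-p2.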

dbacl needs a standard gcc build environment to compile, but no
special libraries.

--------------------------
DESCRIPTION OF PILOT RUNS
--------------------------

breyerSPAMp1cefhuj.tar.gz

This filter tests the cef tokenizer with full standard header analysis
and uniform reference measure, case sensitive tokens.

breyerSPAMp2adphu.tar.gz

This filter tests the adp tokenizer with full standard header analysis
and uniform reference measure, lowercase tokens.

breyerSPAMp3adphd.tar.gz

This filter tests the adp tokenizer with full standard header analysis
and dirichlet reference measure, lowercase tokens.

breyerSPAMp4adp.tar.gz

This filter tests the adp tokenizer with only Subject analysis,
uniform reference measure, lowercase tokens.


==> OPTIONS.1cefhuj <==
# these settings are interesting for the SA corpus
DBACL_LOPTS='-H 25 -1 -T email -T email:headers -T email:theaders -T html:links -T html:alt -L uniform -e cef -j'
DBACL_COPTS='-nv'
DBACL_CHAM='ham[ ]*\([^ ]*\)'
DBACL_CSPAM='.* spam[ ]*\([^ ]*\)'
DBACL_SGN=''

==> OPTIONS.2adphu <==
# these settings are interesting for the SA corpus
DBACL_LOPTS='-H 25 -1 -T email -T email:headers -T email:theaders -T html:links -T html:alt -L uniform -e adp'
DBACL_COPTS='-nv'
DBACL_CHAM='ham[ ]*\([^ ]*\)'
DBACL_CSPAM='.* spam[ ]*\([^ ]*\)'
DBACL_SGN=''

==> OPTIONS.3adphd <==
# these settings are interesting for the SA corpus
DBACL_LOPTS='-H 25 -1 -T email -T email:headers -T email:theaders -T html:links -T html:alt -L dirichlet -e adp'
DBACL_COPTS='-nv'
DBACL_CHAM='ham[ ]*\([^ ]*\)'
DBACL_CSPAM='.* spam[ ]*\([^ ]*\)'
DBACL_SGN=''

==> OPTIONS.4adp <==
# these settings are interesting for the SA corpus
DBACL_LOPTS='-H 25 -1 -T email -T email:noheaders -L uniform -e adp'
DBACL_COPTS='-nv'
DBACL_CHAM='ham[ ]*\([^ ]*\)'
DBACL_CSPAM='.* spam[ ]*\([^ ]*\)'
DBACL_SGN=''
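
The DBACL_CHAM and DBACL_CSPAM values read as basic regular
expressions with one capture group each, presumably used to pull the
ham and spam scores out of the classifier output. The sketch below
shows one way such patterns could be applied with sed. The sample
line is hypothetical, and the way the classify script actually
combines the patterns is an assumption, not something stated here.

```shell
#!/bin/sh
# Patterns copied from the OPTIONS files above.
DBACL_CHAM='ham[ ]*\([^ ]*\)'
DBACL_CSPAM='.* spam[ ]*\([^ ]*\)'

# Hypothetical classifier output; the real `dbacl -nv` format may differ.
line='ham 6.28 * 7205.0 spam 6.69 * 7205.0'

# Assumption: each pattern is anchored at the start of the line and the
# captured group is the first token following the category name.
ham_score=$(printf '%s\n' "$line" | sed -n "s/^${DBACL_CHAM}.*\$/\1/p")
spam_score=$(printf '%s\n' "$line" | sed -n "s/^${DBACL_CSPAM}.*\$/\1/p")

echo "ham=$ham_score spam=$spam_score"
# prints: ham=6.28 spam=6.69
```

The leading ".* " in DBACL_CSPAM lets the pattern skip over whatever
precedes the spam category on the line, which is why both patterns can
be anchored at the start in this sketch.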

----------------------------
DESCRIPTION OF OFFICIAL RUNS
----------------------------

The Enron pilot run was not very useful as a way of choosing
interesting options for the official runs, so the pilot packages will
be repeated as-is.