@chapter Bag Of Words Library README

@c set the vars BOW_VERSION
@include version.texi

@samp{libbow}, version @value{BOWVERSION}.

@include libbow-desc.texi


@section Rainbow

@samp{Rainbow} is a standalone program that does document
classification.  Here are some examples:

@itemize @bullet

@item

@example
rainbow -i ./training/positive ./training/negative
@end example

Using the text files found under the directories
@file{./training/positive} and @file{./training/negative},
tokenize, build word vectors, and write the resulting data structures
to disk.

@item

@example
rainbow --query=./testing/254
@end example

Tokenize the text document @file{./testing/254}, and classify it,
producing output like:

@example
/home/mccallum/training/positive 0.72
/home/mccallum/training/negative 0.28
@end example

@item

@example
rainbow --test-set=0.5 -t 5
@end example

Perform 5 trials.  Each trial makes a new random test/train split,
placing half of the documents in the test set, and outputs the
classification of the test documents.

@end itemize
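
The repeated random splitting in the last example can be pictured with
the following Python sketch.  It only illustrates the general idea of
drawing a fresh test/train split for each trial; it is not rainbow's
own code, and the function name @code{split_trials} and its parameters
are hypothetical.  The defaults mirror @samp{--test-set=0.5 -t 5}.

@example
```python
import random


def split_trials(documents, test_fraction=0.5, trials=5, seed=None):
    """Yield (train, test) lists, one fresh random split per trial."""
    rng = random.Random(seed)
    # Number of documents held out for testing in every trial.
    n_test = int(round(len(documents) * test_fraction))
    for _ in range(trials):
        shuffled = list(documents)   # copy, so the input is left untouched
        rng.shuffle(shuffled)        # new random order for this trial
        yield shuffled[n_test:], shuffled[:n_test]
```
@end example

A classifier would be trained on the first list of each pair and
evaluated on the second.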

Typing @samp{rainbow --help} will give a list of all rainbow options.

After you have compiled @samp{libbow} and @samp{rainbow}, you can run
the shell script @file{./demo/script} to see an annotated demonstration
of the classifier in action.

More information and documentation are available at
@uref{http://www.cs.cmu.edu/~mccallum/bow}.


@format
Rainbow improvements coming eventually:
   Better documentation.
   Incremental model training.
@end format



@section Arrow

@samp{Arrow} is a standalone program that does document retrieval by
TFIDF (term frequency, inverse document frequency) weighting.
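
As a rough illustration of what TFIDF retrieval means, the following
Python sketch scores each document by summing, over the query words,
the word's count in the document times the logarithm of the inverse
document frequency.  This is one common textbook weighting, assumed
here for illustration; arrow's actual tokenization and weighting are not
described in this README and may differ.  The names @code{tokenize},
@code{build_index}, and @code{rank} are hypothetical.

@example
```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """docs maps a document name to its text; returns (term counts, idf)."""
    term_counts = dict((name, Counter(tokenize(text)))
                       for name, text in docs.items())
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    idf = dict((term, math.log(n_docs / df)) for term, df in doc_freq.items())
    return term_counts, idf


def rank(query, term_counts, idf):
    """Return document names ordered from best to worst match."""
    scored = []
    for name, counts in term_counts.items():
        # Terms absent from the index contribute nothing to the score.
        score = sum(counts[t] * idf.get(t, 0.0) for t in tokenize(query))
        scored.append((score, name))
    # Highest score first; ties are broken by document name.
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))]
```
@end example

A word that occurs in every document gets an inverse document frequency
of zero in this scheme, so it does not affect the ranking.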

Index all the documents in directory @samp{foo} by typing

@example
arrow --index foo
@end example

Make a single query by typing

@example
arrow --query
@end example

then typing your query, and pressing Control-D.

If you want to make many queries, it will be more efficient to run arrow
as a server, and query it multiple times without restarts by
communicating through a socket.  Type, for example,

@example
arrow --query-server=9876
@end example

Then access it through port number 9876.  For example:

@example
telnet localhost 9876
@end example

In this mode there is no need to press Control-D to end a query.  Simply
type your query on one line, and press return.
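
A program can talk to the query server in the same way telnet does.
The Python sketch below opens a socket, sends one query line, and
collects whatever text comes back.  It assumes only what is stated
above: one query per line on the given port.  The format of the reply
and how its end is marked are not documented here, so the sketch simply
stops reading when the server closes the connection or stays silent for
@code{timeout} seconds.  The function name @code{query_arrow} is
hypothetical.

@example
```python
import socket


def query_arrow(query, host="localhost", port=9876, timeout=2.0):
    """Send one query line to the server and return the raw reply text."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        # One query per line, ended by a newline, as when typing in telnet.
        conn.sendall((query + "\n").encode("utf-8"))
        try:
            while True:
                data = conn.recv(4096)
                if not data:         # server closed the connection
                    break
                chunks.append(data)
        except socket.timeout:
            # Assumption: no end-of-reply marker is known, so a quiet
            # period is treated as the end of the response.
            pass
    return b"".join(chunks).decode("utf-8", errors="replace")
```
@end example

With a server started as shown above, @code{query_arrow("your query")}
would return the server's reply as a single string.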


@section Crossbow

@samp{Crossbow} is a standalone program that does document clustering.
Sorry, there is no documentation yet.


@section Archer

@samp{Archer} is a standalone program that does document retrieval with
AltaVista-type queries.  Sorry, there is no documentation yet; however,
the basic interface is much like that of arrow.