There are three main modes of operation: indexing (-i), querying (-q), and testing (-t). Here is an example of indexing two classes of documents:
./rainbow -i /usr1/mitchell/datasets/homepagesections/NEG /usr1/mitchell/datasets/homepagesections/POS
Class `NEG'
Counting words... files : unique-words :: 522 : 5535
Class `POS'
Counting words... files : unique-words :: 164 : 6979
Class `NEG'
Gathering stats... files : unique-words :: 522 : 3830
Class `POS'
Gathering stats... files : unique-words :: 164 : 3830
Making vector-per-class... words :: 30
Normalizing weights: 0
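The indexing pass builds a per-class vocabulary from the training files. As a rough illustration only (not rainbow's actual code), this Python sketch reproduces the "files : unique-words" count for one class directory, assuming plain-text files and simple lowercase alphabetic tokenization; rainbow additionally applies stoplist removal and stemming by default:

```python
import os
import re

def index_class(class_dir):
    """Count files and unique lowercase alphabetic tokens in one class
    directory, mimicking rainbow's "files : unique-words" report."""
    n_files = 0
    vocab = set()
    for name in os.listdir(class_dir):
        path = os.path.join(class_dir, name)
        if not os.path.isfile(path):
            continue
        n_files += 1
        with open(path, errors="ignore") as f:
            vocab.update(re.findall(r"[a-z]+", f.read().lower()))
    return n_files, len(vocab)
```

The stemming explains tokens like `laboratori` and `lyco` in the info-gain list further below.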
Here is an example query:
./rainbow -q /usr1/mitchell/datasets/homepagesections/POS/ZABOWSKI-DAVID.h4-3
Loading data files...
Hit number 0, with score 1
Class `/usr1/mitchell/datasets/homepagesections/POS'
Hit number 1, with score 1.27581e-14
Class `/usr1/mitchell/datasets/homepagesections/NEG'
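The two scores sum to one because rainbow normalizes the per-class scores into posterior probabilities, which is why a confident query shows one score near 1 and the other vanishingly small. A minimal multinomial naive Bayes sketch (hypothetical data structures, not rainbow's internals) shows how such normalized scores arise:

```python
import math

def naive_bayes_scores(query_words, class_word_counts, class_doc_counts):
    """Score a query against each class with multinomial naive Bayes,
    then normalize the scores to sum to 1 (as in rainbow's -q output).
    class_word_counts: {class: {word: count}}; Laplace smoothing
    handles words unseen in a class."""
    vocab = set(w for counts in class_word_counts.values() for w in counts)
    total_docs = sum(class_doc_counts.values())
    log_scores = {}
    for c, counts in class_word_counts.items():
        total = sum(counts.values())
        lp = math.log(class_doc_counts[c] / total_docs)  # class prior
        for w in query_words:
            lp += math.log((counts.get(w, 0) + 1) / (total + len(vocab)))
        log_scores[c] = lp
    # convert log scores to normalized probabilities (stable softmax)
    m = max(log_scores.values())
    exp = {c: math.exp(lp - m) for c, lp in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}
```

Working in log space and subtracting the maximum before exponentiating avoids the underflow that per-word probability products would otherwise cause on long documents.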
Here is an example of running an experiment, in two steps. The first
step uses the data from the most recent -i run of Rainbow and
performs multiple (here, 10) iterations of a train/test split. The
second step uses a Perl script called rainbow-stats to summarize the
results of these trials in a human-readable form.
./rainbow -t 10 > ~/rainbow.output
Loading data files...
Making vector-per-class... words :: 79
Normalizing weights: 0
Making vector-per-class... words :: 79
Normalizing weights: 0
...more...
Now let's look at the results:
cat ~/rainbow.output | ~/rainbow-stats
Trial 0
Correct: 154 out of 200 (77.00)
- Confusion details
Actual: NEG
NEG:124 POS:31
Actual: POS
NEG:15 POS:30
Trial 1
Correct: 152 out of 201 (75.62)
- Confusion details
Actual: NEG
NEG:125 POS:36
Actual: POS
NEG:13 POS:27
...more...
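Each trial's "Correct" line is just the diagonal of that trial's confusion matrix divided by the total number of test documents. A small Python sketch of the arithmetic, with the matrix laid out as nested dicts (a hypothetical layout, not rainbow-stats's own data structure):

```python
def accuracy(confusion):
    """Accuracy from a confusion matrix laid out as
    {actual_class: {predicted_class: count}}."""
    correct = sum(row.get(actual, 0) for actual, row in confusion.items())
    total = sum(sum(row.values()) for row in confusion.values())
    return correct, total, 100.0 * correct / total
```

For Trial 0 above this gives 124 + 30 = 154 correct out of 200, i.e. 77.00.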
Here is a nice way to see the 15 terms with the highest information gain:
./rainbow -I 15
Loading data files...
Calculating info gain... words :: 79
0.12477 project
0.05289 research
0.05225 lyco
0.03599 html
0.03523 system
0.03445 vasc
0.02966 http
0.02792 home
0.02779 vision
0.02741 href
0.02730 cmu
0.02342 wa
0.01995 www
0.01952 parallel
0.01942 laboratori
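Information gain measures how much knowing whether a document contains a word reduces uncertainty about its class: IG(w) = H(C) - P(w) H(C | w present) - P(not w) H(C | w absent). A sketch of that standard formula over presence/absence counts per class (rainbow's exact event model and logarithm base may differ, so the numbers above need not match this sketch):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(docs_with_word, docs_per_class):
    """IG(w) = H(C) - P(w) H(C|w present) - P(not w) H(C|w absent).
    docs_per_class: {class: total docs}; docs_with_word: {class: docs
    in that class containing the word}."""
    n = sum(docs_per_class.values())
    h_class = entropy([c / n for c in docs_per_class.values()])
    n_with = sum(docs_with_word.get(c, 0) for c in docs_per_class)
    n_without = n - n_with
    h_with = entropy([docs_with_word.get(c, 0) / n_with
                      for c in docs_per_class]) if n_with else 0.0
    h_without = entropy([(docs_per_class[c] - docs_with_word.get(c, 0)) / n_without
                         for c in docs_per_class]) if n_without else 0.0
    return h_class - (n_with / n) * h_with - (n_without / n) * h_without
```

A word that appears in every document of one class and none of the other recovers the full class entropy; a word spread evenly across classes has gain zero, which is why the ranking above surfaces class-specific terms like `project` and `vision`.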
Here is how to get online help (any illegal option prints the usage message):
[jr6b@stomach bow]$ ./rainbow -X
rainbow: illegal option -- X
usage: ./rainbow [-d datadir] [-v <verbosity_level>] [-b]
  (where `datadir' is the directory in which to read/write data
   `verbosity_level' is 0=silent, 1=quiet, 2=show-progress, ... 5=max)
  [-b] don't use backspace when verbosifying (good for use in emacs)
  [-T <N>] prune all but top N words by info-gain (default: infinity)
  [-m <mname>] set method to <mname> (eg. naivebayes, tfidf, prind)
  [-U] in the PrInd method, use non-uniform prior probabilities
  [-G] in the PrInd method, scale Pr(w|d) by Foil-gain
  [-V] print version information and exit
 lexing options
  [-s] don't use the stoplist (i.e. don't prune frequent words)
  [-S] turn off stemming of the tokens
  [-H] ignore HTML tokens
  [-g <N>] set N for N-gram lexer (default=1)
  [-h] skip over email or news header
 then, for indexing and setting weights
  -i <class1_dir> <class2_dir> ...
  [-L] don't lex to get word counts, instead read archived barrel
  [-f <file>] prints file contents instead of class_dir at query time
  [-R <N>] remove words with occurrence counts less than N
 or, for querying
  -q [<file_containing_query>]
  [-n <N>] prints the N best-matching documents to the query
 or, for testing
  [-t <N>] perform N testing trials
  [-p <N>] (with -t) Use N% of the documents as test instances
  [-x <class1_dir> <class2_dir>...] use these files as test instances
  [-N] in the PrInd method, do not normalize the scores.
 or, for diagnostics
  [-I <N>] prints the top N words with highest information gain
  [-W <classname>] prints the weight-vector for <classname>
  [-F <classname>] print the unsorted foilgain #'s for <classname>
  [-P] print score contribution of each word to each class
  [-B] print barrel word vectors in awk-processable form
Here is an example of indexing a newsgroups dataset:
./rainbow -M -m naivebayes -i [news data directory]/*
where 'news data directory' is the directory into which you untarred the newsgroups file (note that the '/*' passes each subdirectory to Rainbow as a separate class).
./rainbow -t [number of test runs] -p 33 > rainbow.output
cat rainbow.output | ./rainbow-stats