<HTML>
<HEAD>
<TITLE>Rainbow</TITLE>
</HEAD>
<BODY>

<h1>Rainbow</h1>

<i>Rainbow</i> is a program that performs statistical
text classification.  It is based on the <i>Bow</i> library.  For more
information about obtaining the source and citing its use, see the <a
href="http://www.cs.cmu.edu/~mccallum/bow">Bow home page</a>.

<p>This documentation is intended as a brief tutorial for using
rainbow, version 0.9 or later.  It is not complete documentation.  It
is not a tutorial on the source code.

<p>The examples on this page assume that you have compiled libbow and
rainbow, and that rainbow is in your path.  Several of the examples
also assume that you have downloaded the <a
href="http://www.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes/20_newsgroups.tar.gz">20_newsgroups</a>
data set, unpacked it in your home directory, and therefore that its
files are available in the directory <tt>~/20_newsgroups</tt>.

<h3>1. Introduction</h3>

The general pattern of rainbow usage has two steps: (1) have rainbow
read your documents and write to disk a "model" containing their
statistics; (2) have rainbow use the model to perform classification
or diagnostics.

<p>You can obtain on-line documentation of each rainbow command-line
option by typing <pre> rainbow --help | more </pre> This
<tt>--help</tt> option is useful for checking the latest details of
particular options, but does not provide a tutorial or an overview of
rainbow's use.

<p>Command-line options in rainbow and all the <i>Bow</i> library
frontends are handled by the <tt>libargp</tt> library from the FSF.
Many command-line options have both long and short forms.  For
example, to set the verbosity level to 4 (to make rainbow give more
runtime diagnostic messages than usual), you can type
"<tt>--verbosity=4</tt>", or "<tt>--verbosity 4</tt>", or "<tt>-v
4</tt>".  For more detail about the verbosity option, see section 5.1.


<h3>2. Reading the documents, building a model</h3>

<p>Before performing classification or diagnostics with
rainbow, you must first have rainbow index your data--that is,
read your documents and archive a "model" containing their statistics.
The text indexed for the model must contain all the training data.
The testing data may also be read as part of the model, or it can be
left out and read later.

<p>The model is placed in the file system location indicated by the
<tt>-d</tt> option.  If no <tt>-d</tt> option is given, the name
<tt>~/.rainbow</tt> is used by default.  (The model name is actually a
file system directory containing separate files for different aspects
of the model.  If the model directory location does not exist when
rainbow is invoked, rainbow will create it automatically.)

<p>In the most basic setting, the text data should be in plain text
files, one file per document.  No special tags are needed at the
beginning or end of documents.  Thus, for example, you should be able
to index a directory of UseNet articles or MH mailboxes without any
preprocessing. 

<p>The files should be organized in directories, such that all documents
with the same class label are contained within a directory.  (Rainbow
does not directly support classification tasks in which individual
documents have multiple class labels.  I recommend handling this as a
series of binary classification tasks.)

<p>To build a model, call rainbow with the <tt>--index</tt> (or
<tt>-i</tt>) option, followed by one directory name for each class.
For example, to build a model that distinguishes among the three
<tt>talk.politics</tt> classes of <i>20_newsgroups</i> (and store
that model in the directory <tt>~/model</tt>), invoke rainbow like
this:

<pre>
   rainbow -d ~/model --index ~/20_newsgroups/talk.politics.*
</pre>

where <tt>~/20_newsgroups/talk.politics.*</tt> would be expanded by
the shell like this:

<pre>
   ~/20_newsgroups/talk.politics.guns ~/20_newsgroups/talk.politics.mideast ~/20_newsgroups/talk.politics.misc
</pre>

<p>To build a model containing all 20 newsgroups, type:

<pre>
   rainbow -d ~/model --index ~/20_newsgroups/*
</pre>


<h4>2.1. Tokenizing Options</h4>

<p>When indexing a file, rainbow turns the file's stream of characters
into tokens by a process called tokenization or "lexing".

<p>By default, rainbow tokenizes all alphabetic sequences of
characters (that is, characters in A-Z and a-z), changing each sequence
to lowercase and tossing out any token that is on the "stoplist", a
list of common words such as "the", "of", "is", etc.

<!-- rainbow's tokenizer operates as follows: skip all
non-alphabetic characters (anything not A-Z or a-z), read characters
into a buffer until a non-alphabetic characters is reached, turn all
uppercase letters into lowercase, skip the token in the buffer if it
is in the "stoplist".  Otherwise, include this token among the
statistics, and read the next token. -->

<p>Rainbow supports several options for tokenizing text.  For example,
the <tt>--skip-headers</tt> (or <tt>-h</tt>) option causes rainbow to
skip newsgroup or email headers before beginning tokenization.  It
does this by scanning forward until it finds two newlines in a row.
(This option should be used for the <i>20_newsgroups</i> dataset,
since the headers include the name of the correct newsgroup!)

<pre>
   rainbow -d ~/model -h --index ~/20_newsgroups/talk.politics.*
</pre>
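<p>The default tokenizing rules and the header-skipping rule described
above can be sketched in a few lines of Python.  This is only an
illustration, not rainbow's actual lexer: the names <tt>STOPLIST</tt>,
<tt>skip_headers</tt> and <tt>tokenize</tt> are made up for this
example, the stoplist is a tiny stand-in for the real one, and the
handling of a file with no blank line is an assumption.

```python
import re

# Tiny stand-in for the stoplist; rainbow's real list is much longer.
STOPLIST = {"the", "of", "is", "a", "and", "to", "in"}

def skip_headers(text):
    """Drop everything up to and including the first blank line."""
    head, sep, body = text.partition("\n\n")
    # Assumption: if there is no blank line, keep the text unchanged.
    return body if sep else text

def tokenize(text, skip_hdrs=False, stoplist=STOPLIST):
    """Return lowercased alphabetic runs that are not on the stoplist."""
    if skip_hdrs:
        text = skip_headers(text)
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return [t for t in tokens if t not in stoplist]

article = ("Newsgroups: talk.politics.guns\nSubject: Re: laws\n\n"
           "The text of the 2nd Amendment is short.")
print(tokenize(article, skip_hdrs=True))
```

<p>Without header skipping, the newsgroup name in the
<tt>Newsgroups:</tt> line would be tokenized along with the body,
which is exactly the leak that <tt>--skip-headers</tt> avoids.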

<p>Some other examples of handy tokenizing options are:

<p><table border=1>

<tr><td> <tt>--use-stemming</tt> </td>
<td> Pass all words through the Porter
stemmer before counting them.  (The default is not to stem.)
</td></tr>

<tr><td> <tt>--no-stoplist</tt></td>
<td> Include words in the stoplist among the
statistics.  (The default is to skip them.  The stoplist is the SMART
system's list of 524 common words, like "the" and "of".)
</td></tr>

<tr><td> <tt>--istext-avoid-uuencode</tt> </td>
<td>Attempt to detect when a file mostly consists of a uuencoded block,
and if so, skip it.  This option is useful for tokenizing UseNet
articles, because word statistics can be thrown off by repetitive
tokens found in uuencoded images.
</td></tr>

<tr><td> <tt>--skip-html</tt> </td>
<td>Skip all characters between "&lt;" and "&gt;".  Useful for lexing HTML
files. 
</td></tr>

<tr><td> <tt>--lex-pipe-command SHELLCMD</tt>  </td>
<td> Rather than tokenizing the
file directly, pass the file as standard input into this shell
command, and tokenize the standard output of the shell command.  For
example, to index only the first 20 lines of each file, use:<br>
<tt>rainbow --lex-pipe-command "head -n 20" -d ~/model --index
~/20_newsgroups/talk.politics.* </tt>
</td></tr>

<tr><td> <tt>--lex-white</tt>  </td>
<td> Rather than tokenizing the file with the default rules (skipping
non-alphabetics, downcasing, etc), instead simply grab space-delimited
strings, and make no further changes.  This option is useful if you
want to take complete control of tokenization with your own script, as
specified by <tt>--lex-pipe-command</tt>, and don't want rainbow to
make any further changes.
</td></tr>

</table>
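<p>As an illustration of taking over tokenization with your own
script, here is a minimal Python filter of the kind that could be
named in <tt>--lex-pipe-command</tt> and combined with
<tt>--lex-white</tt>.  The function name <tt>filter_stream</tt> and
the token rule (letters, digits and internal hyphens) are choices made
for this sketch, not anything rainbow prescribes.

```python
import io
import re

def filter_stream(src, dst):
    """Copy src to dst as lowercase, space-separated tokens, line by line."""
    for line in src:
        # Keep letters, digits and internal hyphens, e.g. "x-ray" or "386".
        tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", line)
        dst.write(" ".join(t.lower() for t in tokens) + "\n")

# In a script used as the pipe command, the call would be:
#     filter_stream(sys.stdin, sys.stdout)
out = io.StringIO()
filter_stream(io.StringIO("The 386 X-Ray, again!\n"), out)
print(out.getvalue(), end="")
```

<p>Saved under a hypothetical name such as <tt>mylex.py</tt> (with the
<tt>sys.stdin</tt>/<tt>sys.stdout</tt> call enabled), the filter would
be passed as the <tt>SHELLCMD</tt> argument, and <tt>--lex-white</tt>
would keep rainbow from altering the tokens it emits.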

<p>For a complete list of rainbow tokenizing options, see the "Lexing
options" section in the output of <tt>rainbow --help</tt>.


<h3>3. Classifying Documents</h3>

<p>Once indexing is performed and a model has been archived to disk,
rainbow can perform document classification.  Statistics from a
set of <i>training</i> documents will determine the parameters of the
classifier; classification of a set of <i>testing</i> documents will
be output.

<p>The <tt>--test</tt> (or <tt>-t</tt>) option performs a specified
number of trials and prints the classifications of the documents in
each trial's test-set to standard output.  For example,

<pre>
   rainbow -d ~/model --test-set=0.4 --test=3
</pre>

will output the results of three trials, each with a randomized
test-train split in which 60 percent of the documents are used for
training, and 40 percent for testing.  Details of the
<tt>--test-set</tt> option are described in section 3.1.

<p>Classification results are printed as a series of text lines that look
something like this:
<pre>
   /home/mccallum/20_newsgroups/talk.politics.misc/178939 talk.politics.misc talk.politics.misc:0.98 talk.politics.mideast:0.015 talk.politics.guns:0.005
</pre>

<p>That is, one test file per line, consisting of the following fields:
<pre>
   directory/filename TrueClass TopPredictedClass:score1 2ndPredictedClass:score2 ...
</pre>
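<p>If you want to process these lines yourself, the field layout above
is easy to parse.  The following Python sketch assumes that filenames
and class names contain no white space; the function name
<tt>parse_result_line</tt> is made up for this example.

```python
def parse_result_line(line):
    """Split one result line into (filename, true_class, [(class, score), ...])."""
    fields = line.split()
    filename, true_class = fields[0], fields[1]
    predictions = []
    for field in fields[2:]:
        # rpartition splits at the last ':' so the score is always isolated.
        name, _, score = field.rpartition(":")
        predictions.append((name, float(score)))
    return filename, true_class, predictions

line = ("/home/mccallum/20_newsgroups/talk.politics.misc/178939 "
        "talk.politics.misc talk.politics.misc:0.98 "
        "talk.politics.mideast:0.015 talk.politics.guns:0.005")
filename, true_class, predictions = parse_result_line(line)
top_class = predictions[0][0]
print(true_class, top_class, true_class == top_class)
```

<p>A document counts as correctly classified when the true class
equals the first predicted class, as it does for this line.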

<p>The Perl script <tt>rainbow-stats</tt>, which is provided in the
Bow source distribution, reads lines like this and outputs average
accuracy, standard error, and a confusion matrix.

<p>For example, the command

<pre>
   rainbow -d ~/model --test-set=0.4 --test=2 | rainbow-stats
</pre>

will, for a model built from the three <tt>talk.politics</tt> classes,
print something like the following:

<p>
<dd><table border=1>
<tr><td>
<pre>
Trial 0

Correct: 1079 out of 1201 (89.84 percent accuracy)

 - Confusion details, row is actual, column is predicted
               classname   0   1   2  :total
 0    talk.politics.guns 372   2  27  :401  92.77%
 1 talk.politics.mideast   6 371  23  :400  92.75%
 2    talk.politics.misc  44  20 336  :400  84.00%

Trial 1

Correct: 1086 out of 1201 (90.42 percent accuracy)

 - Confusion details, row is actual, column is predicted
               classname   0   1   2  :total
 0    talk.politics.guns 377   2  22  :401  94.01%
 1 talk.politics.mideast   6 371  23  :400  92.75%
 2    talk.politics.misc  40  22 338  :400  84.50%

Percent_Accuracy  average 90.13 stderr 0.21
</pre>
</td></tr>
</table></dd>

<p>(To give you some idea of the speed of rainbow: On a 200 MHz
Pentium, the above rainbow command finishes in 14 seconds.  The
command reads the model from disk, and performs two trials--each
building a model from about 1800 documents and testing on about 1200.
The rainbow-stats command finishes in 2 seconds.)
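<p>The summary figures in the sample output can be checked by hand.
The Python sketch below recomputes the per-trial accuracy from the two
confusion matrices shown, then the average and standard error.  Taking
the standard error as the population standard deviation divided by the
square root of the number of trials is an assumption about how
<tt>rainbow-stats</tt> works; it does reproduce the figures printed
above.

```python
import math

def accuracy_percent(confusion):
    """Percent correct from a confusion matrix (rows actual, columns predicted)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total

def mean_and_stderr(values):
    """Mean and standard error, using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(variance) / math.sqrt(n)

# Confusion matrices copied from Trial 0 and Trial 1 above.
trial0 = [[372, 2, 27], [6, 371, 23], [44, 20, 336]]
trial1 = [[377, 2, 22], [6, 371, 23], [40, 22, 338]]
accuracies = [accuracy_percent(trial0), accuracy_percent(trial1)]
mean, stderr = mean_and_stderr(accuracies)
print("%.2f %.2f" % (mean, stderr))   # prints "90.13 0.21"
```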

<p>The Perl script <tt>rainbow-be</tt>, also provided in the Bow
source distribution, reads lines like this and outputs
precision-recall breakeven points.

<p>You can vary the precision with which classification scores are
printed using the <tt>--score-precision=NUM</tt> option, where
<tt>NUM</tt> is the number of digits to print after the decimal point.
Note, however, that several internal variables are of type
<i>float</i> (which has only about 7 significant digits of resolution)
and the classification scores are calculated as <i>double</i>s (which
have only about 15 to 17 significant digits), so precision is
inherently limited.  The default printed score precision is 10.
This option works only with the naive Bayes classifier.

<h4>3.1. Specifying the Training and Testing Sets</h4>

In cases in which the test documents have been tokenized as part of
the model, the test set is specified with the <tt>--test-set</tt>
option.  For example, 

<pre>
   rainbow -d ~/model --test-set=0.5 --test=1
</pre>

will use a pseudo-random number generator to select one-half of the
documents in the model and place them into the test set, then place
the remaining documents in the training set.

<p>When the argument to <tt>--test-set</tt> contains no decimal point,
the number is interpreted as an exact number of documents.  For
example, 

<pre>
   rainbow -d ~/model --test-set=30 --test=1
</pre>

will place 30 documents in the test set, attempting to select a number
of documents from each class such that the class proportions in the
test set roughly match those in the entire model.

<p>If the number argument is followed by "<tt>pc</tt>", then the
argument indicates a number of documents <i>per class</i>.  Thus

<pre>
   rainbow -d ~/model --test-set=200pc --test=1
</pre>

will place into the test set 200 randomly-selected documents from each
of the classes in the model, for a total of 600 test documents, if the
model was built using three classes.

<p>You can also specify exactly which files should be in the test set,
listing them by name.  If the argument to <tt>--test-set</tt> contains
non-numeric characters, it is interpreted as a filename, which in turn
should contain a list of white-space-separated filenames of documents
indexed in the model.  For example,

<pre>
   rainbow -d ~/model --test-set=~/filelist1 --test=1
</pre>

will open the file <tt>~/filelist1</tt> and take from there the list
of names of files to be placed in the test set.  Note that the class
labels of these documents are already known from when the
<tt>model</tt> file was built.

<p>The filenames in the list should be given exactly as they were when the model
was built.  A list of all the filenames of documents contained in a
rainbow model can be obtained with the following command:

<pre> 
   rainbow -d ~/model --print-doc-names
</pre> 

<p>See section 4.4 for more details on the <tt>--print-doc-names</tt>
option. 

<p>The default value for <tt>--test-set</tt> is 0, indicating that no
documents are placed in the test set.  Thus, when using the
<tt>--test</tt> option, you must use the <tt>--test-set</tt> option in
order to give rainbow some documents to classify.


<h5>3.1.1. Training Set</h5>

<p>The training set can be specified using the <tt>--train-set</tt>
option with the same types of arguments described above.  For example,

<pre>
   rainbow -d ~/model --test-set=~/filelist1 --train-set=~/filelist2 --test=1
</pre>

will take all test documents from the list in <tt>~/filelist1</tt>,
all training documents from <tt>~/filelist2</tt>, and ignore all
documents that don't appear in either list.  It is an error for a
document to be listed in both the test set and the train set.

<p>The default value for the <tt>--train-set</tt> is the keyword
<tt>remaining</tt>, which specifies that all documents not placed in
the test set should be placed in the training set.

<p>The keyword <tt>remaining</tt> can also be used for the test set.
For example,

<pre>
   rainbow -d ~/model --train-set=1pc --test-set=remaining --test=1
</pre>

will put one document from each class into the training set, and put
all the rest of the documents in the testing set.
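<p>To summarize the argument forms accepted by <tt>--test-set</tt> and
<tt>--train-set</tt>, here is a small Python sketch that sorts an
argument into the five cases described in this section.  It is a
reading of the rules as documented here, not rainbow's own parser; the
function name <tt>interpret_set_spec</tt> and the order of the checks
are assumptions.

```python
import re

def interpret_set_spec(arg):
    """Classify a --test-set / --train-set argument by its form."""
    if arg == "remaining":
        return ("remaining", None)           # all documents not yet placed
    if re.fullmatch(r"\d+pc", arg):
        return ("per_class", int(arg[:-2]))  # N documents from each class
    if re.fullmatch(r"\d+", arg):
        return ("count", int(arg))           # exact number of documents
    if re.fullmatch(r"\d*\.\d+|\d+\.\d*", arg):
        return ("fraction", float(arg))      # fraction of all documents
    return ("file", arg)                     # file listing document names

for spec in ["0.4", "30", "200pc", "remaining", "~/filelist1"]:
    print(spec, interpret_set_spec(spec))
```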

<h5>3.1.2. Classifying Files not in the Model</h5>

<p>You can classify files that were not indexed into the model by
replacing the <tt>--test</tt> option with the <tt>--test-files</tt>
option.  For example,

<pre>
   rainbow -d ~/model --test-files ~/more-talk.politics/*
</pre>

will use all the files in the model as the training set, and output
classifications for all files contained in the subdirectories of
<tt>~/more-talk.politics/</tt>.  Note that the number and basenames of
the directories listed must match those given to <tt>--index</tt> when
the model was built.

<p>You can classify a single file (read from standard input or from a
specified filename) using the <tt>--query</tt> option.

<h4>3.2. Rainbow Classification as a Server</h4>

<p>Rainbow can also efficiently classify individual documents not in
the model by running as a server.  In this mode, rainbow starts, reads
the model from disk, then waits for query documents by listening on a
network socket.

<p>To do this, run rainbow with the command line option
<tt>--query-server=PORT</tt> (where <tt>PORT</tt> is some port number
larger than 1000).  For example

<pre>
   rainbow -d ~/model --query-server=1821
</pre>

<p>In order to test the server, telnet to whatever port you specified
(e.g. "<tt>telnet localhost 1821</tt>"), type in a document you want
to classify, then type '<tt>.</tt>' alone on a line, followed by
Return.  Rainbow will then print back to the socket (and thus to your
screen) a list of classes and their scores.  If you write your own
program to connect to a rainbow server (to replace <tt>telnet</tt> in
this example), make sure to use the sequence "<tt>\r\n</tt>" to send a
newline.  Thus, to indicate the end of a query document, you should
send the sequence "<tt>\r\n.\r\n</tt>".
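<p>A minimal client along these lines might look as follows in Python.
The framing (lines joined with "<tt>\r\n</tt>", then
"<tt>.</tt>" alone on a line) follows the description above; how the
server signals the end of its reply is not specified here, so reading
until the connection closes or a timeout expires is an assumption, as
are the function names.  A document that itself contains a line
consisting only of "<tt>.</tt>" would end the query early.

```python
import socket

def frame_query(text):
    """Encode a document: CRLF line ends, then '.' alone on a line."""
    lines = text.splitlines()
    return ("\r\n".join(lines) + "\r\n.\r\n").encode("ascii", errors="replace")

def query_rainbow(text, host="localhost", port=1821, timeout=10.0):
    """Send one document and return whatever the server writes back."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(frame_query(text))
        chunks = []
        try:
            while True:
                data = sock.recv(4096)
                if not data:          # server closed the connection
                    break
                chunks.append(data)
        except socket.timeout:        # assumption: silence means end of reply
            pass
    return b"".join(chunks).decode("ascii", errors="replace")

print(frame_query("gun control\nlegislation"))
```

<p>With a server started as in the example above, a call such as
<tt>query_rainbow("some text")</tt> would return the list of classes
and scores as a string.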

<h4>3.3. Feature Selection</h4>

<p>Feature set or "vocabulary" size may be reduced by occurrence
counts or by average mutual information with the class variable
(<i>[Cover &amp; Thomas, "Elements of Information Theory", Wiley &amp; Sons,
1991]</i>), which we also call "information gain".

<p><table border=1>

<tr><td> <tt>--prune-vocab-by-infogain=N</tt><br>
or <tt>-T</tt> </td>
<td> Remove all but the top <tt>N</tt> words by selecting words with highest
     average mutual information with the class variable.  Default is
     <tt>N</tt>=0, which is a special case that removes no words.
</td></tr>

<tr><td> <tt>--prune-vocab-by-doc-count=N</tt><br>
or <tt>-D</tt> </td>
<td> Remove words that occur in <tt>N</tt> or fewer documents.
</td></tr>

<tr><td> <tt>--prune-vocab-by-occur-count=N</tt><br>
or <tt>-O</tt> </td>
<td> Remove words that occur fewer than <tt>N</tt> times.
</td></tr>

</table>

<p>For example, to classify using only the 50 words that have the
highest mutual information with the class variable, type:

<pre>
   rainbow -d ~/model --prune-vocab-by-infogain=50 --test=1
</pre>

<p>If you want to see what these 50 words are, type:

<pre>
   rainbow -d ~/model -I 50
</pre>

<p>There is more information about <tt>-I</tt> and other
diagnostic-printing command-line options in section 4.

<h4>3.4. Selecting the Classification Method</h4>

<p>Rainbow supports several different classification methods (and the
code makes it easy to add more).  The default is Naive Bayes, but
k-nearest neighbor, TFIDF, and probabilistic indexing are all
available.  These are specified with the <tt>--method</tt> (or
<tt>-m</tt>) option, followed by one of the following keywords:
<tt>naivebayes, knn, tfidf, prind</tt>.  For example,

<pre>
   rainbow -d ~/model --method=tfidf --test=1
</pre>

will use TFIDF/Rocchio for classification.


<h4>3.5. Naive Bayes Options</h4>

The following options change parameters of Naive Bayes.

<p><table border=1>

<tr><td> <tt>--smoothing-method=METHOD</tt> </td>
<td> Set the method for smoothing word probabilities to avoid zeros;
 <tt>METHOD</tt> may be one of: <tt>goodturing, laplace, mestimate,
 wittenbell</tt>.  The default is <tt>laplace</tt>, which is a uniform
 Dirichlet prior with alpha=2.
</td></tr>

<tr><td> <tt>--event-model=EVENTNAME</tt> </td>
<td> Set what objects will be considered the `events' of the
  probabilistic model.  <tt>EVENTNAME</tt> can be one of:
  <tt>word</tt> (i.e. multinomial, unigram), <tt>document</tt>
  (i.e. multi-variate Bernoulli, bit vector), or
  <tt>document-then-word</tt> (i.e. document-length-normalized
  multinomial).  For more details on these methods, see <i><a
  href="http://www.cs.cmu.edu/~mccallum">A Comparison of Event Models
  for Naive Bayes Text Classification</a></i>.  The default is
  <tt>word</tt>. 
</td></tr>

<tr><td> <tt>--uniform-class-priors</tt> </td>
<td> When classifying and calculating mutual information, use equal
 prior probabilities on classes, instead of using the distribution
 determined from the training data.
</td></tr>

</table>
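<p>To make the defaults concrete, here is a small Python sketch of a
multinomial ("word" event model) naive Bayes classifier with Laplace
smoothing.  A uniform Dirichlet prior with alpha=2 corresponds to
adding one to every word count.  This is a textbook formulation for
illustration, not rainbow's implementation; the function names and the
toy documents are invented.

```python
import math
from collections import Counter

def train(docs_by_class):
    """docs_by_class maps class name -> list of token lists."""
    counts = {c: Counter(t for doc in docs for t in doc)
              for c, docs in docs_by_class.items()}
    vocab = set().union(*counts.values())
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    priors = {c: len(docs) / total_docs for c, docs in docs_by_class.items()}
    return counts, vocab, priors

def word_prob(word, cls, counts, vocab):
    """Laplace (add-one) estimate of P(word | class)."""
    return (1 + counts[cls][word]) / (len(vocab) + sum(counts[cls].values()))

def classify(tokens, counts, vocab, priors):
    """Return (class, score) pairs, highest score first; scores sum to one."""
    log_scores = {}
    for cls in counts:
        score = math.log(priors[cls])
        for t in tokens:
            if t in vocab:            # ignore words outside the vocabulary
                score += math.log(word_prob(t, cls, counts, vocab))
        log_scores[cls] = score
    top = max(log_scores.values())    # subtract the max to avoid underflow
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp.values())
    return sorted(((c, v / norm) for c, v in exp.items()),
                  key=lambda cv: cv[1], reverse=True)

docs = {"guns": [["gun", "law"], ["gun", "rights"]],
        "mideast": [["israel", "peace"], ["israel", "law"]]}
counts, vocab, priors = train(docs)
print(classify(["gun", "law"], counts, vocab, priors))
```

<p>Because of the added counts, a word that never occurs in a class
still receives a small non-zero probability there, which is the point
of smoothing.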


<h3>4. Diagnostics</h3>

<p>In addition to using a model for document classification, you can
also print various information about the model.  

<h4>4.1. Words by Mutual Information with the Class</h4>

<p>To see a list of the words that have highest average
mutual information with the class variable (sorted by mutual
information), use the <tt>--print-word-infogain</tt> (or <tt>-I</tt>)
option.  For example

<pre>
   rainbow -d ~/model -I 10
</pre>

<p>When invoked on a model containing all 20 classes of the
<i>20_newsgroups</i> dataset, the following is printed to standard
out:

<pre>
  0.09381 windows
  0.09003 god
  0.07900 dod
  0.07700 government
  0.06609 team
  0.06570 game
  0.06448 people
  0.06323 car
  0.06171 bike
  0.05609 hockey
</pre>

The above is calculated using all the training data.  To restrict the
calculation to a subset of the data, use any of the methods for
defining the training set described in section 3.1.  For example, to
calculate mutual information based just on the documents listed in
<tt>~/docs1</tt>, type:

<pre>
   rainbow -d ~/model --train-set=~/docs1 -I 10
</pre>
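<p>For readers who want to see what such a score measures, the Python
sketch below computes the average mutual information between the class
variable and the presence or absence of a word in a document.  Using
document-level presence and base-2 logarithms is an assumption made
for this example; rainbow's exact event definition and units may
differ, so the numbers are not comparable with the listing above.

```python
import math

def average_mutual_information(docs, word):
    """docs is a list of (class_name, set_of_words); result is in bits."""
    n = len(docs)
    joint = {}
    for cls, words in docs:
        key = (cls, word in words)
        joint[key] = joint.get(key, 0) + 1
    class_totals = {}
    presence_totals = {}
    for (cls, present), count in joint.items():
        class_totals[cls] = class_totals.get(cls, 0) + count
        presence_totals[present] = presence_totals.get(present, 0) + count
    mi = 0.0
    # Cells with zero count are absent from joint and contribute nothing.
    for (cls, present), count in joint.items():
        p_joint = count / n
        p_indep = (class_totals[cls] / n) * (presence_totals[present] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

docs = [("hockey", {"team", "puck"}), ("hockey", {"team"}),
        ("space", {"orbit"}), ("space", {"people", "orbit"})]
for w in ["team", "people", "zebra"]:
    print(w, round(average_mutual_information(docs, w), 5))
```

<p>In this toy data, "team" separates the two classes perfectly and
scores one bit, while a word that appears in no document scores zero.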


<h4>4.2. Words by Probability</h4>

To print the probabilities of all the words, use the
<tt>--print-word-probabilities</tt> option.  For example, the
following command will print the word probabilities in the
<tt>talk.politics.mideast</tt> class, after pruning the vocabulary to
the ten words that have highest mutual information with the class.

<pre>
   rainbow -d ~/model -T 10 --print-word-probabilities=talk.politics.mideast
</pre>

<p>Here is the output of this command.  Notice that the word
probabilities correctly sum to one. 

<pre>
   god                             0.05026782
   people                          0.64977338
   government                      0.24062629
   car                             0.03502266
   game                            0.00412031
   team                            0.01030078
   bike                            0.00041203
   dod                             0.00041203
   hockey                          0.00123609
   windows                         0.00782859
</pre>


<h4>4.3. Word Counts and Probabilities</h4>

<p>To print the number of times a word occurs in each class (as well as
the total number of words in the class, and the word's probability in
each class), use the <tt>--print-word-counts</tt> option.  For
example, the following command prints diagnostics about the word
<i>team</i>.

<pre>
   rainbow -d ~/model --print-word-counts=team
</pre>

<p>Here is the output of the above command, on a model built from
<i>20_newsgroups</i>.  Note that the word probabilities (in
parentheses) may not simply equal the ratio of the two preceding
counts because of smoothing.

<pre>
        2 /    125039  (  0.00002) alt.atheism
        6 /    119511  (  0.00005) comp.graphics
        5 /     91147  (  0.00005) comp.os.ms-windows.misc
        1 /     71002  (  0.00001) comp.sys.mac.hardware
       12 /    131120  (  0.00009) comp.windows.x
       15 /     62130  (  0.00024) misc.forsale
        2 /     83942  (  0.00002) rec.autos
       10 /     78685  (  0.00013) rec.motorcycles
      543 /     88623  (  0.00613) rec.sport.baseball
      970 /    115109  (  0.00843) rec.sport.hockey
        9 /    136655  (  0.00007) sci.crypt
        1 /     81206  (  0.00001) sci.electronics
        8 /    125235  (  0.00006) sci.med
       71 /    128754  (  0.00055) sci.space
        2 /    141389  (  0.00001) soc.religion.christian
       13 /    135054  (  0.00010) talk.politics.guns
       24 /    208367  (  0.00012) talk.politics.mideast
       14 /    164266  (  0.00009) talk.politics.misc
        9 /    130013  (  0.00007) talk.religion.misc
</pre>

<p>(Note: the probability of the word <i>team</i> here is not equal to
its probability from the <tt>--print-word-probabilities</tt>
command above, because we did not reduce vocabulary size to 10 in this
example.)

<h4>4.4. Document Names</h4>

<p>To print a list of the filenames of all documents, use the
<tt>--print-doc-names</tt> option.  Document filenames are printed in
the order in which they were indexed.  Thus all documents of the same
class appear contiguously.

<p>This command is often useful for generating lists of document names
to be used with the <tt>--test-set</tt> and <tt>--train-set</tt>
options.

<p>For example, the following command prints 10 randomly selected
documents that were indexed.  In order to obtain a random
selection, <tt>gawk</tt>, the GNU version of <tt>awk</tt>, is used
to generate random numbers, and <tt>sort</tt> is used to permute the
list.  The command <tt>head</tt> is then used to select the first 10
from the permuted list.

<pre>
   rainbow -d ~/model --print-doc-names \
   | gawk '{print rand(), $1}' | sort -n | gawk '{print $2}' | head -n 10
</pre>

<p>Example output of this command on the <i>20_newsgroups</i> data set
is:

<pre>
   ~/20_newsgroups/rec.motorcycles/104735
   ~/20_newsgroups/comp.windows.x/67345
   ~/20_newsgroups/sci.med/59555
   ~/20_newsgroups/talk.politics.misc/178418
   ~/20_newsgroups/misc.forsale/76867
   ~/20_newsgroups/rec.sport.hockey/52601
   ~/20_newsgroups/talk.politics.mideast/77394
   ~/20_newsgroups/comp.os.ms-windows.misc/9661
   ~/20_newsgroups/talk.politics.mideast/75947
   ~/20_newsgroups/talk.politics.misc/179105
</pre>
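<p>The same idea can be used to prepare the file lists expected by
<tt>--test-set</tt> and <tt>--train-set</tt>.  The Python sketch below
shuffles a list of document names and splits it into two disjoint
lists.  The function names and the placeholder document names are
invented for this example; in practice the names would be the lines
printed by <tt>--print-doc-names</tt>.

```python
import random

def split_names(names, test_fraction, seed):
    """Return (test_names, train_names) with no name in both lists."""
    rng = random.Random(seed)         # private generator, reproducible
    shuffled = list(names)
    rng.shuffle(shuffled)
    n_test = int(round(test_fraction * len(shuffled)))
    return shuffled[:n_test], shuffled[n_test:]

def write_list(path, names):
    """Write names one per line; any white space separates entries."""
    with open(path, "w") as handle:
        handle.write("\n".join(names) + "\n")

# Placeholder names standing in for --print-doc-names output.
names = ["classA/%d" % i for i in range(10)]
test_names, train_names = split_names(names, 0.4, seed=1)
print(len(test_names), len(train_names))   # prints "4 6"
```

<p>Writing the two lists with <tt>write_list</tt> gives files that do
not share any document, which matters because it is an error for a
document to be listed in both the test set and the train set.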

<p>You can also print the names of just those documents that fall into
one of the sets of the test/train split.  For example

<pre>
   rainbow -d ~/model --train-set=3pc --print-doc-names=train
</pre>

will select three documents from each class to be in the training set,
and print just those documents.  The output of this command might be:

<pre>
   ~/20_newsgroups/talk.politics.guns/53329
   ~/20_newsgroups/talk.politics.guns/54704
   ~/20_newsgroups/talk.politics.guns/54656
   ~/20_newsgroups/talk.politics.mideast/76420
   ~/20_newsgroups/talk.politics.mideast/76523
   ~/20_newsgroups/talk.politics.mideast/77392
   ~/20_newsgroups/talk.politics.misc/179005
   ~/20_newsgroups/talk.politics.misc/176939
   ~/20_newsgroups/talk.politics.misc/179083
</pre>

<h4>4.5. Printing Entire Word/Document Matrix</h4>

<p>You can print the entire word/document matrix to standard output
using the <tt>--print-matrix</tt> option.  Documents are printed one
to a line.  The first (white-space separated) field is the document
name; this is followed by entries for the words.

<p>There are several different alternatives for the format in which
the words are printed, and all of them are amenable to processing by
<tt>perl</tt> or <tt>awk</tt>, and somewhat human-readable.  The
alternatives are specified by an optional "formatting" argument to the
<tt>--print-matrix</tt> option.

<p>The format is specified as a string of three characters, consisting
of one selection from each of the following three groups:

<p><table border=1>

<tr><td colspan=2>
Print entries for all words in the vocabulary, or just print the words
that actually occur in the document.</td></tr>

<tr><td width=15% align=center><tt>a</tt></td><td>all</td></tr>
<tr><td width=15% align=center><tt>s</tt></td><td>sparse, (default)</td></tr>

<tr><td colspan=2>
Print word counts as integers or as binary presence/absence indicators.
</td></tr>

<tr><td width=15% align=center><tt>b</tt></td><td>binary</td></tr>
<tr><td width=15% align=center><tt>i</tt></td><td>integer, (default)</td></tr>

<tr><td colspan=2>
How to indicate the word itself.
</td></tr>

<tr><td width=15% align=center><tt>n</tt></td><td>integer word index</td></tr>
<tr><td width=15% align=center><tt>w</tt></td><td>word string</td></tr>
<tr><td width=15% align=center><tt>c</tt></td><td>combination of
   integer word index and word string, (default)</td></tr> 
<tr><td width=15% align=center><tt>e</tt></td><td>empty, don't print
   anything to indicate the identity of the word</td></tr>

</table>

<p>For example, to print a sparse matrix, in which the word string and
the word counts for each document are listed, use the format string
"<tt>siw</tt>".  The command

<pre>
   rainbow -d ~/model -T 100 --print-matrix=siw | head -n 10
</pre>

<p>reduces the vocabulary to only 100 words, then prints 

<pre>
   ~/20_newsgroups/alt.atheism/53366 alt.atheism  god 2  jesus 1  nasa 2  people 2  
   ~/20_newsgroups/alt.atheism/53367 alt.atheism  jesus 2  jewish 1  christian 1  
   ~/20_newsgroups/alt.atheism/51247 alt.atheism  god 4  evidence 2  
   ~/20_newsgroups/alt.atheism/51248 alt.atheism  
   ~/20_newsgroups/alt.atheism/51249 alt.atheism  nasa 1  country 2  files 1  law 3  system 1  government 1  
   ~/20_newsgroups/alt.atheism/51250 alt.atheism  god 3  people 2  evidence 1  law 1  system 1  public 5  rights 1  fact 1  religious 1  
   ~/20_newsgroups/alt.atheism/51251 alt.atheism  
   ~/20_newsgroups/alt.atheism/51252 alt.atheism  people 4  evidence 2  system 2  religion 1  
   ~/20_newsgroups/alt.atheism/51253 alt.atheism  god 19  christian 1  evidence 1  faith 5  car 2  space 1  game 1  
   ~/20_newsgroups/alt.atheism/51254 alt.atheism  people 1  jewish 3  game 1  bible 7  
</pre>
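<p>Output in the "<tt>siw</tt>" format is straightforward to read back
into a program.  The following Python sketch parses one such line into
a document name, a class name and a dictionary of word counts; it
assumes that names and words contain no white space, and the function
name <tt>parse_siw_line</tt> is made up for this example.

```python
def parse_siw_line(line):
    """Return (document_name, class_name, {word: count}) for one matrix line."""
    fields = line.split()
    name, cls, rest = fields[0], fields[1], fields[2:]
    # The remaining fields alternate: word, count, word, count, ...
    counts = {rest[i]: int(rest[i + 1]) for i in range(0, len(rest), 2)}
    return name, cls, counts

line = "~/20_newsgroups/alt.atheism/53366 alt.atheism  god 2  jesus 1  nasa 2  people 2  "
name, cls, counts = parse_siw_line(line)
print(cls, counts)
```

<p>A document with none of the retained words, such as the fourth line
of the output above, simply yields an empty dictionary.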

<p>To print a non-sparse matrix, indicating the binary
presence/absence of all words in the vocabulary for each document, use
the format string 
"<tt>abe</tt>".  The command

<pre>
   rainbow -d ~/model -T 10 --print-matrix=abe | head -n 10
</pre>

<p>reduces the vocabulary to only 10 words, then prints 

<pre>
   ~/20_newsgroups/alt.atheism/53366 alt.atheism  1  1  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/53367 alt.atheism  0  0  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51247 alt.atheism  1  0  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51248 alt.atheism  0  0  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51249 alt.atheism  0  0  1  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51250 alt.atheism  1  1  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51251 alt.atheism  0  0  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51252 alt.atheism  0  1  0  0  0  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51253 alt.atheism  1  0  0  1  1  0  0  0  0  0  
   ~/20_newsgroups/alt.atheism/51254 alt.atheism  0  1  0  0  1  0  0  0  0  0  
</pre>

<p>

<p>For a summary of all the diagnostic options, see the "Diagnostics"
section of the <tt>rainbow --help</tt> output.


<h3>5. General options</h3>

<h4>5.1. Verbosity of Progress Messages</h4>

<p>Rainbow prints messages about its progress to standard error as it
runs.  You can change the verbosity of these progress messages with
the <tt>--verbosity=LEVEL</tt> (or <tt>-v</tt>) option.  The argument
<tt>LEVEL</tt> should be an integer from 0 to 5, 0 being silent (no
progress messages printed to standard error), and 5 being most
verbose.  The default is 2.

<p>For example, the following command will print no progress messages.

<pre>
   rainbow -v 0 -d ~/model -I 10
</pre>

<p>Some of the progress messages print backspace characters in order
to show running counters.  When running rainbow with GDB inside an
Emacs buffer, however, the backspace character is printed as a
character escape sequence and fills the buffer.  You can avoid
printing progress messages that contain backspace characters by using
the <tt>--no-backspaces</tt> (or <tt>-b</tt>) option.


<h4>5.2. Initializing the Pseudo-Random Seed</h4>

<p>Rainbow may use a pseudo-random number generator for several tasks,
including the randomized test-train splits described in section 3.1.
You can specify the seed for this random number generator using the
<tt>--random-seed</tt> option.  For example

<pre>
   rainbow -d ~/model -t 1 --test-set=0.3 --random-seed=2
</pre>

<p>You can verify that use of the same random seed results in
identical test/train splits by using the <tt>--print-doc-names</tt>
option.  For example

<pre>
   rainbow -d ~/model --random-seed=1 --train-set=4pc --print-doc-names=train
</pre>

will perform the specified test/train split, then print only the
training documents.  The above command will produce the same output
each time it is called.  However, the above command with the
<tt>--random-seed=1</tt> option removed will print different document
names each time.

<p>If this option is not given, then the seed is set using the
computer's real-time clock.
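<p>The effect of a fixed seed can be illustrated outside rainbow.  The
Python sketch below picks a fixed number of documents per class, using
the name of each document's parent directory as its class, and returns
the same selection whenever the same seed is given.  Python's
generator is unrelated to the one rainbow uses, so this does not
reproduce rainbow's own selections; the function name and the
placeholder document names are invented.

```python
import os
import random
from collections import defaultdict

def sample_per_class(names, per_class, seed):
    """Pick per_class names from each class; the class is the parent directory."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for name in names:
        by_class[os.path.basename(os.path.dirname(name))].append(name)
    chosen = []
    for cls in sorted(by_class):      # fixed class order for repeatability
        chosen.extend(rng.sample(by_class[cls], per_class))
    return chosen

# Placeholder names: two classes with ten documents each.
names = (["data/guns/%d" % i for i in range(10)] +
         ["data/misc/%d" % i for i in range(10)])
first = sample_per_class(names, 4, seed=1)
second = sample_per_class(names, 4, seed=1)
print(first == second, len(first))    # prints "True 8"
```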

<p>
<p>


<hr>
Last updated: 30 September 1998,
<i><a href="mailto:mccallum@cs.cmu.edu">mccallum@cs.cmu.edu</a></i>

</BODY>
</HTML>