\input texinfo @c -*-texinfo-*-

@c %**start of header
@setfilename libbow.info
@settitle Programmer's Guide to BOW
@c %**end of header


@ifinfo
@format
START-INFO-DIR-ENTRY
* Libbow::                      Bag-Of-Words Library
END-INFO-DIR-ENTRY
@end format
@end ifinfo

@c set the vars BOWVERSION
@include version.texi

@ifinfo
This file documents the features and implementation of Libbow.

Copyright (C) 1996, 1997, 1998 Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that the
section entitled ``GNU Library General Public License'' is included exactly as
in the original, and provided that the entire resulting derived work is
distributed under the terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that the section entitled ``GNU Library General Public License'' and
this permission notice may be included in translations approved by the
Free Software Foundation instead of in the original English.
@end ifinfo

@titlepage
@title Programmer's Guide to BOW
@subtitle A library of C code for statistical text processing.
@sp 3
@c @subtitle last updated October 1996
@subtitle Version @value{BOWVERSION}
@author Andrew Kachites McCallum (mccallum@@cs.cmu.edu)
@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1996, 1997 Andrew Kachites McCallum.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that the
section entitled ``GNU Library General Public License'' is included exactly as
in the original, and provided that the entire resulting derived work is
distributed under the terms of a permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that the section entitled ``GNU Library General Public License'' may be
included in a translation approved by the author instead of in the original
English.
@end titlepage

@node Top, Overview, (dir), (dir)

@ifinfo
This manual documents how to install and use @samp{libbow}
(a library of C code for statistical text processing),
version @value{BOWVERSION}.
@end ifinfo

@menu
* Overview::                    
* Traversing Diretories to find Text Files::  
* Getting Words from Text Files::  
* Mapping between Words and Integers::  
* Word Vectors::                
* Vectors of Documents::        
* A Matrix of Document/Word Statistics::  
* Document/Word Models::        
* Vector-per-Class Models::     
* Arrays of Structures::        
* Command-line argument processing with Argp::  
@end menu

@node Overview, Traversing Diretories to find Text Files, Top, Top
@chapter Overview

@include libbow-desc.texi

Pronunciation guide: ``libbow'' rhymes with ``lib-low'', not ``lib-cow''.


Notes from Devika:

@itemize @bullet
@item
How to delimit documents.
@item
How to tag things---how to augment the lexers.
@item
Lead in gently, in steps.  Big picture first, then more and more
interesting things.
@item
Variety of examples.
@item
Guide to the sea of command-line references.  Structure.
@item
When to consider using which switch.
@item
Sensible defaults.
@end itemize




@node Traversing Diretories to find Text Files, Getting Words from Text Files, Overview, Top
@chapter Traversing Directories to find Text Files


@node Getting Words from Text Files, Mapping between Words and Integers, Traversing Diretories to find Text Files, Top
@chapter Getting Words from Text Files

Lexer buffers, Lexers

@menu
* The Simple Lexer::            
* The N-Gram Lexer::            
* The Email/News Lexer::        
* The HTML Lexer::              
* Functions Useful for Writing Lexers::  
@end menu

@node The Simple Lexer, The N-Gram Lexer, Getting Words from Text Files, Getting Words from Text Files
@section The Simple Lexer

@node The N-Gram Lexer, The Email/News Lexer, The Simple Lexer, Getting Words from Text Files
@section The N-Gram Lexer

@node The Email/News Lexer, The HTML Lexer, The N-Gram Lexer, Getting Words from Text Files
@section The Email/News Lexer

@node The HTML Lexer, Functions Useful for Writing Lexers, The Email/News Lexer, Getting Words from Text Files
@section The HTML Lexer

@node Functions Useful for Writing Lexers,  , The HTML Lexer, Getting Words from Text Files
@section Functions Useful for Writing Lexers

@deftypefun int bow_stem_porter (char *@var{word})
Apply the Porter stemming algorithm to modify @var{word}.  Return 0 on success.
@end deftypefun

@deftypefun int bow_isalpha (int @var{character})
A function wrapper around POSIX's @code{isalpha} macro.
@end deftypefun

@deftypefun int bow_isgraph (int @var{character})
A function wrapper around POSIX's @code{isgraph} macro.
@end deftypefun

@deftypefun int bow_stoplist_present (const char *@var{word})
Return non-zero if @var{word} is on the stoplist.
@end deftypefun

@deftypefun int bow_stoplist_add_from_file (const char *@var{filename})
Add to the stoplist the whitespace-delimited words from
@var{filename}.  Return the number of words added, or -1 if the file
could not be opened.
@end deftypefun
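
To show how helpers of this kind typically combine inside a lexer, here
is a minimal self-contained sketch.  It does not call libbow: the
@code{toy_} names, the hard-coded stoplist, the lowercasing step, and
the truncation of overlong tokens are all assumptions made for
illustration, and stemming is left out entirely.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed stoplist; a real lexer would load one from a file. */
static const char *const toy_stoplist[] = { "a", "an", "of", "the" };

/* Return non-zero if WORD is on the toy stoplist. */
int toy_stoplist_present(const char *word)
{
  size_t n = sizeof toy_stoplist / sizeof toy_stoplist[0];
  for (size_t i = 0; i < n; i++)
    if (strcmp(toy_stoplist[i], word) == 0)
      return 1;
  return 0;
}

/* Copy the next lowercased alphabetic token from *CURSOR into BUF,
   skipping stoplist words.  Tokens longer than BUFSIZE-1 characters are
   truncated.  Advance *CURSOR past the token.  Return the token length,
   or 0 at end of input. */
size_t toy_next_word(const char **cursor, char *buf, size_t bufsize)
{
  const char *p = *cursor;
  assert(bufsize > 1);
  for (;;) {
    size_t len = 0;
    /* Skip anything that cannot start a word. */
    while (*p != '\0' && !isalpha((unsigned char)*p))
      p++;
    /* Collect a run of letters, dropping those that do not fit. */
    while (isalpha((unsigned char)*p)) {
      if (len + 1 < bufsize)
        buf[len++] = (char)tolower((unsigned char)*p);
      p++;
    }
    buf[len] = '\0';
    /* Stop at end of input or at the first word not on the stoplist. */
    if (len == 0 || !toy_stoplist_present(buf)) {
      *cursor = p;
      return len;
    }
  }
}
```

A caller would point a @code{const char *} cursor at the start of a
buffer and call @code{toy_next_word} repeatedly until it returns 0,
passing each returned token on to the word/integer mapping.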



@node Mapping between Words and Integers, Word Vectors, Getting Words from Text Files, Top
@chapter Mapping between Words and Integers

@menu
* Generic Maps between Integers and Strings::  
* The Global Dictionary::       
@end menu

@node Generic Maps between Integers and Strings, The Global Dictionary, Mapping between Words and Integers, Mapping between Words and Integers
@section Generic Maps between Integers and Strings

@deftp Type bow_int4str
@end deftp

@deftypefun {bow_int4str *} bow_int4str_new (int @var{capacity})
Allocate, initialize and return a new int/string mapping structure.  The
parameter @var{capacity} is used as a hint about the number of words to
expect; if you don't know or don't care about a @var{capacity} value,
pass 0, and a default value will be used.
@end deftypefun

@deftypefun {const char *} bow_int2str (bow_int4str *@var{map}, int @var{index})
Given an integer @var{index}, return its corresponding string in @var{map}.
@end deftypefun

@deftypefun int bow_str2int (bow_int4str *@var{map}, const char *@var{string})
Given the char-pointer @var{string}, return its integer
index.  If this is the first time we're seeing @var{string}, add it to
the mapping, assign it a new index, and return the new index.
@end deftypefun

@deftypefun int bow_str2int_no_add (bow_int4str *@var{map}, const char *@var{string})
Given the char-pointer @var{string}, return its integer index.  If
@var{string} is not yet in the mapping, return -1.
@end deftypefun

@deftypefun void bow_int4str_write (bow_int4str *@var{map}, FILE *@var{fp})
Write the int-str mapping to file-pointer @var{fp}.
@end deftypefun

@deftypefun {bow_int4str *} bow_int4str_new_from_fp (FILE *@var{fp})
Return a new int-str mapping, created by reading file-pointer @var{fp}.
@end deftypefun

@deftypefun {bow_int4str *} bow_int4str_new_from_file (const char *@var{filename})
Return a new int-string mapping, created by reading @var{filename}.
@end deftypefun

@deftypefun void bow_int4str_free (bow_int4str *@var{map})
Free the memory held by the int-string mapping @var{map}.
@end deftypefun
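
The lookup contract described in this section (a fresh index on first
sight from the adding lookup, -1 from the non-adding lookup) can be
sketched with a small self-contained map.  This is not libbow's
implementation: the @code{toy_} names, the linear search, the default
capacity of 16, and the dense zero-based index assignment are
assumptions made for illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an int/string map: a growable array of
   strings, where a string's position in the array is its index. */
typedef struct {
  char **strings;
  int count;
  int capacity;
} toy_int4str;

/* Create a map.  A CAPACITY of 0 (or less) selects a default. */
toy_int4str *toy_int4str_new(int capacity)
{
  toy_int4str *map = malloc(sizeof *map);
  if (map == NULL)
    return NULL;
  map->capacity = capacity > 0 ? capacity : 16;
  map->count = 0;
  map->strings = malloc((size_t)map->capacity * sizeof *map->strings);
  if (map->strings == NULL) {
    free(map);
    return NULL;
  }
  return map;
}

/* Look up STRING without adding it; return -1 if it is absent. */
int toy_str2int_no_add(const toy_int4str *map, const char *string)
{
  assert(map != NULL && string != NULL);
  for (int i = 0; i < map->count; i++)
    if (strcmp(map->strings[i], string) == 0)
      return i;
  return -1;
}

/* Look up STRING, adding it with a fresh index on first sight.
   Return -1 only if memory runs out. */
int toy_str2int(toy_int4str *map, const char *string)
{
  int index = toy_str2int_no_add(map, string);
  if (index >= 0)
    return index;
  if (map->count == map->capacity) {
    /* Double the array so repeated additions stay cheap. */
    int new_capacity = map->capacity * 2;
    char **grown = realloc(map->strings,
                           (size_t)new_capacity * sizeof *grown);
    if (grown == NULL)
      return -1;
    map->strings = grown;
    map->capacity = new_capacity;
  }
  /* Store a private copy so the caller may reuse its buffer. */
  size_t len = strlen(string) + 1;
  char *copy = malloc(len);
  if (copy == NULL)
    return -1;
  memcpy(copy, string, len);
  map->strings[map->count] = copy;
  return map->count++;
}

/* Return the string for INDEX, or NULL if INDEX is out of range. */
const char *toy_int2str(const toy_int4str *map, int index)
{
  assert(map != NULL);
  if (index < 0 || index >= map->count)
    return NULL;
  return map->strings[index];
}

/* Release the map and every string copy it owns. */
void toy_int4str_free(toy_int4str *map)
{
  if (map == NULL)
    return;
  for (int i = 0; i < map->count; i++)
    free(map->strings[i]);
  free(map->strings);
  free(map);
}
```

A linear search keeps the sketch short; a map meant for real
vocabularies would normally use a hash table for the string-to-integer
direction.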




@node The Global Dictionary,  , Generic Maps between Integers and Strings, Mapping between Words and Integers
@section The Global Dictionary

@deftypefun {const char *} bow_int2word (int @var{wi})
Given a ``word index'' @var{wi}, return its word, according to the global
word-int mapping.
@end deftypefun

@deftypefun int bow_word2int (const char *@var{word})
Given a @var{word}, return its ``word index,'' according to the global
word-int mapping; if it's not yet in the mapping, add it.
@end deftypefun

@deftypefun int bow_word2int_add_occurrence (const char *@var{word})
Like @code{bow_word2int()}, except it also increments the occurrence
count associated with @var{word}.
@end deftypefun

@deftypevar int bow_word2int_do_not_add
If this is non-zero, then @code{bow_word2int()} will return -1 when
asked for the index of a word that is not already in the mapping.
Essentially, setting this to non-zero makes @code{bow_word2int()} and
@code{bow_word2int_add_occurrence()} behave like
@code{bow_str2int_no_add()}.
@end deftypevar

@deftypefun int bow_words_add_occurrences_from_text_dir (const char *@var{dirname}, const char *@var{exception_name})
Add to the word occurrence counts by recursively descending directory
@var{dirname} and lexing all the text files; skip any files matching
@var{exception_name}.
@end deftypefun

@deftypefun int bow_words_occurrences_for_wi (int @var{wi})
Return the number of times @code{bow_word2int_add_occurrence()} was
called with the word whose index is @var{wi}.
@end deftypefun

@deftypefun void bow_words_set_map (bow_int4str *@var{map}, int @var{free_old_map})
Replace the current word/int mapping with @var{map}.
@end deftypefun

@deftypefun void bow_words_remove_occurrences_less_than (int @var{occur})
Modify the int/word mapping by removing all words that occurred fewer
than @var{occur} times.  WARNING: This totally changes the
word/int mapping; any @code{wv}'s, @code{wi2dvf}'s or @code{barrel}'s
you build with the old mapping will have bogus word indices afterward.
@end deftypefun
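
The reason pruning invalidates previously stored indices can be seen in
a minimal self-contained sketch.  It does not use libbow: the
@code{toy_} names, the fixed-size parallel arrays, and the assumption
that pruning compacts the table so that surviving words slide down to
lower indices are all illustrative choices, not a description of the
library's internals.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_WORDS 8

/* Hypothetical stand-in for the global dictionary: parallel arrays of
   words and occurrence counts, where the array position is the index. */
typedef struct {
  const char *words[TOY_MAX_WORDS];
  int counts[TOY_MAX_WORDS];
  int num_words;
} toy_dictionary;

/* Return the index of WORD, or -1 if it is not in the dictionary. */
int toy_word2int_no_add(const toy_dictionary *dict, const char *word)
{
  for (int wi = 0; wi < dict->num_words; wi++)
    if (strcmp(dict->words[wi], word) == 0)
      return wi;
  return -1;
}

/* Record one occurrence of WORD, adding it if needed.  Return its
   index, or -1 if the fixed-size table is full.  Only the pointer is
   stored, so WORD must outlive the dictionary. */
int toy_word2int_add_occurrence(toy_dictionary *dict, const char *word)
{
  int wi = toy_word2int_no_add(dict, word);
  if (wi < 0) {
    if (dict->num_words == TOY_MAX_WORDS)
      return -1;
    wi = dict->num_words++;
    dict->words[wi] = word;
    dict->counts[wi] = 0;
  }
  dict->counts[wi]++;
  return wi;
}

/* Drop every word seen fewer than OCCUR times, compacting the arrays.
   Surviving words slide down to fill the gaps, so their indices change. */
void toy_remove_occurrences_less_than(toy_dictionary *dict, int occur)
{
  int kept = 0;
  for (int wi = 0; wi < dict->num_words; wi++) {
    if (dict->counts[wi] >= occur) {
      dict->words[kept] = dict->words[wi];
      dict->counts[kept] = dict->counts[wi];
      kept++;
    }
  }
  dict->num_words = kept;
}
```

In this sketch, if three words are added in order and only the middle
one falls below the threshold, the third word moves from index 2 to
index 1.  Any structure that recorded index 2 before pruning now refers
to a slot that no longer exists, which is the hazard the warning
describes.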

@deftypefun int bow_num_words ()
Return the total number of unique words in the int/word map.
@end deftypefun

@deftypefun void bow_words_write (FILE *@var{fp})
Save the int/word map to file-pointer @var{fp}.
@end deftypefun

@deftypefun void bow_words_write_to_file (const char *@var{filename})
Same as @code{bow_words_write()}, but takes a filename instead of a
@code{FILE*}.
@end deftypefun

@deftypefun void bow_words_read_from_fp (FILE *@var{fp})
Read the int/word map from file-pointer @var{fp}.
@end deftypefun

@deftypefun void bow_words_read_from_file (const char *@var{filename})
Same as @code{bow_words_read_from_fp()}, but takes a filename instead of a
@code{FILE*}.
@end deftypefun

@deftypefun void bow_words_reread_from_file (const char *@var{filename}, int @var{force_update})
Same as @code{bow_words_read_from_file()}, but skip the read unless
@var{filename} differs from the last file read or @var{force_update} is
non-zero.
@end deftypefun



@node Word Vectors, Vectors of Documents, Mapping between Words and Integers, Top
@chapter Word Vectors

@menu
* Creating a Word Vector from a Text File::  
* Writing and Reading Word Vectors as Data Files::  
@end menu

@node Creating a Word Vector from a Text File, Writing and Reading Word Vectors as Data Files, Word Vectors, Word Vectors
@section Creating a Word Vector from a Text File

@node Writing and Reading Word Vectors as Data Files,  , Creating a Word Vector from a Text File, Word Vectors
@section Writing and Reading Word Vectors as Data Files


@node Vectors of Documents, A Matrix of Document/Word Statistics, Word Vectors, Top
@chapter Vectors of Documents

@deftp Type bow_dv
@end deftp


@node A Matrix of Document/Word Statistics, Document/Word Models, Vectors of Documents, Top
@chapter A Matrix of Document/Word Statistics

@deftp Type bow_dvf
@end deftp

@deftp Type bow_wi2dvf
@end deftp


@node Document/Word Models, Vector-per-Class Models, A Matrix of Document/Word Statistics, Top
@chapter Document/Word Models

@deftp Type bow_barrel
@end deftp


@node Vector-per-Class Models, Arrays of Structures, Document/Word Models, Top
@chapter Vector-per-Class Models


@node Arrays of Structures, Command-line argument processing with Argp, Vector-per-Class Models, Top
@chapter Arrays of Structures

@menu
* Arrays indexed by integers::  
* Arrays indexed by strings::   
@end menu

@node Arrays indexed by integers, Arrays indexed by strings, Arrays of Structures, Arrays of Structures
@section Arrays indexed by integers

@node Arrays indexed by strings,  , Arrays indexed by integers, Arrays of Structures
@section Arrays indexed by strings

@node Command-line argument processing with Argp,  , Arrays of Structures, Top
@chapter Command-line argument processing with Argp

@contents
@bye