\input texinfo @c -*-texinfo-*-
@c %**start of header
@setfilename libbow.info
@settitle Programmer's Guide to BOW
@c %**end of header
@ifinfo
@format
START-INFO-DIR-ENTRY
* Libbow:: Bag-Of-Words Library
END-INFO-DIR-ENTRY
@end format
@end ifinfo
@c set the vars BOWVERSION
@include version.texi
@ifinfo
This file documents the features and implementation of Libbow.
Copyright (C) 1996, 1997, 1998 Free Software Foundation, Inc.
Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that the
section entitled ``GNU Library General Public License'' is included exactly as
in the original, and provided that the entire resulting derived work is
distributed under the terms of a permission notice identical to this one.
Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that the section entitled ``GNU Library General Public License'' and
this permission notice may be included in translations approved by the
Free Software Foundation instead of in the original English.
@end ifinfo
@titlepage
@title Programmer's Guide to BOW
@subtitle A library of C code for statistical text processing.
@sp 3
@c @subtitle last updated October 1996
@subtitle Version @value{BOWVERSION}
@author Andrew Kachites McCallum (mccallum@@cs.cmu.edu)
@page
@vskip 0pt plus 1filll
Copyright @copyright{} 1996, 1997 Andrew Kachites McCallum.
Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that the
section entitled ``GNU Library General Public License'' is included exactly as
in the original, and provided that the entire resulting derived work is
distributed under the terms of a permission notice identical to this one.
Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions,
except that the section entitled ``GNU Library General Public License'' may be
included in a translation approved by the author instead of in the original
English.
@end titlepage
@node Top, Overview, (dir), (dir)
@ifinfo
This manual documents how to install and use @samp{libbow}
(a library of C code for statistical text processing),
version @value{BOWVERSION}.
@end ifinfo
@menu
* Overview::
* Traversing Directories to find Text Files::
* Getting Words from Text Files::
* Mapping between Words and Integers::
* Word Vectors::
* Vectors of Documents::
* A Matrix of Document/Word Statistics::
* Document/Word Models::
* Vector-per-Class Models::
* Arrays of Structures::
* Command-line argument processing with Argp::
@end menu
@node Overview, Traversing Directories to find Text Files, Top, Top
@chapter Overview
@include libbow-desc.texi
Pronunciation guide: ``libbow'' rhymes with ``lib-low'', not ``lib-cow''.

Notes from Devika:
@itemize @bullet
@item
How to delimit documents.
@item
How to tag things---how to augment the lexers.
@item
Lead in gently, in steps. Big picture... more and more interesting things.
@item
Variety of examples.
@item
Guide to the sea of command-line references. Structure.
@item
When to consider using which switch.
@item
Sensible defaults.
@end itemize
@node Traversing Directories to find Text Files, Getting Words from Text Files, Overview, Top
@chapter Traversing Directories to find Text Files
@node Getting Words from Text Files, Mapping between Words and Integers, Traversing Directories to find Text Files, Top
@chapter Getting Words from Text Files
Lexer buffers, Lexers
@menu
* The Simple Lexer::
* The N-Gram Lexer::
* The Email/News Lexer::
* The HTML Lexer::
* Functions Useful for Writing Lexers::
@end menu
@node The Simple Lexer, The N-Gram Lexer, Getting Words from Text Files, Getting Words from Text Files
@section The Simple Lexer
@node The N-Gram Lexer, The Email/News Lexer, The Simple Lexer, Getting Words from Text Files
@section The N-Gram Lexer
@node The Email/News Lexer, The HTML Lexer, The N-Gram Lexer, Getting Words from Text Files
@section The Email/News Lexer
@node The HTML Lexer, Functions Useful for Writing Lexers, The Email/News Lexer, Getting Words from Text Files
@section The HTML Lexer
@node Functions Useful for Writing Lexers, , The HTML Lexer, Getting Words from Text Files
@section Functions Useful for Writing Lexers
@deftypefun int bow_stem_porter (char *@var{word})
Apply the Porter stemming algorithm to modify @var{word}. Return 0 on success.
@end deftypefun
@deftypefun int bow_isalpha (int @var{character})
A function wrapper around POSIX's @code{isalpha} macro.
@end deftypefun
@deftypefun int bow_isgraph (int @var{character})
A function wrapper around POSIX's @code{isgraph} macro.
@end deftypefun
@deftypefun int bow_stoplist_present (const char *@var{word})
Return non-zero if @var{word} is on the stoplist.
@end deftypefun
@deftypefun int bow_stoplist_add_from_file (const char *@var{filename})
Add the whitespace-delimited words in @var{filename} to the stoplist.
Return the number of words added, or -1 if the file could not be
opened.
@end deftypefun
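
The stoplist contract described above (membership test, bulk load from a
whitespace-delimited file, -1 when the file cannot be opened) can be
pictured with a small standalone sketch.  This is not libbow's
implementation: every @code{toy_} name, the fixed-size table, the
63-character word limit and the choice to skip duplicate words are
assumptions made only for illustration.

@verbatim
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical limits for this toy; libbow's real limits may differ. */
#define TOY_MAX_WORDS 1024
#define TOY_MAX_WORD_LEN 63

/* The toy stoplist: a flat table searched linearly. */
static char toy_stoplist[TOY_MAX_WORDS][TOY_MAX_WORD_LEN + 1];
static int toy_stoplist_count = 0;

/* Return non-zero if WORD is on the toy stoplist. */
int
toy_stoplist_present (const char *word)
{
  int i;

  assert (word != NULL);
  for (i = 0; i < toy_stoplist_count; i++)
    if (strcmp (toy_stoplist[i], word) == 0)
      return 1;
  return 0;
}

/* Add the whitespace-delimited words in FILENAME to the toy stoplist.
   Return the number of words added, or -1 if the file could not be
   opened.  Words already present are skipped (an assumption of this
   sketch); words longer than TOY_MAX_WORD_LEN are split by fscanf.  */
int
toy_stoplist_add_from_file (const char *filename)
{
  char word[TOY_MAX_WORD_LEN + 1];
  int added = 0;
  FILE *fp = fopen (filename, "r");

  if (fp == NULL)
    return -1;
  while (toy_stoplist_count < TOY_MAX_WORDS
         && fscanf (fp, "%63s", word) == 1)
    {
      if (toy_stoplist_present (word))
        continue;
      strcpy (toy_stoplist[toy_stoplist_count++], word);
      added++;
    }
  fclose (fp);
  return added;
}
```
@end verbatim

A lexer built on such a table would call the membership test on each
candidate word and drop the word when the result is non-zero.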
@node Mapping between Words and Integers, Word Vectors, Getting Words from Text Files, Top
@chapter Mapping between Words and Integers
@menu
* Generic Maps between Integers and Strings::
* The Global Dictionary::
@end menu
@node Generic Maps between Integers and Strings, The Global Dictionary, Mapping between Words and Integers, Mapping between Words and Integers
@section Generic Maps between Integers and Strings
@deftp Type bow_int4str
@end deftp
@deftypefun {bow_int4str *} bow_int4str_new (int @var{capacity})
Allocate, initialize and return a new int/string mapping structure. The
parameter @var{capacity} is used as a hint about the number of words to
expect; if you don't know or don't care about a @var{capacity} value,
pass 0, and a default value will be used.
@end deftypefun
@deftypefun {const char *} bow_int2str (bow_int4str *@var{map}, int @var{index})
Given an integer @var{index}, return its corresponding string.
@end deftypefun
@deftypefun int bow_str2int (bow_int4str *@var{map}, const char *@var{string})
Given the char-pointer @var{string}, return its integer
index. If this is the first time we're seeing @var{string}, add it to
the mapping, assign it a new index, and return the new index.
@end deftypefun
@deftypefun int bow_str2int_no_add (bow_int4str *@var{map}, const char *@var{string})
Given the char-pointer @var{string}, return its integer index. If
@var{string} is not yet in the mapping, return -1.
@end deftypefun
@deftypefun void bow_int4str_write (bow_int4str *@var{map}, FILE *@var{fp})
Write the int-str mapping to file-pointer @var{fp}.
@end deftypefun
@deftypefun {bow_int4str *} bow_int4str_new_from_fp (FILE *@var{fp})
Return a new int-str mapping, created by reading file-pointer @var{fp}.
@end deftypefun
@deftypefun {bow_int4str *} bow_int4str_new_from_file (const char *@var{filename})
Return a new int-string mapping, created by reading @var{filename}.
@end deftypefun
@deftypefun void bow_int4str_free (bow_int4str *@var{map})
Free the memory held by the int-string mapping @var{map}.
@end deftypefun
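
The difference between the adding and non-adding lookups described above
can be illustrated with a standalone toy map.  This is only a sketch of
the documented behavior, not libbow's data structure: the @code{toy_}
names, the linear search, the default capacity of 16 and the
@code{NULL} return for an out-of-range index are all assumptions.

@verbatim
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_DEFAULT_CAPACITY 16   /* hypothetical default */

/* Hypothetical toy map; NOT libbow's bow_int4str. */
typedef struct
{
  char **strings;   /* strings[i] is the string whose index is i */
  int count;        /* number of strings stored */
  int capacity;     /* allocated length of strings[] */
} toy_map;

/* Allocate SIZE bytes or abort; keeps the sketch short. */
static void *
toy_xmalloc (size_t size)
{
  void *p = malloc (size);
  if (p == NULL)
    abort ();
  return p;
}

/* Create an empty map.  CAPACITY is only a hint; pass 0 for a default. */
toy_map *
toy_map_new (int capacity)
{
  toy_map *map = toy_xmalloc (sizeof *map);

  if (capacity <= 0)
    capacity = TOY_DEFAULT_CAPACITY;
  map->strings = toy_xmalloc (capacity * sizeof *map->strings);
  map->count = 0;
  map->capacity = capacity;
  return map;
}

/* Return the index of STRING, or -1 if it is not in the map. */
int
toy_str2int_no_add (const toy_map *map, const char *string)
{
  int i;

  assert (map != NULL && string != NULL);
  for (i = 0; i < map->count; i++)
    if (strcmp (map->strings[i], string) == 0)
      return i;
  return -1;
}

/* Return the index of STRING, adding it with a new index if unseen. */
int
toy_str2int (toy_map *map, const char *string)
{
  size_t len;
  int index = toy_str2int_no_add (map, string);

  if (index >= 0)
    return index;
  if (map->count == map->capacity)
    {
      /* Grow the table; the capacity was only ever a hint. */
      map->capacity *= 2;
      map->strings = realloc (map->strings,
                              map->capacity * sizeof *map->strings);
      if (map->strings == NULL)
        abort ();
    }
  len = strlen (string) + 1;
  map->strings[map->count] = toy_xmalloc (len);
  memcpy (map->strings[map->count], string, len);
  return map->count++;
}

/* Return the string for INDEX, or NULL if INDEX is out of range. */
const char *
toy_int2str (const toy_map *map, int index)
{
  assert (map != NULL);
  if (index < 0 || index >= map->count)
    return NULL;
  return map->strings[index];
}

/* Free the map and every string it owns. */
void
toy_map_free (toy_map *map)
{
  int i;

  for (i = 0; i < map->count; i++)
    free (map->strings[i]);
  free (map->strings);
  free (map);
}
```
@end verbatim

In this toy, indices are handed out densely in order of first
appearance, so looking up the same string twice yields the same index,
while the non-adding lookup leaves the map unchanged.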
@node The Global Dictionary, , Generic Maps between Integers and Strings, Mapping between Words and Integers
@section The Global Dictionary
@deftypefun {const char *} bow_int2word (int @var{wi})
Given a ``word index'' @var{wi}, return its word, according to the global
word-int mapping.
@end deftypefun
@deftypefun int bow_word2int (const char *@var{word})
Given a @var{word}, return its ``word index,'' according to the global
word-int mapping; if it's not yet in the mapping, add it.
@end deftypefun
@deftypefun int bow_word2int_add_occurrence (const char *@var{word})
Like @code{bow_word2int()}, except it also increments the occurrence
count associated with @var{word}.
@end deftypefun
@deftypevar int bow_word2int_do_not_add
If this is non-zero, then @code{bow_word2int()} will return -1 when
asked for the index of a word that is not already in the mapping.
Essentially, setting this to non-zero makes @code{bow_word2int()} and
@code{bow_word2int_add_occurrence()} behave like
@code{bow_str2int_no_add()}.
@end deftypevar
@deftypefun int bow_words_add_occurrences_from_text_dir (const char *@var{dirname}, const char *@var{exception_name})
Add to the word occurrence counts by recursively descending directory
@var{dirname} and lexing all the text files; skip any files matching
@var{exception_name}.
@end deftypefun
@deftypefun int bow_words_occurrences_for_wi (int @var{wi})
Return the number of times @code{bow_word2int_add_occurrence()} was
called with the word whose index is @var{wi}.
@end deftypefun
@deftypefun void bow_words_set_map (bow_int4str *@var{map}, int @var{free_old_map})
Replace the current word/int mapping with @var{map}.
@end deftypefun
@deftypefun void bow_words_remove_occurrences_less_than (int @var{occur})
Modify the int/word mapping by removing all words that occurred fewer
than @var{occur} times. WARNING: This completely changes the
word/int mapping; any @code{wv}s, @code{wi2dvf}s or @code{barrel}s
you built with the old mapping will have bogus word indices afterward.
@end deftypefun
@deftypefun int bow_num_words ()
Return the total number of unique words in the int/word map.
@end deftypefun
@deftypefun void bow_words_write (FILE *@var{fp})
Save the int/word map to file-pointer @var{fp}.
@end deftypefun
@deftypefun void bow_words_write_to_file (const char *@var{filename})
Same as above, but with a filename instead of a @code{FILE*}.
@end deftypefun
@deftypefun void bow_words_read_from_fp (FILE *@var{fp})
Read the int/word map from file-pointer @var{fp}.
@end deftypefun
@deftypefun void bow_words_read_from_file (const char *@var{filename})
Same as above, but with a filename instead of a @code{FILE*}.
@end deftypefun
@deftypefun void bow_words_reread_from_file (const char *@var{filename}, int @var{force_update})
Same as above, but don't bother rereading unless @var{filename} is different
from the last one, or @var{force_update} is non-zero.
@end deftypefun
@node Word Vectors, Vectors of Documents, Mapping between Words and Integers, Top
@chapter Word Vectors
@menu
* Creating a Word Vector from a Text File::
* Writing and Reading Word Vectors as Data Files::
@end menu
@node Creating a Word Vector from a Text File, Writing and Reading Word Vectors as Data Files, Word Vectors, Word Vectors
@section Creating a Word Vector from a Text File
@node Writing and Reading Word Vectors as Data Files, , Creating a Word Vector from a Text File, Word Vectors
@section Writing and Reading Word Vectors as Data Files
@node Vectors of Documents, A Matrix of Document/Word Statistics, Word Vectors, Top
@chapter Vectors of Documents
@deftp Type bow_dv
@end deftp
@node A Matrix of Document/Word Statistics, Document/Word Models, Vectors of Documents, Top
@chapter A Matrix of Document/Word Statistics
@deftp Type bow_dvf
@end deftp
@deftp Type bow_wi2dvf
@end deftp
@node Document/Word Models, Vector-per-Class Models, A Matrix of Document/Word Statistics, Top
@chapter Document/Word Models
@deftp Type bow_barrel
@end deftp
@node Vector-per-Class Models, Arrays of Structures, Document/Word Models, Top
@chapter Vector-per-Class Models
@node Arrays of Structures, Command-line argument processing with Argp, Vector-per-Class Models, Top
@chapter Arrays of Structures
@menu
* Arrays indexed by integers::
* Arrays indexed by strings::
@end menu
@node Arrays indexed by integers, Arrays indexed by strings, Arrays of Structures, Arrays of Structures
@section Arrays indexed by integers
@node Arrays indexed by strings, , Arrays indexed by integers, Arrays of Structures
@section Arrays indexed by strings
@node Command-line argument processing with Argp, , Arrays of Structures, Top
@chapter Command-line argument processing with Argp
@contents
@bye