libstemmer_java
===============
This document pertains to the Java version of the libstemmer distribution,
available for download from:
https://snowballstem.org/download.html
What is Stemming?
-----------------
Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps *connection*, *connections*, *connective*,
*connected*, and *connecting* to *connect*. So a search for *connected*
would also find documents which only have the other forms.
The stem is often a word itself, but not always: text search systems, which
are the intended field of use, do not require it. We also aim to conflate
words with the same meaning, rather than all words with a common linguistic
root (so *awe* and *awful* don't have the same stem). Over-stemming is more
problematic than under-stemming, so we tend not to stem in cases that are hard
to resolve. If you always need a root form, or a root form which is itself a
word, then Snowball's stemming algorithms likely aren't the right answer.
Requirements
============
The Java code generated by Snowball requires Java >= 7 (since Snowball 3.0.0).
Java 7 was released in 2011 and Java 6 reached end of life in 2013, so we don't
expect this requirement to be a problem.
Compiling the library
=====================
Run the Java compiler on all the Java source files under the java directory.
For example, on Unix this can be done by changing directory into the java
directory and running::

    javac org/tartarus/snowball/*.java org/tartarus/snowball/ext/*.java

This will compile the library and also an example program "TestApp" which
provides a command line interface to the library.
Using the library
=================
The stemming algorithms generally expect the input text to use composed accents
(Unicode NFC or NFKC) and to have been folded to lower case already.
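
As an illustration of that preprocessing, here is a minimal sketch using only
the standard ``java.text.Normalizer`` class. The class and method names
(``StemmerInput``, ``prepare``) are hypothetical and not part of the library,
and lower-casing with ``Locale.ROOT`` is an assumption which may not be the
right folding for every language.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of libstemmer_java.
final class StemmerInput {
    /** Compose accents (Unicode NFC) and fold to lower case before stemming. */
    static String prepare(String word) {
        String composed = Normalizer.normalize(word, Normalizer.Form.NFC);
        // Locale.ROOT gives locale-independent case folding (an assumption here).
        return composed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // "e" followed by a combining acute accent is composed into one character.
        System.out.println(prepare("Cafe\u0301"));
    }
}
```

The result of ``prepare`` is what would then be passed to a stemmer object.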
There is currently no formal documentation on the use of the Java version
of the library. Additionally, its interface is not guaranteed to be
stable.
The best documentation of the library is the source of the TestApp example
program.
The stemmer code is re-entrant, but a stemmer object is not thread-safe: the
same object must not be used concurrently in different threads.
If you want to perform stemming concurrently in different threads, we suggest
creating a new stemmer object for each thread. The alternative is to share
stemmer objects between threads and protect access using a mutex or similar,
but that's liable to slow your program down as threads can end up waiting for
the lock.
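
One way to get a stemmer object per thread is a ``ThreadLocal``. The sketch
below is illustrative only. So that it compiles without the library, it uses a
hypothetical ``StandInStemmer`` with a toy rule. This class mirrors the
``setCurrent``/``stem``/``getCurrent`` call sequence that TestApp uses, and in
real code it would be replaced by one of the generated stemmer classes. The
names ``PerThreadStemming``, ``stem`` and ``stemAll`` are likewise assumptions.
Anonymous classes are used rather than lambdas to stay within Java 7.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerThreadStemming {
    /** Hypothetical stand-in for a generated stemmer; it holds mutable state. */
    static final class StandInStemmer {
        private String current = "";
        void setCurrent(String word) { current = word; }
        boolean stem() {
            // Toy rule for illustration only: strip one trailing "s".
            if (current.length() > 1 && current.endsWith("s")) {
                current = current.substring(0, current.length() - 1);
            }
            return true;
        }
        String getCurrent() { return current; }
    }

    // One stemmer per thread, created lazily on first use in that thread.
    private static final ThreadLocal<StandInStemmer> STEMMER =
        new ThreadLocal<StandInStemmer>() {
            @Override protected StandInStemmer initialValue() {
                return new StandInStemmer();
            }
        };

    /** Stems a word using the calling thread's own stemmer; no lock needed. */
    static String stem(String word) {
        StandInStemmer stemmer = STEMMER.get();
        stemmer.setCurrent(word);
        stemmer.stem();
        return stemmer.getCurrent();
    }

    /** Stems all words on a fixed-size pool, preserving input order. */
    static List<String> stemAll(List<String> words, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<Future<String>>();
            for (final String word : words) {
                futures.add(pool.submit(new Callable<String>() {
                    @Override public String call() { return stem(word); }
                }));
            }
            List<String> stems = new ArrayList<String>();
            for (Future<String> future : futures) {
                stems.add(future.get());
            }
            return stems;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because each pool thread only ever touches its own stemmer, no thread waits on
another, at the cost of one stemmer object per thread.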
The TestApp example
===================
The TestApp example program allows you to run any of the stemmers
compiled into the libstemmer library on a sample vocabulary. For
details on how to use it, run it with no command line parameters.