libstemmer_csharp
=================

This document pertains to the C# version of the libstemmer distribution,
available for download from:

https://snowballstem.org/download.html


What is Stemming?
-----------------

Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps *connection*, *connections*, *connective*,
*connected*, and *connecting* to *connect*.  So a search for *connected*
would also find documents which only have the other forms.

The stem is often a word itself, but not always: text search systems, which
are the intended field of use, do not require it to be one.  We also aim to
conflate words with the same meaning, rather than all words with a common
linguistic root (so *awe* and *awful* don't have the same stem).
Over-stemming is more problematic than under-stemming, so we tend not to stem
in cases that are hard to resolve.  If you want to always reduce words to a
root form and/or get a root form which is itself a word, then Snowball's
stemming algorithms likely aren't the right answer.


Compiling the library
=====================

To build a library::

    mcs -target:library -out:snowballstemmer.dll csharp/Snowball/*.cs csharp/Snowball/Algorithms/*.cs

And to build the example program using that library::

    mcs -target:exe -out:stemwords.exe -r:snowballstemmer.dll csharp/Stemwords/Program.cs

Using the library
=================

The stemming algorithms generally expect the input text to use composed accents
(Unicode NFC or NFKC) and to have been folded to lower case already.
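
As a minimal sketch of that preparation step, the following uses only the
standard .NET string APIs (``String.Normalize`` and ``ToLowerInvariant``) and
does not touch the stemmer classes at all.  The ``StemmerInput`` class and its
``Prepare`` method are hypothetical names chosen for illustration; they are not
part of the library.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of libstemmer: prepares a word before it is
// handed to a stemmer.
public static class StemmerInput
{
    // Compose accents (Unicode NFC), then fold to lower case in a
    // culture-independent way.
    public static string Prepare(string word)
    {
        return word.Normalize(NormalizationForm.FormC).ToLowerInvariant();
    }

    public static void Main()
    {
        // "Cafe" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
        string decomposed = "Cafe\u0301";
        string prepared = Prepare(decomposed);

        // The "e" and the combining accent are composed into one character.
        Console.WriteLine(decomposed.Length);  // prints 5
        Console.WriteLine(prepared.Length);    // prints 4
    }
}
```

``ToLowerInvariant`` is used here so that the result does not depend on the
current culture; whether simple lower-casing is sufficient for a given language
is something to check against that language's stemmer.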

There is currently no formal documentation on the use of the C# version
of the library. Additionally, its interface is not guaranteed to be
stable.

The stemmer code is re-entrant, but not thread-safe if the same stemmer object
is used concurrently in different threads.

If you want to perform stemming concurrently in different threads, we suggest
creating a new stemmer object for each thread.  The alternative is to share
stemmer objects between threads and protect access using a mutex or similar,
but that's liable to slow your program down as threads can end up waiting for
the lock.
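
One way to get a stemmer object per thread is .NET's ``ThreadLocal<T>``.  The
sketch below is only illustrative: since the C# interface is undocumented and
may change, it uses a hypothetical ``PlaceholderStemmer`` class with a toy rule
in place of a real generated stemmer.  Substitute whichever stemmer class and
method your version of the library provides.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a generated stemmer class.  It keeps mutable
// state between steps, so one instance must not be used by two threads at
// the same time.
public class PlaceholderStemmer
{
    private string current = "";

    public string Stem(string word)
    {
        current = word;
        // Toy rule for illustration only; this is not a Snowball algorithm.
        if (current.EndsWith("s", StringComparison.Ordinal))
            current = current.Substring(0, current.Length - 1);
        return current;
    }
}

public static class PerThreadStemming
{
    // Each thread lazily creates its own stemmer on first access, so no
    // locking is needed around calls to Stem.
    private static readonly ThreadLocal<PlaceholderStemmer> Stemmer =
        new ThreadLocal<PlaceholderStemmer>(() => new PlaceholderStemmer());

    public static string[] StemAll(string[] words)
    {
        var stems = new string[words.Length];
        // Every iteration writes to a distinct array slot, so the writes
        // do not conflict.
        Parallel.For(0, words.Length, i =>
        {
            stems[i] = Stemmer.Value.Stem(words[i]);
        });
        return stems;
    }

    public static void Main()
    {
        string[] stems = StemAll(new[] { "cats", "dog", "trees" });
        Console.WriteLine(string.Join(" ", stems));  // prints "cat dog tree"
    }
}
```

Compared with a shared stemmer guarded by a lock, this trades a little memory
(one stemmer per worker thread) for the absence of lock contention.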