File: libstemmer_python_README

package info (click to toggle)
snowball 3.0.1-1
  • links: PTS, VCS
  • area: main
  • in suites: forky, sid
  • size: 1,708 kB
  • sloc: ansic: 15,641; ada: 849; python: 531; cs: 485; pascal: 473; java: 473; javascript: 411; perl: 312; sh: 40; makefile: 17
file content (118 lines) | stat: -rw-r--r-- 4,806 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
Snowball stemming library collection for Python
===============================================

Python 3 (>= 3.3) is supported.  We no longer support Python 2 as the Python
developers stopped supporting it at the start of 2020.  Snowball 2.1.0 was the
last release to officially support Python 2; Snowball 3.0.0 was the last
release which had the code to support Python 2, but we were no longer testing
it.

What is Stemming?
-----------------

Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps *connection*, *connections*, *connective*,
*connected*, and *connecting* to *connect*.  So a search for *connected*
would also find documents which only have the other forms.

This stem form is often a word itself, but this is not always the case as this
is not a requirement for text search systems, which are the intended field of
use.  We also aim to conflate words with the same meaning, rather than all
words with a common linguistic root (so *awe* and *awful* don't have the same
stem), and over-stemming is more problematic than under-stemming so we tend not
to stem in cases that are hard to resolve.  If you want to always reduce words
to a root form and/or get a root form which is itself a word then Snowball's
stemming algorithms likely aren't the right answer.

How to use library
------------------

The stemming algorithms generally expect the input text to use composed accents
(Unicode NFC or NFKC) and to have been folded to lower case already.

The ``snowballstemmer`` module has two functions.

The ``snowballstemmer.algorithms`` function returns a list of available
algorithm names.

The ``snowballstemmer.stemmer`` function takes an algorithm name and returns a
``Stemmer`` object.

``Stemmer`` objects have a ``Stemmer.stemWord(word)`` method and a
``Stemmer.stemWords(word[])`` method.

.. code-block:: python

   import snowballstemmer

   stemmer = snowballstemmer.stemmer('english');
   print(stemmer.stemWords("We are the world".split()));

Generally you should create a stemmer object and reuse it rather than creating
a fresh object for each word stemmed (since there's some cost to creating and
destroying the object).

The stemmer code is re-entrant, but not thread-safe if the same stemmer object
is used concurrently in different threads.

If you want to perform stemming concurrently in different threads, we suggest
creating a new stemmer object for each thread.  The alternative is to share
stemmer objects between threads and protect access using a mutex or similar
(e.g. `threading.Lock` in Python) but that's liable to slow your program down
as threads can end up waiting for the lock.

Automatic Acceleration
----------------------

`PyStemmer <https://pypi.org/project/PyStemmer/>`_ is a wrapper module for
Snowball's ``libstemmer_c`` and should provide results 100% compatible to
**snowballstemmer**.

**PyStemmer** is faster because it wraps generated C versions of the stemmers;
**snowballstemmer** uses generate Python code and is slower but offers a pure
Python solution.

If PyStemmer is installed, ``snowballstemmer.stemmer`` returns a ``PyStemmer``
``Stemmer`` object which provides the same ``Stemmer.stemWord()`` and
``Stemmer.stemWords()`` methods.

Benchmark
~~~~~~~~~

This is a crude benchmark which measures the time for running each stemmer on
every word in its sample vocabulary (10,787,583 words over 26 languages).  It's
not a realistic test of normal use as a real application would do much more
than just stemming.  It's also skewed towards the stemmers which do more work
per word and towards those with larger sample vocabularies.

* Python 2.7 + **snowballstemmer** : 13m00s (15.0 * PyStemmer)
* Python 3.7 + **snowballstemmer** : 12m19s (14.2 * PyStemmer)
* PyPy 7.1.1 (Python 2.7.13) + **snowballstemmer** : 2m14s (2.6 * PyStemmer)
* PyPy 7.1.1 (Python 3.6.1) + **snowballstemmer** : 1m46s (2.0 * PyStemmer)
* Python 2.7 + **PyStemmer** : 52s

For reference the equivalent test for C runs in 9 seconds.

These results are for Snowball 2.0.0.  They're likely to evolve over time as
the code Snowball generates for both Python and C continues to improve (for
a much older test over a different set of stemmers using Python 2.7,
**snowballstemmer** was 30 times slower than **PyStemmer**, or 9 times slower
with **PyPy**).

The message to take away is that if you're stemming a lot of words you should
either install **PyStemmer** (which **snowballstemmer** will then automatically
use for you as described above) or use PyPy.

The TestApp example
-------------------

The ``testapp.py`` example program allows you to run any of the stemmers
on a sample vocabulary.

Usage::

   testapp.py <algorithm> "sentences ... "

.. code-block:: bash

   $ python testapp.py English "sentences... "