File: hunspell-tokenfilter.asciidoc

package info (click to toggle)
elasticsearch 1.0.3%2Bdfsg-5
links: PTS, VCS
area: main
in suites: jessie-kfreebsd
size: 37,220 kB
sloc: java: 365,486; xml: 1,258; sh: 714; python: 505; ruby: 354; perl: 134; makefile: 41
file content (115 lines) | stat: -rw-r--r-- 3,833 bytes
[[analysis-hunspell-tokenfilter]]
=== Hunspell Token Filter

Basic support for hunspell stemming. Hunspell dictionaries will be
picked up from a dedicated hunspell directory on the filesystem
(defaults to `<path.conf>/hunspell`). Each dictionary is expected to
have its own directory named after its associated locale (language).
This dictionary directory is expected to hold both the \*.aff and \*.dic
files (all of which will automatically be picked up). For example,
assuming the default hunspell location is used, the following directory
layout will define the `en_US` dictionary:

[source,js]
--------------------------------------------------
- conf
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
    |    |    |-- en_US.aff
--------------------------------------------------

The location of the hunspell directory can be configured using the
`indices.analysis.hunspell.dictionary.location` settings in
_elasticsearch.yml_.

Each dictionary can be configured with two settings:

`ignore_case`:: 
    If true, dictionary matching will be case insensitive
    (defaults to `false`)

`strict_affix_parsing`::
    Determines whether errors while reading a
    affix rules file will cause exception or simple be ignored (defaults to
    `true`)

These settings can be configured globally in `elasticsearch.yml` using

* `indices.analysis.hunspell.dictionary.ignore_case` and
* `indices.analysis.hunspell.dictionary.strict_affix_parsing`

or for specific dictionaries:

* `indices.analysis.hunspell.dictionary.en_US.ignore_case` and
* `indices.analysis.hunspell.dictionary.en_US.strict_affix_parsing`.

It is also possible to add `settings.yml` file under the dictionary
directory which holds these settings (this will override any other
settings defined in the `elasticsearch.yml`).

One can use the hunspell stem filter by configuring it the analysis
settings:

[source,js]
--------------------------------------------------
{
    "analysis" : {
        "analyzer" : {
            "en" : {
                "tokenizer" : "standard",
                "filter" : [ "lowercase", "en_US" ]
            }
        },
        "filter" : {
            "en_US" : {
                "type" : "hunspell",
                "locale" : "en_US",
                "dedup" : true
            }
        }
    }
}
--------------------------------------------------

The hunspell token filter accepts four options:

`locale`:: 
    A locale for this filter. If this is unset, the `lang` or
    `language` are used instead - so one of these has to be set.

`dictionary`:: 
    The name of a dictionary. The path to your hunspell
    dictionaries should be configured via
    `indices.analysis.hunspell.dictionary.location` before.

`dedup`:: 
    If only unique terms should be returned, this needs to be
    set to `true`. Defaults to `true`.

`recursion_level`:: 
    Configures the recursion level a
    stemmer can go into. Defaults to `2`. Some languages (for example czech)
    give better results when set to `1` or `0`, so you should test it out.

NOTE: As opposed to the snowball stemmers (which are algorithm based)
this is a dictionary lookup based stemmer and therefore the quality of
the stemming is determined by the quality of the dictionary.

[float]
==== References

Hunspell is a spell checker and morphological analyzer designed for
languages with rich morphology and complex word compounding and
character encoding.

1. Wikipedia, http://en.wikipedia.org/wiki/Hunspell

2. Source code, http://hunspell.sourceforge.net/

3. Open Office Hunspell dictionaries, http://wiki.openoffice.org/wiki/Dictionaries

4.  Mozilla Hunspell dictionaries, https://addons.mozilla.org/en-US/firefox/language-tools/

5. Chromium Hunspell dictionaries,
   http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/