[[analysis-hunspell-tokenfilter]]
=== Hunspell Token Filter
Basic support for hunspell stemming. Hunspell dictionaries will be
picked up from a dedicated hunspell directory on the filesystem
(defaults to `<path.conf>/hunspell`). Each dictionary is expected to
have its own directory named after its associated locale (language).
This dictionary directory is expected to hold both the \*.aff and \*.dic
files (all of which will automatically be picked up). For example,
assuming the default hunspell location is used, the following directory
layout will define the `en_US` dictionary:
[source,js]
--------------------------------------------------
- conf
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
    |    |    |-- en_US.aff
--------------------------------------------------
The location of the hunspell directory can be configured using the
`indices.analysis.hunspell.dictionary.location` setting in
`elasticsearch.yml`.
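
As a minimal sketch, a non-default dictionary location could be set as
follows. The path `/opt/hunspell` is a hypothetical example and not a value
taken from this documentation. With it, the `en_US` dictionary files would be
expected under `/opt/hunspell/en_US`.

[source,yaml]
--------------------------------------------------
# elasticsearch.yml
# NOTE: /opt/hunspell is a hypothetical path used only for illustration
indices.analysis.hunspell.dictionary.location: /opt/hunspell
--------------------------------------------------
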
Each dictionary can be configured with two settings:
`ignore_case`::
If true, dictionary matching will be case insensitive
(defaults to `false`)
`strict_affix_parsing`::
Determines whether errors while reading an
affix rules file will cause an exception or simply be ignored (defaults to
`true`)
These settings can be configured globally in `elasticsearch.yml` using
* `indices.analysis.hunspell.dictionary.ignore_case` and
* `indices.analysis.hunspell.dictionary.strict_affix_parsing`
or for specific dictionaries:
* `indices.analysis.hunspell.dictionary.en_US.ignore_case` and
* `indices.analysis.hunspell.dictionary.en_US.strict_affix_parsing`.
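
For illustration, the global and per-dictionary forms might be combined as
sketched here. The chosen values are assumptions made for the example. It also
assumes that a dictionary-specific setting takes precedence over the global
one for that dictionary.

[source,yaml]
--------------------------------------------------
# elasticsearch.yml
# Global defaults applied to all hunspell dictionaries
indices.analysis.hunspell.dictionary.ignore_case: true
indices.analysis.hunspell.dictionary.strict_affix_parsing: true

# Setting for the en_US dictionary only (example value)
indices.analysis.hunspell.dictionary.en_US.strict_affix_parsing: false
--------------------------------------------------
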
It is also possible to add a `settings.yml` file under the dictionary
directory which holds these settings (this will override any other
settings defined in `elasticsearch.yml`).
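
A sketch of such a file for the `en_US` dictionary is given below. It assumes
that the file uses the bare setting names (`ignore_case`,
`strict_affix_parsing`) without the `indices.analysis.hunspell.dictionary`
prefix. Verify this against your Elasticsearch version before relying on it.

[source,yaml]
--------------------------------------------------
# <path.conf>/hunspell/en_US/settings.yml
# Assumption: keys are given without the indices.analysis.hunspell.dictionary prefix
ignore_case: true
strict_affix_parsing: false
--------------------------------------------------
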
The hunspell stem filter can be used by configuring it in the analysis
settings:
[source,js]
--------------------------------------------------
{
    "analysis" : {
        "analyzer" : {
            "en" : {
                "tokenizer" : "standard",
                "filter" : [ "lowercase", "en_US" ]
            }
        },
        "filter" : {
            "en_US" : {
                "type" : "hunspell",
                "locale" : "en_US",
                "dedup" : true
            }
        }
    }
}
--------------------------------------------------
The hunspell token filter accepts four options:
`locale`::
The locale for this filter. If this is unset, the `lang` or
`language` option is used instead, so one of these three has to be set.
`dictionary`::
The name of a dictionary. The path to your hunspell
dictionaries should be configured beforehand via
`indices.analysis.hunspell.dictionary.location`.
`dedup`::
Set this to `true` if only unique terms should be returned.
Defaults to `true`.
`recursion_level`::
Configures the recursion level the
stemmer can go into. Defaults to `2`. Some languages (for example Czech)
give better results when set to `1` or `0`, so it is worth testing.
NOTE: Unlike the snowball stemmers (which are algorithm based),
this is a dictionary-lookup-based stemmer, and therefore the quality of
the stemming is determined by the quality of the dictionary.
[float]
==== References
Hunspell is a spell checker and morphological analyzer designed for
languages with rich morphology and complex word compounding and
character encoding.
1. Wikipedia, http://en.wikipedia.org/wiki/Hunspell
2. Source code, http://hunspell.sourceforge.net/
3. Open Office Hunspell dictionaries, http://wiki.openoffice.org/wiki/Dictionaries
4. Mozilla Hunspell dictionaries, https://addons.mozilla.org/en-US/firefox/language-tools/
5. Chromium Hunspell dictionaries,
http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/hunspell_dictionaries/