File: standard-tokenizer.asciidoc

[[analysis-standard-tokenizer]]
=== Standard Tokenizer

A tokenizer of type `standard` that provides grammar-based tokenization
and works well for most European-language documents. The tokenizer
implements the Unicode Text Segmentation algorithm, as specified in
http://unicode.org/reports/tr29/[Unicode Standard Annex #29].
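
For example, the tokenizer can be exercised directly through the analyze
API. The snippet below is a minimal sketch; it assumes a node running
locally on the default port and is not taken from the original text:

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9200/_analyze?tokenizer=standard' -d 'The 2 QUICK Brown-Foxes jumped.'
--------------------------------------------------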

The following settings can be set for a `standard` tokenizer type:

[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`max_token_length` |The maximum token length. If a token exceeds this
length, it is discarded. Defaults to `255`. See the example below.
|=======================================================================
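
As a sketch of how the setting might be applied, the following request
creates an index with a custom analyzer built on a length-limited
`standard` tokenizer. The names `my_index`, `my_tokenizer`, and
`my_analyzer` are placeholders chosen for illustration:

[source,js]
--------------------------------------------------
curl -XPUT 'localhost:9200/my_index' -d '{
    "settings": {
        "analysis": {
            "tokenizer": {
                "my_tokenizer": {
                    "type": "standard",
                    "max_token_length": 5
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_tokenizer"
                }
            }
        }
    }
}'
--------------------------------------------------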