File: tokenize.rst

.. -*- rst -*-

.. highlightlang:: none

.. groonga-command
.. database: commands_tokenize

``tokenize``
============

Summary
-------

``tokenize`` command tokenizes text with the specified tokenizer.
It is useful for debugging tokenization.

Syntax
------

This command takes many parameters.

``tokenizer`` and ``string`` are required parameters. Others are
optional::

  tokenize tokenizer
           string
           [normalizer=null]
           [flags=NONE]
           [mode=ADD]
           [token_filters=NONE]
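
Groonga commands also accept parameters by name. The following two
invocations should be equivalent; the named form is often easier to
read when several optional parameters are used::

  tokenize TokenBigram "Fulltext Search" NormalizerAuto
  tokenize --tokenizer TokenBigram --string "Fulltext Search" --normalizer NormalizerAuto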

Usage
-----

Here is a simple example.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/simple_example.log
.. tokenize TokenBigram "Fulltext Search"

It has only the required parameters. ``tokenizer`` is ``TokenBigram`` and
``string`` is ``"Fulltext Search"``. It returns tokens that are
generated by tokenizing ``"Fulltext Search"`` with the ``TokenBigram``
tokenizer. It doesn't normalize ``"Fulltext Search"``.
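
For reference, the return value has roughly the following shape. This
is a sketch, not verbatim output; the header values vary per run and
the token list is truncated::

  tokenize TokenBigram "Fulltext Search"
  # [
  #   [0, 1337566253.89858, 0.000355720520019531],
  #   [
  #     {"value": "Fu", "position": 0},
  #     {"value": "ul", "position": 1},
  #     {"value": "ll", "position": 2},
  #     ...
  #   ]
  # ]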

Parameters
----------

This section describes all parameters. Parameters are categorized.

Required parameters
^^^^^^^^^^^^^^^^^^^

There are two required parameters: ``tokenizer`` and ``string``.

.. _tokenize-tokenizer:

``tokenizer``
"""""""""""""

Specifies the tokenizer name. ``tokenize`` command uses the
tokenizer that has the specified name.

See :doc:`/reference/tokenizers` about built-in tokenizers.

Here is an example that uses the built-in ``TokenTrigram`` tokenizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/tokenizer_token_trigram.log
.. tokenize TokenTrigram "Fulltext Search"

If you want to use another tokenizer, you need to register an
additional tokenizer plugin with the :doc:`register` command. For
example, you can use a `KyTea <http://www.phontron.com/kytea/>`_ based
tokenizer by registering ``tokenizers/kytea``.
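
A sketch of such a setup. Note that the ``TokenKyTea`` name used below
is an assumption; check the plugin's documentation for the actual
tokenizer name it registers::

  register tokenizers/kytea
  tokenize TokenKyTea "日本語の文を解析します"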

.. _tokenize-string:

``string``
""""""""""

Specifies the string that you want to tokenize.

If you want to include spaces in ``string``, you need to quote
``string`` with single quotes (``'``) or double quotes (``"``).

Here is an example that uses spaces in ``string``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/string_include_spaces.log
.. tokenize TokenBigram "Groonga is a fast fulltext earch engine!"

Optional parameters
^^^^^^^^^^^^^^^^^^^

There are optional parameters.

.. _tokenize-normalizer:

``normalizer``
""""""""""""""

Specifies the normalizer name. ``tokenize`` command uses the
normalizer that has the specified name. A normalizer is important for
N-gram family tokenizers such as ``TokenBigram``.

The normalizer detects the character type of each character while
normalizing. N-gram family tokenizers use those character types while
tokenizing.

Here is an example that doesn't use a normalizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_none.log
.. tokenize TokenBigram "Fulltext Search"

All alphabetic characters are tokenized two characters at a time. For
example, ``Fu`` is a token.

Here is an example that uses a normalizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_use.log
.. tokenize TokenBigram "Fulltext Search" NormalizerAuto

A continuous run of alphabetic characters is tokenized as one
token. For example, ``fulltext`` is a token.

If you want to tokenize alphabetic characters two at a time even with
a normalizer, use ``TokenBigramSplitSymbolAlpha``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_use_with_split_symbol_alpha.log
.. tokenize TokenBigramSplitSymbolAlpha "Fulltext Search" NormalizerAuto

All alphabetic characters are tokenized two characters at a time, and
they are normalized to lowercase. For example, ``fu`` is a token.

.. _tokenize-flags:

``flags``
"""""""""

Specifies tokenization customization options. You can specify
multiple options separated by "``|``". For example,
``NONE|ENABLE_TOKENIZED_DELIMITER``.
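
A sketch of an invocation that combines flags with "``|``" (``NONE``
contributes nothing here and is shown only to illustrate the
separator)::

  tokenize TokenDelimit "Full￾text" NormalizerAuto NONE|ENABLE_TOKENIZED_DELIMITER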

Here are available flags.

.. list-table::
   :header-rows: 1

   * - Flag
     - Description
   * - ``NONE``
     - Just ignored.
   * - ``ENABLE_TOKENIZED_DELIMITER``
     - Enables tokenized delimiter. See :doc:`/reference/tokenizers` about
       tokenized delimiter details.

Here is an example that uses ``ENABLE_TOKENIZED_DELIMITER``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/flags_enable_tokenized_delimiter.log
.. tokenize TokenDelimit "Full￾text Sea￾crch" NormalizerAuto ENABLE_TOKENIZED_DELIMITER

``TokenDelimit`` tokenizer is one of the tokenizers that support the
tokenized delimiter. ``ENABLE_TOKENIZED_DELIMITER`` enables the
tokenized delimiter. The tokenized delimiter is a special character,
``U+FFFE``, that indicates a token border. This code point is not
assigned to any character, so it never appears in normal strings and
is therefore a good choice for this purpose. If
``ENABLE_TOKENIZED_DELIMITER`` is enabled, the target string is
treated as an already tokenized string, and the tokenizer just splits
it at each tokenized delimiter.
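
To sketch the effect of the command above: the string should be split
only at each ``U+FFFE``, so the resulting tokens would be roughly the
following (illustrative, not verbatim output)::

  {"value": "full", "position": 0}
  {"value": "text sea", "position": 1}
  {"value": "crch", "position": 2}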

.. _tokenize-mode:

``mode``
""""""""

Specifies the tokenization mode. If the mode is ``ADD``, the text is
tokenized with the rules used when adding a document. If the mode is
``GET``, the text is tokenized with the rules used when searching for
a document.

The default mode is ``ADD``.

Here is an example of the ``ADD`` mode.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/add_mode.log
.. tokenize TokenBigram "Fulltext Search" --mode ADD

The last alphabetic character becomes a one-character token.

Here is an example of the ``GET`` mode.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/get_mode.log
.. tokenize TokenBigram "Fulltext Search" --mode GET

The last token consists of two alphabetic characters.
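
To make the difference concrete, here is a sketch of the tail of each
token list for ``"Fulltext Search"`` without a normalizer
(illustrative, not verbatim output)::

  --mode ADD: ..., "rc", "ch", "h"
  --mode GET: ..., "rc", "ch"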

.. _tokenize-token-filters:

``token_filters``
"""""""""""""""""

Specifies the token filter names. ``tokenize`` command uses the token
filters that have the specified names.

See :doc:`/reference/token_filters` about token filters.
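
A sketch of token filter usage, assuming the ``token_filters/stem``
plugin is installed and registers ``TokenFilterStem``::

  register token_filters/stem
  tokenize TokenBigram "She developed engines" NormalizerAuto --token_filters TokenFilterStem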

.. _tokenize-return-value:

Return value
------------

``tokenize`` command returns the tokenized tokens. Each token has
some attributes in addition to the token itself. More attributes may
be added in the future::

  [HEADER, tokens]

``HEADER``

  See :doc:`/reference/command/output_format` about ``HEADER``.

``tokens``

  ``tokens`` is an array of tokens. Each token is an object that has the
  following attributes.

  .. list-table::
     :header-rows: 1

     * - Name
       - Description
     * - ``value``
       - Token itself.
     * - ``position``
       - The position of the token in the token sequence (the N-th token).
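
Putting it together, a sketch of a complete return value for
``tokenize TokenBigram "Fulltext Search" NormalizerAuto`` (the header
values vary per run)::

  [
    [0, 1337566253.89858, 0.000355720520019531],
    [
      {"value": "fulltext", "position": 0},
      {"value": "search", "position": 1}
    ]
  ]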

See also
--------

* :doc:`/reference/tokenizers`