File: tokenize.rst

.. -*- rst -*-

.. highlightlang:: none

.. groonga-command
.. database: commands_tokenize

``tokenize``
============

Summary
-------

``tokenize`` command tokenizes text with the specified tokenizer.
It is useful for debugging tokenization.

Syntax
------

This command takes many parameters.

``tokenizer`` and ``string`` are required parameters. Others are
optional::

  tokenize tokenizer
           string
           [normalizer=null]
           [flags=NONE]
           [mode=ADD]
           [token_filters=NONE]
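
Groonga commands also accept parameters by name. The following two
invocations should be equivalent; the named form is often easier to
read when several optional parameters are used::

  tokenize TokenBigram "Fulltext Search" NormalizerAuto
  tokenize --tokenizer TokenBigram --string "Fulltext Search" --normalizer NormalizerAuto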

Usage
-----

Here is a simple example.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/simple_example.log
.. tokenize TokenBigram "Fulltext Search"

It has only the required parameters. ``tokenizer`` is ``TokenBigram`` and
``string`` is ``"Fulltext Search"``. It returns tokens that are
generated by tokenizing ``"Fulltext Search"`` with the ``TokenBigram``
tokenizer. It doesn't normalize ``"Fulltext Search"``.
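
For reference, the return value has roughly the following shape. This
is a sketch, not verbatim output; the header values vary per run and
the token list is truncated::

  tokenize TokenBigram "Fulltext Search"
  # [
  #   [0, 1337566253.89858, 0.000355720520019531],
  #   [
  #     {"value": "Fu", "position": 0},
  #     {"value": "ul", "position": 1},
  #     {"value": "ll", "position": 2},
  #     ...
  #   ]
  # ]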

Parameters
----------

This section describes all parameters. Parameters are categorized.

Required parameters
^^^^^^^^^^^^^^^^^^^

There are two required parameters: ``tokenizer`` and ``string``.

.. _tokenize-tokenizer:

``tokenizer``
"""""""""""""

Specifies the tokenizer name. ``tokenize`` command uses the
tokenizer that has the specified name.

See :doc:`/reference/tokenizers` about built-in tokenizers.

Here is an example that uses the built-in ``TokenTrigram`` tokenizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/tokenizer_token_trigram.log
.. tokenize TokenTrigram "Fulltext Search"

If you want to use another tokenizer, you need to register an
additional tokenizer plugin with the :doc:`register` command. For
example, you can use a `KyTea <http://www.phontron.com/kytea/>`_ based
tokenizer by registering ``tokenizers/kytea``.
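
A sketch of such a setup. Note that the ``TokenKyTea`` name used below
is an assumption; check the plugin's documentation for the actual
tokenizer name it registers::

  register tokenizers/kytea
  tokenize TokenKyTea "日本語の文を解析します"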

.. _tokenize-string:

``string``
""""""""""

Specifies the string that you want to tokenize.

If you want to include spaces in ``string``, you need to quote
``string`` with single quotes (``'``) or double quotes (``"``).

Here is an example that uses spaces in ``string``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/string_include_spaces.log
.. tokenize TokenBigram "Groonga is a fast fulltext earch engine!"

Optional parameters
^^^^^^^^^^^^^^^^^^^

There are optional parameters.

.. _tokenize-normalizer:

``normalizer``
""""""""""""""

Specifies the normalizer name. ``tokenize`` command uses the
normalizer that has the specified name. A normalizer is important for
N-gram family tokenizers such as ``TokenBigram``.

The normalizer detects the character type of each character while
normalizing. N-gram family tokenizers use those character types while
tokenizing.

Here is an example that doesn't use a normalizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_none.log
.. tokenize TokenBigram "Fulltext Search"

All alphabetic characters are tokenized two characters at a time. For
example, ``Fu`` is a token.

Here is an example that uses a normalizer.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_use.log
.. tokenize TokenBigram "Fulltext Search" NormalizerAuto

A continuous run of alphabetic characters is tokenized as one
token. For example, ``fulltext`` is a token.

If you want to tokenize alphabetic characters two at a time even with
a normalizer, use ``TokenBigramSplitSymbolAlpha``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/normalizer_use_with_split_symbol_alpha.log
.. tokenize TokenBigramSplitSymbolAlpha "Fulltext Search" NormalizerAuto

All alphabetic characters are tokenized two characters at a time, and
they are normalized to lowercase. For example, ``fu`` is a token.

.. _tokenize-flags:

``flags``
"""""""""

Specifies tokenization customization options. You can specify
multiple options separated by "``|``". For example,
``NONE|ENABLE_TOKENIZED_DELIMITER``.
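
A sketch of an invocation that combines flags with "``|``" (``NONE``
contributes nothing here and is shown only to illustrate the
separator)::

  tokenize TokenDelimit "Full￾text" NormalizerAuto NONE|ENABLE_TOKENIZED_DELIMITER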

Here are available flags.

.. list-table::
   :header-rows: 1

   * - Flag
     - Description
   * - ``NONE``
     - Just ignored.
   * - ``ENABLE_TOKENIZED_DELIMITER``
     - Enables tokenized delimiter. See :doc:`/reference/tokenizers` about
       tokenized delimiter details.

Here is an example that uses ``ENABLE_TOKENIZED_DELIMITER``.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/flags_enable_tokenized_delimiter.log
.. tokenize TokenDelimit "Full￾text Sea￾crch" NormalizerAuto ENABLE_TOKENIZED_DELIMITER

``TokenDelimit`` tokenizer is one of the tokenizers that support the
tokenized delimiter. ``ENABLE_TOKENIZED_DELIMITER`` enables the
tokenized delimiter. The tokenized delimiter is a special character,
``U+FFFE``, that indicates a token border. This code point is not
assigned to any character, so it never appears in normal strings and
is therefore a good choice for this purpose. If
``ENABLE_TOKENIZED_DELIMITER`` is enabled, the target string is
treated as an already tokenized string, and the tokenizer just splits
it at each tokenized delimiter.
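
To sketch the effect of the command above: the string should be split
only at each ``U+FFFE``, so the resulting tokens would be roughly the
following (illustrative, not verbatim output)::

  {"value": "full", "position": 0}
  {"value": "text sea", "position": 1}
  {"value": "crch", "position": 2}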

.. _tokenize-mode:

``mode``
""""""""

Specifies the tokenization mode. If the mode is ``ADD``, the text is
tokenized with the rules used when adding a document. If the mode is
``GET``, the text is tokenized with the rules used when searching for
a document.

The default mode is ``ADD``.

Here is an example of the ``ADD`` mode.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/add_mode.log
.. tokenize TokenBigram "Fulltext Search" --mode ADD

The last alphabetic character becomes a one-character token.

Here is an example of the ``GET`` mode.

.. groonga-command
.. include:: ../../example/reference/commands/tokenize/get_mode.log
.. tokenize TokenBigram "Fulltext Search" --mode GET

The last token consists of two alphabetic characters.
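
To make the difference concrete, here is a sketch of the tail of each
token list for ``"Fulltext Search"`` without a normalizer
(illustrative, not verbatim output)::

  --mode ADD: ..., "rc", "ch", "h"
  --mode GET: ..., "rc", "ch"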

.. _tokenize-token-filters:

``token_filters``
"""""""""""""""""

Specifies the token filter names. ``tokenize`` command uses the token
filters that have the specified names.

See :doc:`/reference/token_filters` about token filters.
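
A sketch of token filter usage, assuming the ``token_filters/stem``
plugin is installed and registers ``TokenFilterStem``::

  register token_filters/stem
  tokenize TokenBigram "She developed engines" NormalizerAuto --token_filters TokenFilterStem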

.. _tokenize-return-value:

Return value
------------

``tokenize`` command returns the tokenized tokens. Each token has
some attributes in addition to the token itself. More attributes may
be added in the future::

  [HEADER, tokens]

``HEADER``

  See :doc:`/reference/command/output_format` about ``HEADER``.

``tokens``

  ``tokens`` is an array of tokens. Each token is an object that has the
  following attributes.

  .. list-table::
     :header-rows: 1

     * - Name
       - Description
     * - ``value``
       - Token itself.
     * - ``position``
       - The position of the token in the token sequence (the N-th token).
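
Putting it together, a sketch of a complete return value for
``tokenize TokenBigram "Fulltext Search" NormalizerAuto`` (the header
values vary per run)::

  [
    [0, 1337566253.89858, 0.000355720520019531],
    [
      {"value": "fulltext", "position": 0},
      {"value": "search", "position": 1}
    ]
  ]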

See also
--------

* :doc:`/reference/tokenizers`