.. -*- rst -*-
.. highlightlang:: none
.. groonga-command
.. database: tokenizers
.. _token-delimit:
``TokenDelimit``
================
Summary
-------
``TokenDelimit`` extracts tokens by splitting text on one or more space
characters (``U+0020``). For example, ``Hello World`` is tokenized to
``Hello`` and ``World``.
``TokenDelimit`` is suitable for tag text. You can extract ``groonga``,
``full-text-search`` and ``http`` as tags from ``groonga
full-text-search http``.
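If you want to check this splitting behavior outside of Groonga, here is a
minimal Python sketch. It only illustrates the semantics described above; it
is not how Groonga implements ``TokenDelimit``.

.. code-block:: python

   import re

   # Illustration only: split on one or more U+0020 space characters,
   # like TokenDelimit with no options.
   text = "groonga full-text-search http"
   tokens = [token for token in re.split(" +", text) if token]
   print(tokens)  # ['groonga', 'full-text-search', 'http']
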
Syntax
------
``TokenDelimit`` has optional parameters.
No options (extracts tokens by splitting on one or more space characters (``U+0020``))::

  TokenDelimit

Specify delimiters::

  TokenDelimit("delimiter", "delimiter1", "delimiter", "delimiter2", ...)

Specify a delimiter with a regular expression::

  TokenDelimit("pattern", pattern)
The ``delimiter`` option and the ``pattern`` option can't be used at the same time.
Usage
-----
Simple usage
^^^^^^^^^^^^
Here is an example of ``TokenDelimit``:
.. groonga-command
.. include:: ../../example/reference/tokenizers/token-delimit.log
.. tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
``TokenDelimit`` also accepts options.
``TokenDelimit`` has a ``delimiter`` option and a ``pattern`` option.
The ``delimiter`` option splits tokens with the specified characters.
For example, ``Hello,World`` is tokenized to ``Hello`` and ``World``
with the ``delimiter`` option as below.
.. groonga-command
.. include:: ../../example/reference/tokenizers/token-delimit-delimiter-option.log
.. tokenize 'TokenDelimit("delimiter", ",")' "Hello,World"
The ``pattern`` option splits tokens with a regular expression.
You can exclude needless spaces with the ``pattern`` option.
For example, ``This is a pen. This is an apple.`` is tokenized to ``This is a pen`` and
``This is an apple`` with the ``pattern`` option as below.
Normally, when ``This is a pen. This is an apple.`` is split by ``.``,
a needless space is left at the beginning of ``This is an apple.``.
You can exclude the needless space with the ``pattern`` option as in the example below.
.. groonga-command
.. include:: ../../example/reference/tokenizers/token-delimit-pattern-option.log
.. tokenize 'TokenDelimit("pattern", "\\.\\s*")' "This is a pen. This is an apple."
Advanced usage
^^^^^^^^^^^^^^
The ``delimiter`` option can also accept multiple delimiters.
For example, ``Hello, World`` is tokenized to ``Hello`` and ``World``
in the example below, where ``,`` and a space (``U+0020``) are the delimiters.
.. groonga-command
.. include:: ../../example/reference/tokenizers/token-delimit-delimiter-option-multiple-delimiters.log
.. tokenize 'TokenDelimit("delimiter", ",", "delimiter", " ")' "Hello, World"
You can extract tokens under complex conditions with the ``pattern`` option.
For example, ``これはペンですか!?リンゴですか?「リンゴです。」`` is tokenized to ``これはペンですか``, ``リンゴですか`` and ``「リンゴです。」`` with the ``pattern`` option as below.
.. groonga-command
.. include:: ../../example/reference/tokenizers/token-delimit-pattern-option-with-complex-pattern.log
.. tokenize 'TokenDelimit("pattern", "([。!?]+(?![)」])|[\\r\\n]+)\\s*")' "これはペンですか!?リンゴですか?「リンゴです。」"
``\\s*`` at the end of the above regular expression matches zero or more spaces after a delimiter.
``[。!?]+`` matches one or more occurrences of ``。``, ``!`` or ``?``.
For example, ``[。!?]+`` matches ``!?`` of ``これはペンですか!?``.
``(?![)」])`` is a negative lookahead.
It matches only if the next character is neither ``)`` nor ``」``.
A negative lookahead is interpreted in combination with the regular expression just before it,
so the whole expression is interpreted as ``[。!?]+(?![)」])``.
``[。!?]+(?![)」])`` matches ``。``, ``!`` or ``?`` only if it is not followed by ``)`` or ``」``.
In other words, ``[。!?]+(?![)」])`` matches ``。`` of ``これはペンですか。``, but it doesn't match ``。`` of ``「リンゴです。」``, because ``」`` comes after ``。``.
``[\\r\\n]+`` matches one or more newline characters.
In conclusion, ``([。!?]+(?![)」])|[\\r\\n]+)\\s*`` uses ``。``, ``!``, ``?`` and newline characters as delimiters. However, ``。``, ``!`` and ``?`` are not treated as delimiters when they are followed by ``)`` or ``」``.
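Python's ``re`` module supports the same negative lookahead syntax, so you can
experiment with this pattern outside of Groonga. The sketch below only
illustrates the pattern, not Groonga's implementation; the group is made
non-capturing so that ``re.split()`` doesn't return the delimiters.

.. code-block:: python

   import re

   text = "これはペンですか!?リンゴですか?「リンゴです。」"
   # Same pattern as above, with the group made non-capturing for re.split().
   pattern = r"(?:[。!?]+(?![)」])|[\r\n]+)\s*"
   tokens = [token for token in re.split(pattern, text) if token]
   print(tokens)
   # ['これはペンですか', 'リンゴですか', '「リンゴです。」']
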
Parameters
----------
Optional parameter
^^^^^^^^^^^^^^^^^^
There are two optional parameters: ``delimiter`` and ``pattern``.
``delimiter``
"""""""""""""
Splits tokens with the specified delimiters.
Each delimiter can consist of one or more characters.
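For example, a single delimiter can be a multi-character string. The Python
sketch below (illustration only, not Groonga's implementation) splits on the
two-character delimiter ``", "``.

.. code-block:: python

   # Illustration only: a delimiter may be longer than one character.
   text = "Hello, World"
   tokens = [token for token in text.split(", ") if token]
   print(tokens)  # ['Hello', 'World']
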
``pattern``
"""""""""""
Splits tokens with a regular expression.
See also
----------
* :doc:`../commands/tokenize`