.. -*- rst -*-

.. highlightlang:: none

.. groonga-command
.. database: tokenizers

.. _token-delimit:

``TokenDelimit``
================

Summary
-------

``TokenDelimit`` extracts tokens by splitting text on one or more space
characters (``U+0020``). For example, ``Hello World`` is tokenized to
``Hello`` and ``World``.

``TokenDelimit`` is suitable for tag text. You can extract ``groonga``,
``full-text-search`` and ``http`` as tags from ``groonga
full-text-search http``.
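The default splitting rule can be sketched with Python's ``re`` module. This is only an illustration of the rule, not Groonga's implementation (which is written in C):

```python
import re

def delimit(text):
    # Split on one or more U+0020 space characters and drop empty tokens,
    # mirroring TokenDelimit's default behavior.
    return [token for token in re.split(" +", text) if token]

print(delimit("Hello World"))                    # ['Hello', 'World']
print(delimit("groonga full-text-search http"))  # ['groonga', 'full-text-search', 'http']
```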

Syntax
------

``TokenDelimit`` has optional parameters.

No options (extracts tokens by splitting on one or more space characters (``U+0020``))::

  TokenDelimit

Specify delimiters::

  TokenDelimit("delimiter", "delimiter1", "delimiter", "delimiter2", ...)

Specify a delimiter with a regular expression::

  TokenDelimit("pattern", pattern)

The ``delimiter`` option and the ``pattern`` option can't be used at the same time.

Usage
-----

Simple usage
^^^^^^^^^^^^

Here is an example of ``TokenDelimit``:

.. groonga-command
.. include:: ../../example/reference/tokenizers/token-delimit.log
.. tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto

``TokenDelimit`` can also accept options.
``TokenDelimit`` has a ``delimiter`` option and a ``pattern`` option.

The ``delimiter`` option splits tokens on the specified characters.

For example, ``Hello,World`` is tokenized to ``Hello`` and ``World``
with the ``delimiter`` option as below.

.. groonga-command
.. include:: ../../example/reference/tokenizers/token-delimit-delimiter-option.log
.. tokenize 'TokenDelimit("delimiter", ",")' "Hello,World"

The ``pattern`` option splits tokens with a regular expression.
You can exclude needless characters such as trailing spaces with the
``pattern`` option.

For example, ``This is a pen. This is an apple.`` is tokenized to
``This is a pen`` and ``This is an apple`` with the ``pattern`` option as below.

Normally, when ``This is a pen. This is an apple.`` is split on ``.``,
a needless space is included at the beginning of ``This is an apple.``.

You can exclude such needless spaces with the ``pattern`` option as in
the example below.

.. groonga-command
.. include:: ../../example/reference/tokenizers/token-delimit-pattern-option.log
.. tokenize 'TokenDelimit("pattern", "\\.\\s*")' "This is a pen. This is an apple."
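The effect of the ``\\.\\s*`` pattern can be checked with Python's ``re`` module. Groonga itself uses the Onigmo regular expression engine, so this is only an approximation for illustration:

```python
import re

# Split on a literal "." followed by zero or more whitespace characters,
# mirroring the "\\.\\s*" pattern above, and drop empty tokens.
text = "This is a pen. This is an apple."
tokens = [t for t in re.split(r"\.\s*", text) if t]
print(tokens)  # ['This is a pen', 'This is an apple']
```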

Advanced usage
^^^^^^^^^^^^^^

The ``delimiter`` option can also accept multiple delimiters.

For example, ``Hello, World`` is tokenized to ``Hello`` and ``World``
in the example below, where both ``,`` and `` `` (a space) are delimiters.

.. groonga-command
.. include:: ../../example/reference/tokenizers/token-delimit-delimiter-option-multiple-delimiters.log
.. tokenize 'TokenDelimit("delimiter", ",", "delimiter", " ")' "Hello, World"
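Splitting on multiple single-character delimiters can be sketched with a character class in Python's ``re`` module (an illustration of the effect, not Groonga's implementation):

```python
import re

# Split on either "," or " " and drop empty tokens, similar in effect to
# TokenDelimit("delimiter", ",", "delimiter", " ").
text = "Hello, World"
tokens = [t for t in re.split(r"[, ]", text) if t]
print(tokens)  # ['Hello', 'World']
```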

You can extract tokens under complex conditions with the ``pattern`` option.

For example, ``これはペンですか!?リンゴですか?「リンゴです。」`` is tokenized to ``これはペンですか``, ``リンゴですか`` and ``「リンゴです。」`` with the ``pattern`` option as below.

.. groonga-command
.. include:: ../../example/reference/tokenizers/token-delimit-pattern-option-with-complex-pattern.log
.. tokenize 'TokenDelimit("pattern", "([。!?]+(?![)」])|[\\r\\n]+)\\s*")' "これはペンですか!?リンゴですか?「リンゴです。」"

``\\s*`` at the end of the above regular expression matches zero or more
spaces after a delimiter.

``[。!?]+`` matches one or more of ``。``, ``!`` and ``?``.
For example, ``[。!?]+`` matches ``!?`` of ``これはペンですか!?``.

``(?![)」])`` is a negative lookahead. It matches only when the next
character is not ``)`` or ``」``. A negative lookahead is interpreted in
combination with the regular expression just before it.

Therefore the whole expression is interpreted as ``[。!?]+(?![)」])``.

``[。!?]+(?![)」])`` matches one or more ``。``, ``!`` or ``?`` characters
that are not followed by ``)`` or ``」``.

In other words, ``[。!?]+(?![)」])`` matches ``。`` of ``これはペンですか。``,
but it doesn't match ``。`` of ``「リンゴです。」``, because there is ``」``
after ``。``.

``[\\r\\n]+`` matches one or more newline characters.

In conclusion, ``([。!?]+(?![)」])|[\\r\\n]+)\\s*`` uses ``。``, ``!``, ``?``
and newline characters as delimiters. However, ``。``, ``!`` and ``?`` are
not treated as delimiters when they are followed by ``)`` or ``」``.
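The explanation above can be verified with Python's ``re`` module. A non-capturing group ``(?:...)`` is used instead of ``(...)`` so that ``re.split`` doesn't return the delimiters themselves; Groonga uses the Onigmo engine, so Python's ``re`` is only an approximation here:

```python
import re

# The complex pattern from the example above: split on 。!? (unless followed
# by ) or 」) or on newlines, plus any trailing whitespace.
pattern = r"(?:[。!?]+(?![)」])|[\r\n]+)\s*"
text = "これはペンですか!?リンゴですか?「リンゴです。」"
tokens = [t for t in re.split(pattern, text) if t]
print(tokens)  # ['これはペンですか', 'リンゴですか', '「リンゴです。」']
```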

Parameters
----------

Optional parameter
^^^^^^^^^^^^^^^^^^

There are two optional parameters ``delimiter`` and ``pattern``.

``delimiter``
"""""""""""""

Splits tokens on the specified characters.

A delimiter can consist of one or more characters, and you can specify
multiple delimiters.

``pattern``
"""""""""""

Splits tokens with a regular expression.

See also
----------

* :doc:`../commands/tokenize`