.. -*- rst -*-

.. groonga-command
.. database: normalisers

.. _normalizer-table:

``NormalizerTable``
===================

Summary
-------

.. versionadded:: 11.0.4

``NormalizerTable`` normalizes text using a user-defined normalization table. A user-defined normalization table is an ordinary table, but it must satisfy some conditions, which are described later.

.. note::

   The normalized text depends on the contents of the user-defined
   normalization table. If you use this normalizer for a lexicon,
   you need to re-index whenever you change your user-defined
   normalization table.

Syntax
------

There are required and optional parameters.

Required parameters::

  NormalizerTable("normalized", "UserDefinedTable.normalized_column")

Optional parameters::

  NormalizerTable("normalized", "UserDefinedTable.normalized_column",
                  "target", "target_column")

  NormalizerTable("normalized", "UserDefinedTable.normalized_column",
                  "unicode_version", "13.0.0")

Usage
-----

.. _normalizer-table-simple-usage:

Simple usage
^^^^^^^^^^^^

Here is an example of ``NormalizerTable``.

``NormalizerTable`` normalizes text using a user-defined normalization table. The user-defined normalization table used here must satisfy the following conditions:

  * Table type must be ``TABLE_PAT_KEY``.

  * Table key type must be ``ShortText``.

  * Table must have at least one ``ShortText`` column.

Here are the schema and data for this example:

.. groonga-command
.. include:: ../../example/reference/normalizers/normalizer-table-simple-usage-prepare.log
.. table_create Normalizations TABLE_PAT_KEY ShortText
.. column_create Normalizations normalized COLUMN_SCALAR ShortText
.. load --table Normalizations
.. [
.. {"_key": "a", "normalized": "<A>"},
.. {"_key": "ac", "normalized": "<AC>"}
.. ]

With this user-defined normalization table, ``a`` is normalized to ``<A>`` and ``ac`` is normalized to ``<AC>``. For example:

  * ``Groonga`` -> ``Groong<A>``

  * ``hack`` -> ``h<AC>k``

Here are examples of ``NormalizerTable`` with this user-defined normalization table:

.. groonga-command
.. include:: ../../example/reference/normalizers/normalizer-table-simple-usage-output.log
.. normalize 'NormalizerTable("normalized", "Normalizations.normalized")' "Groonga"
.. normalize 'NormalizerTable("normalized", "Normalizations.normalized")' "hack"
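
The replacement shown by these examples can be sketched outside Groonga. The following Python code is only an illustration, not Groonga's implementation. It assumes that the longest matching key wins at each position, which is what the ``hack`` -> ``h<AC>k`` example suggests. The function name ``normalize_with_table`` and the dictionary standing in for the ``Normalizations`` table are hypothetical.

.. code-block:: python

   def normalize_with_table(text, table):
       """Replace each longest key of ``table`` found in ``text``."""
       max_key_length = max((len(key) for key in table), default=0)
       result = []
       i = 0
       while i < len(text):
           # Try the longest candidate first so that "ac" wins over "a".
           # This longest-match rule is an assumption taken from the examples.
           for length in range(min(max_key_length, len(text) - i), 0, -1):
               candidate = text[i:i + length]
               if candidate in table:
                   result.append(table[candidate])
                   i += length
                   break
           else:
               # No key matches at this position: keep the character as is.
               result.append(text[i])
               i += 1
       return "".join(result)


   # Hypothetical stand-in for the Normalizations table: _key -> normalized.
   normalizations = {"a": "<A>", "ac": "<AC>"}
   print(normalize_with_table("Groonga", normalizations))
   print(normalize_with_table("hack", normalizations))

Under this assumption, the sketch turns ``Groonga`` into ``Groong<A>`` and ``hack`` into ``h<AC>k``, the same results as listed above.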

.. _normalizer-table-usage-unicode-version:

Unicode version
^^^^^^^^^^^^^^^

Some internal processing, such as tokenization and highlighting, uses character types. ``NormalizerTable`` provides character types based on Unicode. You can specify the Unicode version to use with the :ref:`normalizer-table-unicode-version` option.

Here is an example to use Unicode 13.0.0:

.. groonga-command
.. include:: ../../example/reference/normalizers/normalizer-table-simple-usage-unicode-version.log
.. normalize 'NormalizerTable("normalized", "Normalizations.normalized", "unicode_version", "13.0.0")' "Groonga" WITH_TYPES

The default Unicode version is 5.0.0.
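
Why the Unicode version matters can be illustrated outside Groonga. The following minimal sketch uses Python's standard ``unicodedata`` module, not Groonga. It reports Unicode general categories, which are only an analogy for the character types that Groonga returns. It compares the Unicode 3.2.0 data that Python bundles with the data of the running interpreter.

.. code-block:: python

   import unicodedata

   # U+1F600 GRINNING FACE was added to Unicode after version 3.2.0.
   character = "\U0001F600"

   # Category according to the Unicode 3.2.0 data bundled with Python.
   old_category = unicodedata.ucd_3_2_0.category(character)
   # Category according to the Unicode data of the running interpreter.
   new_category = unicodedata.category(character)

   print(unicodedata.unidata_version, old_category, new_category)

The older data reports ``Cn`` (unassigned) and the newer data reports ``So`` (other symbol) for the same character. In the same way, a newer Unicode version can classify characters that an older version does not know.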

.. _normalizer-table-advanced-usage:

Advanced usage
^^^^^^^^^^^^^^

You can put normalization target strings in a column instead of ``_key``. In this case, you need to create an index column for that column. The index column must satisfy the following conditions:

  * Lexicon type of the index column must be ``TABLE_PAT_KEY``.

  * Lexicon key type of the index column must be ``ShortText``.

  * Lexicon of the index column must not have a tokenizer.

With this usage, the normalization table can be of any type, such as ``TABLE_NO_KEY``. This is useful when you can't control the table type. For example, this is the only usage available to PGroonga users.

Here are the schema and data for this example:

.. groonga-command
.. include:: ../../example/reference/normalizers/normalizer-table-advanced-usage-prepare.log
.. table_create ColumnNormalizations TABLE_NO_KEY
.. column_create ColumnNormalizations target_column COLUMN_SCALAR ShortText
.. column_create ColumnNormalizations normalized COLUMN_SCALAR ShortText
..
.. table_create Targets TABLE_PAT_KEY ShortText
.. column_create Targets column_normalizations_target_column \
..    COLUMN_INDEX ColumnNormalizations target_column
..
.. load --table ColumnNormalizations
.. [
.. {"target_column": "a", "normalized": "<A>"},
.. {"target_column": "ac", "normalized": "<AC>"}
.. ]

You need to use the :ref:`normalizer-table-target` option to use this user-defined normalization table. The above schema uses ``target_column`` as the column name for explanation. Generally, ``_column`` in ``target_column`` is redundant, but it's added here to make it easy to distinguish the parameter name from the parameter value.

Here are examples of ``NormalizerTable`` with this user-defined normalization table:

.. groonga-command
.. include:: ../../example/reference/normalizers/normalizer-table-simple-usage-output.log
.. normalize 'NormalizerTable("normalized", "ColumnNormalizations.normalized", "target", "target_column")' "Groonga"
.. normalize 'NormalizerTable("normalized", "ColumnNormalizations.normalized", "target", "target_column")' "hack"
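
The roles of the table, the target column, and the index can be sketched outside Groonga. The following Python code is only an illustration under stated assumptions, not Groonga's implementation. The names ``column_normalizations``, ``targets`` and ``normalize_with_index`` are hypothetical in-memory stand-ins for the schema above. Longest-match replacement and the use of the first matched record are assumptions.

.. code-block:: python

   # Stand-in for ColumnNormalizations: a keyless table, so records can
   # only be addressed by position. List positions stand in for record IDs.
   column_normalizations = [
       {"target_column": "a", "normalized": "<A>"},
       {"target_column": "ac", "normalized": "<AC>"},
   ]

   # Stand-in for the Targets lexicon and its index column. There is no
   # tokenizer, so each whole target_column value becomes one key that
   # maps to the positions of the records holding that value.
   targets = {}
   for record_id, record in enumerate(column_normalizations):
       targets.setdefault(record["target_column"], []).append(record_id)


   def normalize_with_index(text, lexicon, records):
       """Replace lexicon keys in ``text`` via the indexed records."""
       max_key_length = max((len(key) for key in lexicon), default=0)
       result = []
       i = 0
       while i < len(text):
           # Assumption: the longest key wins at each position.
           for length in range(min(max_key_length, len(text) - i), 0, -1):
               record_ids = lexicon.get(text[i:i + length])
               if record_ids:
                   # Assumption: use the first record the index points to.
                   result.append(records[record_ids[0]]["normalized"])
                   i += length
                   break
           else:
               # No key matches here: keep the character unchanged.
               result.append(text[i])
               i += 1
       return "".join(result)


   print(normalize_with_index("Groonga", targets, column_normalizations))
   print(normalize_with_index("hack", targets, column_normalizations))

The sketch shows why the index is needed: the keyless table cannot be searched by target text directly, so the lexicon provides the lookup from target text to record, and the record provides the normalized text.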

Parameters
----------

Required parameter
^^^^^^^^^^^^^^^^^^

.. _normalizer-table-normalized:

``normalized``
""""""""""""""

This option specifies the column that holds normalized texts. The normalization target texts are the values of the corresponding ``_key`` or of the column specified by :ref:`normalizer-table-target`.

The value type of the column specified for this option must be ``ShortText``, ``Text`` or ``LongText``.

If you don't use :ref:`normalizer-table-target`, the table that contains the column specified for this option must satisfy the following conditions:

  * Table type is ``TABLE_PAT_KEY``

  * Table key type is ``ShortText``

See :ref:`normalizer-table-simple-usage` for usage of this case.

Optional parameters
^^^^^^^^^^^^^^^^^^^

.. _normalizer-table-target:

``target``
""""""""""

This option specifies a column that has normalization target texts.

The value type of the column specified for this option must be ``ShortText``, ``Text`` or ``LongText``.

You must create an index column for the column specified for this option. The index column and its lexicon must satisfy the following conditions:

  * Index column can be a single column index or a multi column index.

  * Lexicon type of the index column must be ``TABLE_PAT_KEY``.

  * Lexicon key type of the index column must be ``ShortText``.

  * Lexicon of the index column must not have a tokenizer.

See :ref:`normalizer-table-advanced-usage` for usage of this case.

.. _normalizer-table-unicode-version:

``unicode_version``
"""""""""""""""""""

This option specifies the Unicode version used to determine character types.

The default Unicode version is 5.0.0.

See :ref:`normalizer-table-usage-unicode-version` for usage.

See also
--------

* :doc:`../commands/normalize`