.. -*- rst -*-

.. groonga-command
.. database: correction
.. $ groonga-suggest-create-dataset ${DB_PATH} query

Correction
==========

This section describes the following aspects of the correction
feature:

* How it works
* How to use
* How it learns

How it works
------------

The correction feature uses up to two searches to compute corrected
words:

  1. Cooccurrence search against learned data.
  2. Similar search against registered words. (optional)

Cooccurrence search
^^^^^^^^^^^^^^^^^^^

Cooccurrence search can find registered words from a user's
mistyped input. It uses sequences of user submissions that are
learned from query logs, access logs and so on.

For example, suppose there are the following user submissions:

+-------------------+---------------------------+
|  query            |    time                   |
+===================+===========================+
| serach (typo!)    | 2011-08-10T22:20:50+09:00 |
+-------------------+---------------------------+
| search (fixed!)   | 2011-08-10T22:20:52+09:00 |
+-------------------+---------------------------+

Groonga creates the following correction pair from the above
submissions:

+----------+--------------------+
|  input   |   corrected word   |
+==========+====================+
|serach    |search              |
+----------+--------------------+

Groonga treats consecutive submissions made within a minute as an
input correction by the user. Input that the user typed between the
two submissions but did not submit isn't used as learned data for
correction.

If a user inputs "serach", cooccurrence search returns "search"
because "serach" is in the input column and the corresponding
corrected word column value is "search".
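
The pairing rule above can be sketched in Python. This is only an
illustration of the idea, not Groonga's actual implementation: the
function name ``learn_correction_pairs`` and the
``CORRECTION_WINDOW`` constant are assumptions made for this example.

.. code-block:: python

   from datetime import datetime, timedelta

   # Assumed window: the text says "within a minute".
   CORRECTION_WINDOW = timedelta(minutes=1)


   def learn_correction_pairs(submissions):
       """Return (input, corrected word) pairs from consecutive submissions.

       ``submissions`` is a list of (query, ISO 8601 time) tuples that is
       already sorted by time and contains submitted queries only.
       """
       pairs = []
       for (prev_query, prev_time), (query, time) in zip(submissions, submissions[1:]):
           elapsed = datetime.fromisoformat(time) - datetime.fromisoformat(prev_time)
           # Pair two different queries only when they are close in time.
           if prev_query != query and elapsed <= CORRECTION_WINDOW:
               pairs.append((prev_query, query))
       return pairs


   submissions = [
       ("serach", "2011-08-10T22:20:50+09:00"),
       ("search", "2011-08-10T22:20:52+09:00"),
   ]
   print(learn_correction_pairs(submissions))

With the two submissions from the table, the sketch yields the single
pair ("serach", "search"). Submissions more than a minute apart yield
no pair.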

Similar search
^^^^^^^^^^^^^^

Similar search can find registered words that share one or
more tokens with the user input. The TokenBigram tokenizer is
used for tokenization because the suggest dataset schema
created by :doc:`/reference/executables/groonga-suggest-create-dataset`
uses TokenBigram as the default tokenizer.

For example, suppose there is a registered query "search engine". A
user can find "search engine" by "web search service",
"sound engine" and so on, because "search engine" and "web
search service" share the token "search", and "search
engine" and "sound engine" share the token "engine".

"search engine" is tokenized to the tokens "search" and "engine".
(Groonga's TokenBigram tokenizer doesn't split runs of consecutive
alphabetic characters or consecutive digits into two-character
tokens, in order to reduce search noise. Use the
TokenBigramSplitSymbolAlphaDigit tokenizer if every token must be
two characters long.) "web search service" is tokenized to "web",
"search" and "service". "sound engine" is tokenized to "sound" and
"engine".
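
The token overlap described above can be sketched in Python. This is
a simplified illustration, not Groonga's implementation: the helper
names ``tokenize`` and ``similar_search`` are hypothetical, and
splitting on whitespace only approximates how TokenBigram treats
runs of alphabetic characters.

.. code-block:: python

   def tokenize(text):
       """Approximate tokenization: one token per whitespace-separated word."""
       return set(text.lower().split())


   def similar_search(user_input, registered_words):
       """Return registered words that share at least one token with the input."""
       input_tokens = tokenize(user_input)
       return [word for word in registered_words if input_tokens & tokenize(word)]


   registered = ["search engine"]
   for query in ["web search service", "sound engine", "music player"]:
       print(query, "->", similar_search(query, registered))

In this sketch "web search service" and "sound engine" both match
"search engine", while "music player" shares no token with it and
matches nothing.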

How to use
----------

.. groonga-command
.. load --table event_query --each 'suggest_preparer(_id, type, item, sequence, time, pair_query)'
.. [
.. {"sequence": "1", "time": 1312950803.86057, "item": "s"},
.. {"sequence": "1", "time": 1312950803.96857, "item": "sa"},
.. {"sequence": "1", "time": 1312950804.26057, "item": "sae"},
.. {"sequence": "1", "time": 1312950804.56057, "item": "saer"},
.. {"sequence": "1", "time": 1312950804.76057, "item": "saerc"},
.. {"sequence": "1", "time": 1312950805.76057, "item": "saerch", "type": "submit"},
.. {"sequence": "1", "time": 1312950809.76057, "item": "serch"},
.. {"sequence": "1", "time": 1312950810.86057, "item": "search", "type": "submit"}
.. ]

Groonga provides the :doc:`/reference/commands/suggest` command to use
correction. The ``--types correct`` option requests corrections.

For example, here is a command to get correction results for
"saerch":

.. groonga-command
.. include:: ../../example/reference/suggest/correction/select.log
.. suggest --table item_query --column kana --types correction --frequency_threshold 1 --query saerch

How it learns
-------------

Cooccurrence search uses learned data, which is based on
query logs, access logs and so on. To create learned data,
Groonga needs the user's submitted inputs with timestamps.

For example, a user wants to search for "search" but
mistypes it as "saerch" before entering the correct query. The
user inputs the query in the following sequence:

  1. 2011-08-10T13:33:23+09:00: s
  2. 2011-08-10T13:33:23+09:00: sa
  3. 2011-08-10T13:33:24+09:00: sae
  4. 2011-08-10T13:33:24+09:00: saer
  5. 2011-08-10T13:33:24+09:00: saerc
  6. 2011-08-10T13:33:25+09:00: saerch (submit!)
  7. 2011-08-10T13:33:29+09:00: serch (correcting...)
  8. 2011-08-10T13:33:30+09:00: search (submit!)

Groonga can learn from the input sequence with the
following command::

  load --table event_query --each 'suggest_preparer(_id, type, item, sequence, time, pair_query)'
  [
  {"sequence": "1", "time": 1312950803.86057, "item": "s"},
  {"sequence": "1", "time": 1312950803.96857, "item": "sa"},
  {"sequence": "1", "time": 1312950804.26057, "item": "sae"},
  {"sequence": "1", "time": 1312950804.56057, "item": "saer"},
  {"sequence": "1", "time": 1312950804.76057, "item": "saerc"},
  {"sequence": "1", "time": 1312950805.76057, "item": "saerch", "type": "submit"},
  {"sequence": "1", "time": 1312950809.76057, "item": "serch"},
  {"sequence": "1", "time": 1312950810.86057, "item": "search", "type": "submit"}
  ]
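
The records passed to ``load`` can also be generated from timestamped
log entries. The following Python sketch shows one way to do that. It
is an illustration only: the helper name ``to_event`` is hypothetical,
and because the timeline above has one-second resolution, the
sub-second fractions in the command above are not reproduced.

.. code-block:: python

   import json
   from datetime import datetime


   def to_event(sequence, iso_time, item, submit=False):
       """Build one record with the keys used in the load command above."""
       event = {
           "sequence": sequence,
           # Seconds since the UNIX epoch, as a float.
           "time": datetime.fromisoformat(iso_time).timestamp(),
           "item": item,
       }
       if submit:
           event["type"] = "submit"
       return event


   events = [
       to_event("1", "2011-08-10T13:33:25+09:00", "saerch", submit=True),
       to_event("1", "2011-08-10T13:33:29+09:00", "serch"),
       to_event("1", "2011-08-10T13:33:30+09:00", "search", submit=True),
   ]
   print(json.dumps(events, indent=2))

The printed JSON array has the same shape as the array passed to the
``load`` command.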