[[query-dsl-multi-match-query]]
=== Multi Match Query

The `multi_match` query builds on the <<query-dsl-match-query,`match` query>>
to allow multi-field queries:

[source,js]
--------------------------------------------------
{
  "multi_match" : {
    "query":    "this is a test", <1>
    "fields": [ "subject", "message" ] <2>
  }
}
--------------------------------------------------
<1> The query string.
<2> The fields to be queried.

[float]
=== `fields` and per-field boosting

Fields can be specified with wildcards, for example:

[source,js]
--------------------------------------------------
{
  "multi_match" : {
    "query":    "Will Smith",
    "fields": [ "title", "*_name" ] <1>
  }
}
--------------------------------------------------
<1> Query the `title`, `first_name` and `last_name` fields.
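
As a rough illustration of how a pattern such as `*_name` selects fields, the
sketch below expands patterns against a list of field names using shell-style
matching from Python's standard library. The `expand_fields` helper and the
field list are hypothetical, and the pattern matching that Elasticsearch
applies to a mapping may differ in detail.

[source,python]
--------------------------------------------------
from fnmatch import fnmatchcase


def expand_fields(patterns, mapped_fields):
    """Return the mapped field names that match any pattern, in mapping order."""
    return [
        name for name in mapped_fields
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical mapping, used only for illustration.
mapped = ["title", "first_name", "last_name", "body"]
print(expand_fields(["title", "*_name"], mapped))
--------------------------------------------------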

Individual fields can be boosted with the caret (`^`) notation:

[source,js]
--------------------------------------------------
{
  "multi_match" : {
    "query" : "this is a test",
    "fields" : [ "subject^3", "message" ] <1>
  }
}
--------------------------------------------------
<1> The `subject` field is three times as important as the `message` field.
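
The caret notation can be read as a field name plus a multiplier. The sketch
below splits such a specification and applies the multiplier to a per-field
score. Both helpers are hypothetical and the model is deliberately simplified:
in a real query the boost is one of several factors that feed into the final
`_score`.

[source,python]
--------------------------------------------------
def parse_field(spec):
    """Split 'subject^3' into ('subject', 3.0); the boost defaults to 1.0."""
    name, caret, boost = spec.partition("^")
    return name, (float(boost) if caret else 1.0)


def boosted_scores(raw_scores, field_specs):
    """Multiply each field's raw score by the boost given in its spec."""
    boosts = dict(parse_field(spec) for spec in field_specs)
    return {
        field: raw_scores.get(field, 0.0) * boost
        for field, boost in boosts.items()
    }


print(parse_field("subject^3"))
print(boosted_scores({"subject": 1.0, "message": 1.0}, ["subject^3", "message"]))
--------------------------------------------------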

[float]
=== `use_dis_max`

By default, the `multi_match` query generates a `match` clause per field, then wraps them
in a `dis_max` query.  By setting `use_dis_max` to `false`, they will be wrapped in a
`bool` query instead.
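
The rewrite described above can be sketched as a small function that builds
the wrapped query as a Python dictionary. The `rewrite_multi_match` helper is
hypothetical, and placing the clauses under `should` in the `bool` form is an
assumption made by analogy with the `most_fields` example in this document.

[source,python]
--------------------------------------------------
def rewrite_multi_match(query, fields, use_dis_max=True):
    """Build one match clause per field and wrap the clauses."""
    clauses = [{"match": {field: query}} for field in fields]
    if use_dis_max:
        return {"dis_max": {"queries": clauses}}
    # Assumption: the clauses become optional ("should") clauses.
    return {"bool": {"should": clauses}}


print(rewrite_multi_match("this is a test", ["subject", "message"]))
print(rewrite_multi_match("this is a test", ["subject", "message"], use_dis_max=False))
--------------------------------------------------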

[[multi-match-types]]
[float]
=== Types of `multi_match` query

The way the `multi_match` query is executed internally depends on the `type`
parameter, which can be set to:

[horizontal]
`best_fields`::     (*default*) Finds documents which match any field, but
                    uses the  `_score` from the best field.  See <<type-best-fields>>.

`most_fields`::     Finds documents which match any field and combines
                    the `_score` from each field.  See <<type-most-fields>>.

`cross_fields`::    Treats fields with the same `analyzer` as though they
                    were one big field. Looks for each word in *any*
                    field. See <<type-cross-fields>>.

`phrase`::          Runs a `match_phrase` query on each field and uses
                    the `_score` from the best field.  See <<type-phrase>>.

`phrase_prefix`::   Runs a `match_phrase_prefix` query on each field and
                    uses the `_score` from the best field.  See <<type-phrase>>.

[[type-best-fields]]
==== `best_fields`

The `best_fields` type is most useful when you are searching for multiple
words best found in the same field. For instance ``brown fox'' in a single
field is more meaningful than ``brown'' in one field and ``fox'' in the other.

The `best_fields` type generates a <<query-dsl-match-query,`match` query>> for
each field and wraps them in a <<query-dsl-dis-max-query,`dis_max`>> query, to
find the single best matching field.  For instance, this query:

[source,js]
--------------------------------------------------
{
  "multi_match" : {
    "query":      "brown fox",
    "type":       "best_fields",
    "fields":     [ "subject", "message" ],
    "tie_breaker": 0.3
  }
}
--------------------------------------------------

would be executed as:

[source,js]
--------------------------------------------------
{
  "dis_max": {
    "queries": [
      { "match": { "subject": "brown fox" }},
      { "match": { "message": "brown fox" }}
    ],
    "tie_breaker": 0.3
  }
}
--------------------------------------------------

Normally the `best_fields` type uses the score of the *single* best matching
field, but if `tie_breaker` is specified, then it calculates the score as
follows:

  * the score from the best matching field
  * plus `tie_breaker * _score` for all other matching fields
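
The two bullet points above amount to a one-line formula. The following is a
minimal sketch of it, assuming each field has already produced a score and
that a score of `0` means the field did not match. The `dis_max_score` name is
hypothetical.

[source,python]
--------------------------------------------------
def dis_max_score(field_scores, tie_breaker=0.0):
    """Best field score plus tie_breaker times every other matching score."""
    matching = [score for score in field_scores if score > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Two matching fields scoring 2.0 and 1.0, with the tie_breaker used above.
print(dis_max_score([2.0, 1.0], tie_breaker=0.3))
# Without a tie_breaker only the best field counts.
print(dis_max_score([2.0, 1.0]))
--------------------------------------------------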

The `best_fields` type also accepts `analyzer`, `boost`, `operator`,
`minimum_should_match`, `fuzziness`, `prefix_length`, `max_expansions`,
`rewrite`, `zero_terms_query` and `cutoff_frequency`, as explained in
<<query-dsl-match-query, match query>>.

[IMPORTANT]
[[operator-min]]
.`operator` and `minimum_should_match`
==================================================

The `best_fields` and `most_fields` types are _field-centric_ -- they generate
a `match` query *per field*.  This means that the `operator` and
`minimum_should_match` parameters are applied to each field individually,
which is probably not what you want.

Take this query for example:

[source,js]
--------------------------------------------------
{
  "multi_match" : {
    "query":      "Will Smith",
    "type":       "best_fields",
    "fields":     [ "first_name", "last_name" ],
    "operator":   "and" <1>
  }
}
--------------------------------------------------
<1> All terms must be present.

This query is executed as:

      (+first_name:will +first_name:smith)
    | (+last_name:will  +last_name:smith)

In other words, *all terms* must be present *in a single field* for a document
to match.

See <<type-cross-fields>> for a better solution.

==================================================
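
To make the field-centric behaviour concrete, here is a toy model of the `and`
case. It assumes a document is a dictionary of strings and that analysis is
just lowercasing and splitting on whitespace. The `field_centric_and` helper
is hypothetical.

[source,python]
--------------------------------------------------
def field_centric_and(doc, terms, fields):
    """True if at least one single field contains every term."""
    return any(
        all(term in doc.get(field, "").lower().split() for term in terms)
        for field in fields
    )


fields = ["first_name", "last_name"]
split_doc = {"first_name": "Will", "last_name": "Smith"}
same_field_doc = {"first_name": "Will Smith", "last_name": "Jones"}

# The terms are spread across two fields, so no single field has both.
print(field_centric_and(split_doc, ["will", "smith"], fields))       # False
# Both terms sit in first_name, so that field alone satisfies "and".
print(field_centric_and(same_field_doc, ["will", "smith"], fields))  # True
--------------------------------------------------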

[[type-most-fields]]
==== `most_fields`

The `most_fields` type is most useful when querying multiple fields that
contain the same text analyzed in different ways.  For instance, the main
field may contain synonyms, stemming and terms without diacritics. A second
field may contain the original terms, and a third field might contain
shingles. By combining scores from all three fields we can match as many
documents as possible with the main field, but use the second and third fields
to push the most similar results to the top of the list.

This query:

[source,js]
--------------------------------------------------
{
  "multi_match" : {
    "query":      "quick brown fox",
    "type":       "most_fields",
    "fields":     [ "title", "title.original", "title.shingles" ]
  }
}
--------------------------------------------------

would be executed as:

[source,js]
--------------------------------------------------
{
  "bool": {
    "should": [
      { "match": { "title":          "quick brown fox" }},
      { "match": { "title.original": "quick brown fox" }},
      { "match": { "title.shingles": "quick brown fox" }}
    ]
  }
}
--------------------------------------------------

The score from each `match` clause is added together, then divided by the
number of `match` clauses.
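
Taking that sentence literally, the combined score is an average over the
clauses. The sketch below assumes a clause that does not match contributes a
score of `0`. The `most_fields_score` name is hypothetical, and the real
scoring involves further normalization that is not modelled here.

[source,python]
--------------------------------------------------
def most_fields_score(clause_scores):
    """Add the score of every match clause, then divide by the clause count."""
    if not clause_scores:
        return 0.0
    return sum(clause_scores) / len(clause_scores)


# Hypothetical scores for title, title.original and title.shingles.
print(most_fields_score([1.5, 1.0, 0.5]))
# A document matching only the main field scores lower.
print(most_fields_score([1.5, 0.0, 0.0]))
--------------------------------------------------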

The `most_fields` type also accepts `analyzer`, `boost`, `operator`,
`minimum_should_match`, `fuzziness`, `prefix_length`, `max_expansions`,
`rewrite`, `zero_terms_query` and `cutoff_frequency`, as explained in
<<query-dsl-match-query,match query>>, but *see <<operator-min>>*.

[[type-phrase]]
==== `phrase` and `phrase_prefix`

The `phrase` and `phrase_prefix` types behave just like <<type-best-fields>>,
but they use a `match_phrase` or `match_phrase_prefix` query instead of a
`match` query.

This query:

[source,js]
--------------------------------------------------
{
  "multi_match" : {
    "query":      "quick brown f",
    "type":       "phrase_prefix",
    "fields":     [ "subject", "message" ]
  }
}
--------------------------------------------------

would be executed as:

[source,js]
--------------------------------------------------
{
  "dis_max": {
    "queries": [
      { "match_phrase_prefix": { "subject": "quick brown f" }},
      { "match_phrase_prefix": { "message": "quick brown f" }}
    ]
  }
}
--------------------------------------------------

The `phrase` and `phrase_prefix` types also accept `analyzer`, `boost`, `slop`
and `zero_terms_query`, as explained in <<query-dsl-match-query>>.  The
`phrase_prefix` type additionally accepts `max_expansions`.

[[type-cross-fields]]
==== `cross_fields`

The `cross_fields` type is particularly useful with structured documents where
multiple fields *should* match.  For instance, when querying the `first_name`
and `last_name` fields for ``Will Smith'', the best match is likely to have
``Will'' in one field and ``Smith'' in the other.

****

This sounds like a job for <<type-most-fields>> but there are two problems
with that approach. The first problem is that `operator` and
`minimum_should_match` are applied per-field, instead of per-term (see
<<operator-min,explanation above>>).

The second problem is to do with relevance: the different term frequencies in
the `first_name` and `last_name` fields can produce unexpected results.

For instance, imagine we have two people: ``Will Smith'' and ``Smith Jones''.
``Smith'' as a last name is very common (and so is of low importance) but
``Smith'' as a first name is very uncommon (and so is of great importance).

If we do a search for ``Will Smith'', the ``Smith Jones'' document will
probably appear above the better matching ``Will Smith'' because the score of
`first_name:smith` has trumped the combined scores of `first_name:will` plus
`last_name:smith`.

****

One way of dealing with these types of queries is simply to index the
`first_name` and `last_name` fields into a single `full_name` field.  Of
course, this can only be done at index time.

The `cross_fields` type tries to solve these problems at query time by taking a
_term-centric_ approach.  It first analyzes the query string into individual
terms, then looks for each term in any of the fields, as though they were one
big field.

A query like:

[source,js]
--------------------------------------------------
{
  "multi_match" : {
    "query":      "Will Smith",
    "type":       "cross_fields",
    "fields":     [ "first_name", "last_name" ],
    "operator":   "and"
  }
}
--------------------------------------------------

is executed as:

    +(first_name:will  last_name:will)
    +(first_name:smith last_name:smith)

In other words, *all terms* must be present *in at least one field* for a
document to match.  (Compare this to
<<operator-min,the logic used for `best_fields` and `most_fields`>>.)
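
The term-centric logic can be sketched with a toy matcher. It assumes a
document is a dictionary of strings and that analysis is just lowercasing and
splitting on whitespace. The `term_centric_and` helper is hypothetical.

[source,python]
--------------------------------------------------
def term_centric_and(doc, terms, fields):
    """True if every term is found in at least one of the fields."""
    return all(
        any(term in doc.get(field, "").lower().split() for field in fields)
        for term in terms
    )


fields = ["first_name", "last_name"]
will_smith = {"first_name": "Will", "last_name": "Smith"}
will_jones = {"first_name": "Will", "last_name": "Jones"}

# Each term is found in some field, so the document matches.
print(term_centric_and(will_smith, ["will", "smith"], fields))  # True
# "smith" is missing from every field, so the document does not match.
print(term_centric_and(will_jones, ["will", "smith"], fields))  # False
--------------------------------------------------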

That solves one of the two problems. The problem of differing term frequencies
is solved by _blending_ the term frequencies for all fields in order to even
out the differences.  In other words, `first_name:smith` will be treated as
though it has the same weight as `last_name:smith`. (Actually,
`first_name:smith` is given a tiny advantage over `last_name:smith`, just to
make the order of results more stable.)

If you run the above query through the <<search-validate>>, it returns this
explanation:

    +blended("will",  fields: [first_name, last_name])
    +blended("smith", fields: [first_name, last_name])

The `cross_fields` type also accepts `analyzer`, `boost`, `operator`,
`minimum_should_match`, `zero_terms_query` and `cutoff_frequency`, as
explained in <<query-dsl-match-query, match query>>.

===== `cross_fields` and analysis

The `cross_fields` type can only work in term-centric mode on fields that have
the same analyzer. Fields with the same analyzer are grouped together as in
the example above.  If there are multiple groups, they are combined with a
`bool` query.

For instance, if we have a `first` and `last` field which have
the same analyzer, plus a `first.edge` and `last.edge` which
both use an `edge_ngram` analyzer, this query:

[source,js]
--------------------------------------------------
{
  "multi_match" : {
    "query":      "Jon",
    "type":       "cross_fields",
    "fields":     [
        "first", "first.edge",
        "last",  "last.edge"
    ]
  }
}
--------------------------------------------------

would be executed as:

        blended("jon", fields: [first, last])
    | (
        blended("j",   fields: [first.edge, last.edge])
        blended("jo",  fields: [first.edge, last.edge])
        blended("jon", fields: [first.edge, last.edge])
    )

In other words, `first` and `last` would be grouped together and
treated as a single field, and `first.edge` and `last.edge` would be
grouped together and treated as a single field.
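
The grouping step can be sketched as follows. The `group_by_analyzer` helper
and the field-to-analyzer table are hypothetical; in particular, `standard` is
only an assumed name for the analyzer shared by `first` and `last`.

[source,python]
--------------------------------------------------
from collections import defaultdict


def group_by_analyzer(fields, analyzers):
    """Group field names by analyzer, keeping first-seen order."""
    groups = defaultdict(list)
    for field in fields:
        groups[analyzers[field]].append(field)
    return list(groups.values())


# Assumed analyzer for each field, for illustration only.
analyzers = {
    "first": "standard",
    "last": "standard",
    "first.edge": "edge_ngram",
    "last.edge": "edge_ngram",
}
print(group_by_analyzer(["first", "first.edge", "last", "last.edge"], analyzers))
--------------------------------------------------

Each inner list corresponds to one group of fields that is treated as a single
field.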

Having multiple groups is fine, but when combined with `operator` or
`minimum_should_match`, it can suffer from the <<operator-min,same problem>>
as `most_fields` or `best_fields`.

You can easily rewrite this query yourself as two separate `cross_fields`
queries combined with a `bool` query, and apply the `minimum_should_match`
parameter to just one of them:

[source,js]
--------------------------------------------------
{
    "bool": {
        "should": [
            {
              "multi_match" : {
                "query":      "Will Smith",
                "type":       "cross_fields",
                "fields":     [ "first", "last" ],
                "minimum_should_match": "50%" <1>
              }
            },
            {
              "multi_match" : {
                "query":      "Will Smith",
                "type":       "cross_fields",
                "fields":     [ "*.edge" ]
              }
            }
        ]
    }
}
--------------------------------------------------
<1> Either `will` or `smith` must be present in either of the `first`
    or `last` fields.

You can force all fields into the same group by specifying the `analyzer`
parameter in the query.

[source,js]
--------------------------------------------------
{
  "multi_match" : {
    "query":      "Jon",
    "type":       "cross_fields",
    "analyzer":   "standard", <1>
    "fields":     [ "first", "last", "*.edge" ]
  }
}
--------------------------------------------------
<1> Use the `standard` analyzer for all fields.

which will be executed as:

    blended("jon", fields: [first, first.edge, last.edge, last])

===== `tie_breaker`

By default, each per-term `blended` query will use the best score returned by
any field in a group, then these scores are added together to give the final
score. The `tie_breaker` parameter can change the default behaviour of the
per-term `blended` queries. It accepts:

[horizontal]
`0.0`::             Take the single best score out of (eg) `first_name:will`
                    and `last_name:will` (*default*)
`1.0`::             Add together the scores for (eg) `first_name:will` and
                    `last_name:will`
`0.0 < n < 1.0`::   Take the single best score plus `tie_breaker` multiplied
                    by each of the scores from other matching fields.
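
The three cases above follow from a single formula applied per term, with the
per-term results then added together. The sketch below assumes the per-field
scores for each term are already known. Both function names and all of the
scores are hypothetical.

[source,python]
--------------------------------------------------
def blended_term_score(field_scores, tie_breaker=0.0):
    """Best field score for one term plus tie_breaker times the other scores."""
    best = max(field_scores, default=0.0)
    return best + tie_breaker * (sum(field_scores) - best)


def cross_fields_score(scores_by_term, tie_breaker=0.0):
    """Add together the blended score of every term."""
    return sum(
        blended_term_score(scores, tie_breaker)
        for scores in scores_by_term.values()
    )


# Hypothetical [first_name, last_name] scores for each term.
scores = {"will": [1.0, 0.5], "smith": [0.25, 0.75]}
print(cross_fields_score(scores, tie_breaker=0.0))  # 1.75
print(cross_fields_score(scores, tie_breaker=1.0))  # 2.5
print(cross_fields_score(scores, tie_breaker=0.5))  # 2.125
--------------------------------------------------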