File: terms-filter.asciidoc

package info (click to toggle)
elasticsearch 1.0.3%2Bdfsg-5
links: PTS, VCS
area: main
in suites: jessie-kfreebsd
size: 37,220 kB
sloc: java: 365,486; xml: 1,258; sh: 714; python: 505; ruby: 354; perl: 134; makefile: 41
file content (227 lines) | stat: -rw-r--r-- 7,321 bytes
[[query-dsl-terms-filter]]
=== Terms Filter

Filters documents that have fields that match any of the provided terms
(*not analyzed*). For example:

[source,js]
--------------------------------------------------
{
    "constant_score" : {
        "filter" : {
            "terms" : { "user" : ["kimchy", "elasticsearch"]}
        }
    }
}
--------------------------------------------------

The `terms` filter is also aliased with `in` as the filter name for
simpler usage.

[float]
==== Execution Mode

The way terms filter executes is by iterating over the terms provided
and finding matches docs (loading into a bitset) and caching it.
Sometimes, we want a different execution model that can still be
achieved by building more complex queries in the DSL, but we can support
them in the more compact model that terms filter provides.

The `execution` option now has the following options :

[horizontal]
`plain`:: 
    The default. Works as today. Iterates over all the terms,
    building a bit set matching it, and filtering. The total filter is
    cached.

`fielddata`::
    Generates a terms filters that uses the fielddata cache to
    compare terms.  This execution mode is great to use when filtering
    on a field that is already loaded into the fielddata cache from 
    faceting, sorting, or index warmers.  When filtering on
    a large number of terms, this execution can be considerably faster
    than the other modes.  The total filter is not cached unless
    explicitly configured to do so.

`bool`:: 
    Generates a term filter (which is cached) for each term, and
    wraps those in a bool filter. The bool filter itself is not cached as it
    can operate very quickly on the cached term filters.

`and`:: 
    Generates a term filter (which is cached) for each term, and
    wraps those in an and filter. The and filter itself is not cached.

`or`:: 
    Generates a term filter (which is cached) for each term, and
    wraps those in an or filter. The or filter itself is not cached.
    Generally, the `bool` execution mode should be preferred.

If you don't want the generated individual term queries to be cached,
you can use: `bool_nocache`, `and_nocache` or `or_nocache` instead, but
be aware that this will affect performance.

The "total" terms filter caching can still be explicitly controlled
using the `_cache` option. Note the default value for it depends on the
execution value.

For example:

[source,js]
--------------------------------------------------
{
    "constant_score" : {
        "filter" : {
            "terms" : {
                "user" : ["kimchy", "elasticsearch"],
                "execution" : "bool",
                "_cache": true
            }
        }
    }
}
--------------------------------------------------

[float]
==== Caching

The result of the filter is automatically cached by default. The
`_cache` can be set to `false` to turn it off.

[float]
==== Terms lookup mechanism

When it's needed to specify a `terms` filter with a lot of terms it can
be beneficial to fetch those term values from a document in an index. A
concrete example would be to filter tweets tweeted by your followers.
Potentially the amount of user ids specified in the terms filter can be
a lot. In this scenario it makes sense to use the terms filter's terms
lookup mechanism.

The terms lookup mechanism supports the following options:

[horizontal]
`index`:: 
    The index to fetch the term values from. Defaults to the
    current index.

`type`:: 
    The type to fetch the term values from.

`id`:: 
    The id of the document to fetch the term values from.

`path`:: 
    The field specified as path to fetch the actual values for the
    `terms` filter.

`routing`:: 
    A custom routing value to be used when retrieving the
    external terms doc.

`cache`:: 
    Whether to cache the filter built from the retrieved document
    (`true` - default) or whether to fetch and rebuild the filter on every
    request (`false`). See "<<query-dsl-terms-filter-lookup-caching,Terms lookup caching>>" below

The values for the `terms` filter will be fetched from a field in a
document with the specified id in the specified type and index.
Internally a get request is executed to fetch the values from the
specified path. At the moment for this feature to work the `_source`
needs to be stored.

Also, consider using an index with a single shard and fully replicated
across all nodes if the "reference" terms data is not large. The lookup
terms filter will prefer to execute the get request on a local node if
possible, reducing the need for networking.

["float",id="query-dsl-terms-filter-lookup-caching"]
==== Terms lookup caching

There is an additional cache involved, which caches the lookup of the
lookup document to the actual terms. This lookup cache is a LRU cache.
This cache has the following options:

`indices.cache.filter.terms.size`:: 
    The size of the lookup cache. The default is `10mb`.

`indices.cache.filter.terms.expire_after_access`:: 
    The time after the last read an entry should expire. Disabled by default.

`indices.cache.filter.terms.expire_after_write`: 
    The time after the last write an entry should expire. Disabled by default.

All options for the lookup of the documents cache can only be configured
via the `elasticsearch.yml` file.

When using the terms lookup the `execution` option isn't taken into
account and behaves as if the execution mode was set to `plain`.

[float]
==== Terms lookup twitter example

[source,js]
--------------------------------------------------
# index the information for user with id 2, specifically, its followers
curl -XPUT localhost:9200/users/user/2 -d '{
   "followers" : ["1", "3"]
}'

# index a tweet, from user with id 2
curl -XPUT localhost:9200/tweets/tweet/1 -d '{
   "user" : "2"
}'

# search on all the tweets that match the followers of user 2
curl -XGET localhost:9200/tweets/_search -d '{
  "query" : {
    "filtered" : {
      "filter" : {
        "terms" : {
          "user" : {
            "index" : "users",
            "type" : "user",
            "id" : "2",
            "path" : "followers"
          },
          "_cache_key" : "user_2_friends"
        }
      }
    }
  }
}'
--------------------------------------------------

The above is highly optimized, both in a sense that the list of
followers will not be fetched if the filter is already cached in the
filter cache, and with internal LRU cache for fetching external values
for the terms filter. Also, the entry in the filter cache will not hold
`all` the terms reducing the memory required for it.

`_cache_key` is recommended to be set, so its simple to clear the cache
associated with it using the clear cache API. For example:

[source,js]
--------------------------------------------------
curl -XPOST 'localhost:9200/tweets/_cache/clear?filter_keys=user_2_friends'
--------------------------------------------------

The structure of the external terms document can also include array of
inner objects, for example:

[source,js]
--------------------------------------------------
curl -XPUT localhost:9200/users/user/2 -d '{
 "followers" : [
   {
     "id" : "1"
   },
   {
     "id" : "2"
   }
 ]
}'
--------------------------------------------------

In which case, the lookup path will be `followers.id`.