[[index-modules-fielddata]]
== Field data
The field data cache is used mainly when sorting on or faceting on a
field. It loads all the field values into memory in order to provide fast
document-based access to those values. The field data cache can be
expensive to build for a field, so it's recommended to have enough memory
to allocate it, and to keep it loaded.
The amount of memory used for the field
data cache can be controlled using `indices.fielddata.cache.size`. Note:
reloading field data that does not fit into your cache is expensive
and performs poorly.
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.cache.size` |The max size of the field data cache,
e.g. `30%` of node heap space, or an absolute value, e.g. `12GB`. Defaults
to unbounded.
|`indices.fielddata.cache.expire` |A time-based setting that expires
field data after a certain time of inactivity. Defaults to `-1`. For
example, can be set to `5m` for a 5 minute expiry.
|=======================================================================
[float]
[[fielddata-circuit-breaker]]
=== Field data circuit breaker
The field data circuit breaker allows Elasticsearch to estimate the amount of
memory a field will require to be loaded into memory. It can then prevent the
field data loading by raising an exception. By default the limit is configured
to 80% of the maximum JVM heap. It can be configured with the following
parameters:
[cols="<,<",options="header",]
|=======================================================================
|Setting |Description
|`indices.fielddata.breaker.limit` |Maximum size of estimated field data
to allow loading. Defaults to 80% of the maximum JVM heap.
|`indices.fielddata.breaker.overhead` |A constant that all field data
estimations are multiplied by to determine a final estimation. Defaults to
`1.03`.
|=======================================================================
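The way the two settings interact can be sketched with some simple arithmetic. The following is a minimal illustration, not Elasticsearch code: the function `wouldTripBreaker` and its parameter names are hypothetical, and for simplicity it only looks at a single estimate and ignores any field data that has already been loaded.

[source,js]
--------------------------------------------------
// Hypothetical helper (not an Elasticsearch API): applies the overhead
// constant to an estimate and compares the result with the breaker limit.
function wouldTripBreaker(estimatedBytes, maxHeapBytes, options) {
    options = options || {};
    // Defaults taken from the table above: 80% of the heap and 1.03.
    const limitRatio = options.limitRatio !== undefined ? options.limitRatio : 0.8;
    const overhead = options.overhead !== undefined ? options.overhead : 1.03;

    const limitBytes = maxHeapBytes * limitRatio;
    const finalEstimate = estimatedBytes * overhead;
    return finalEstimate > limitBytes;
}

const GIB = 1024 * 1024 * 1024;
// 3.5 GiB * 1.03 is more than 80% of a 4 GiB heap, so this load is rejected.
const rejected = wouldTripBreaker(3.5 * GIB, 4 * GIB);
// 1 GiB * 1.03 is well below the limit, so this load is allowed.
const allowed = !wouldTripBreaker(1 * GIB, 4 * GIB);
--------------------------------------------------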
Both the `indices.fielddata.breaker.limit` and
`indices.fielddata.breaker.overhead` settings can be changed dynamically using
the cluster update settings API.
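As a rough sketch, a request body along the following lines could be sent to the cluster update settings API (for example with `PUT /_cluster/settings`) to adjust both breaker settings at runtime. The values `60%` and `1.05` are purely illustrative and are not recommendations. The choice of `transient` here is also an assumption; `persistent` may be more appropriate if the change should survive a full cluster restart.

[source,js]
--------------------------------------------------
{
    "transient": {
        "indices.fielddata.breaker.limit": "60%",
        "indices.fielddata.breaker.overhead": 1.05
    }
}
--------------------------------------------------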
[float]
[[fielddata-monitoring]]
=== Monitoring field data
You can monitor memory usage for field data as well as the field data circuit
breaker using the
<<cluster-nodes-stats,Nodes Stats API>>.
[[fielddata-formats]]
== Field data formats
The field data format controls how field data should be stored.
Depending on the field type, there might be several field data types
available. In particular, string and numeric types support the `doc_values`
format which allows for computing the field data data-structures at indexing
time and storing them on disk. Although it will make the index larger and may
be slightly slower, this implementation will be more near-realtime-friendly
and will require much less memory from the JVM than other implementations.
Here is an example of how to configure the `tag` field to use the `fst` field
data format.
[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "format": "fst"
        }
    }
}
--------------------------------------------------
It is possible to change the field data format (and the field data settings
in general) on a live index by using the update mapping API. When doing so,
field data which had already been loaded for existing segments will remain
alive while new segments will use the new field data configuration. Thanks to
the background merging process, all segments will eventually use the new
field data format.
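For illustration, a body like the one below could be sent to the update mapping API to switch an existing `tag` field to the `fst` format. The index name `my_index` and the type name `my_type` are hypothetical, and the request path (something like `PUT /my_index/_mapping/my_type`) depends on the Elasticsearch version in use, so treat this as a sketch rather than an exact recipe.

[source,js]
--------------------------------------------------
{
    "my_type": {
        "properties": {
            "tag": {
                "type": "string",
                "fielddata": {
                    "format": "fst"
                }
            }
        }
    }
}
--------------------------------------------------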
[float]
==== String field data types
`paged_bytes` (default)::
Stores unique terms sequentially in a large buffer and maps documents to
the indices of the terms they contain in this large buffer.
`fst`::
Stores terms in an FST. Slower to build than `paged_bytes` but can help lower
memory usage if many terms share common prefixes and/or suffixes.
`doc_values`::
Computes and stores field data data-structures on disk at indexing time.
Lowers memory usage but only works on non-analyzed strings (`index`: `no` or
`not_analyzed`) and doesn't support filtering.
[float]
==== Numeric field data types
`array` (default)::
Stores field values in memory using arrays.
`doc_values`::
Computes and stores field data data-structures on disk at indexing time.
Doesn't support filtering.
[float]
==== Geo point field data types
`array` (default)::
Stores latitudes and longitudes in arrays.
`doc_values`::
Computes and stores field data data-structures on disk at indexing time.
[float]
=== Fielddata loading
By default, field data is loaded lazily, i.e. the first time that a query that
requires it is executed. However, this can make the first requests that
follow a merge operation quite slow since fielddata loading is a heavy
operation.
It is possible to force field data to be loaded and cached eagerly through the
`loading` setting of fielddata:
[source,js]
--------------------------------------------------
{
    "category": {
        "type": "string",
        "fielddata": {
            "loading": "eager"
        }
    }
}
--------------------------------------------------
[float]
==== Disabling field data loading
Field data can take a lot of RAM, so it makes sense to disable field data
loading on the fields that don't need field data, for example those that are
used for full-text search only. In order to disable field data loading, just
change the field data format to `disabled`. When disabled, all requests that
try to load field data, e.g. when they include aggregations and/or sorting,
will return an error.
[source,js]
--------------------------------------------------
{
    "text": {
        "type": "string",
        "fielddata": {
            "format": "disabled"
        }
    }
}
--------------------------------------------------
The `disabled` format is supported by all field types.
[float]
[[field-data-filtering]]
=== Filtering fielddata
It is possible to control which field values are loaded into memory,
which is particularly useful for string fields. When specifying the
<<mapping-core-types,mapping>> for a field, you
can also specify a fielddata filter.
Fielddata filters can be changed using the
<<indices-put-mapping,PUT mapping>>
API. After changing the filters, use the
<<indices-clearcache,Clear Cache>> API
to reload the fielddata using the new filters.
[float]
==== Filtering by frequency
The frequency filter allows you to only load terms whose frequency falls
between a `min` and `max` value, which can be expressed as an absolute
number or as a percentage (e.g. `0.01` is `1%`). Frequency is calculated
*per segment*. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.
Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with `min_segment_size`:
[source,js]
--------------------------------------------------
{
    "tag": {
        "type": "string",
        "fielddata": {
            "filter": {
                "frequency": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------
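To make the per-segment arithmetic more concrete, here is a minimal sketch of how `min` and `max` might be applied to the terms of a single segment. It is not Elasticsearch code: the function names and the sample data are hypothetical, it assumes that values greater than `1` are absolute counts and that both bounds are inclusive, and it does not model `min_segment_size`.

[source,js]
--------------------------------------------------
// Hypothetical sketch, not Elasticsearch code: decides which terms of a
// single segment would be loaded under a frequency filter.

// Assumption: values greater than 1 are absolute document counts, anything
// else is a fraction of the docs that have a value for the field.
function resolveThreshold(value, docsWithField) {
    return value > 1 ? value : value * docsWithField;
}

// termDocCounts maps each term to the number of docs in the segment that
// contain it; docsWithField is the number of docs with a value for the field.
function filterByFrequency(termDocCounts, docsWithField, filter) {
    const min = resolveThreshold(filter.min, docsWithField);
    const max = resolveThreshold(filter.max, docsWithField);
    return Object.keys(termDocCounts).filter(function (term) {
        const count = termDocCounts[term];
        // Assumption: both bounds are inclusive.
        return count >= min && count <= max;
    });
}

// A made-up segment in which 1000 docs have a value for the field.
const segmentTerms = { elasticsearch: 400, lucene: 40, lucnee: 1 };

// min 1% (10 docs) and max 10% (100 docs): only "lucene" qualifies, since
// "elasticsearch" is too common and the misspelling "lucnee" is too rare.
const loadedTerms = filterByFrequency(segmentTerms, 1000, { min: 0.01, max: 0.1 });
--------------------------------------------------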
[float]
==== Filtering by regex
Terms can also be filtered by regular expression - only values which
match the regular expression are loaded. Note: the regular expression is
applied to each term in the field, not to the whole field value. For
instance, to only load hashtags from a tweet, we can use a regular
expression which matches terms beginning with `#`:
[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                }
            }
        }
    }
}
--------------------------------------------------
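The point that the pattern is applied per term, rather than to the whole field value, can be illustrated with a small sketch. This is not Elasticsearch code: the helper names are hypothetical, the whitespace split is only a rough stand-in for the `whitespace` analyzer, and it assumes the pattern has to match the entire term.

[source,js]
--------------------------------------------------
// Hypothetical sketch, not Elasticsearch code.

// Rough stand-in for the `whitespace` analyzer: split on runs of whitespace.
function whitespaceTerms(text) {
    return text.split(/\s+/).filter(function (term) {
        return term.length > 0;
    });
}

// Keep only the terms that the pattern matches in full (assumption: the
// pattern has to match the entire term, so it is wrapped in ^(?:...)$).
function filterByRegex(terms, pattern) {
    const regex = new RegExp('^(?:' + pattern + ')$');
    return terms.filter(function (term) {
        return regex.test(term);
    });
}

// The tweet as a whole does not start with `#`, but two of its terms do.
const tweetTerms = whitespaceTerms('Indexing with #elasticsearch and #lucene today');
const hashtags = filterByRegex(tweetTerms, '^#.*');
// hashtags is ["#elasticsearch", "#lucene"]
--------------------------------------------------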
[float]
==== Combining filters
The `frequency` and `regex` filters can be combined:
[source,js]
--------------------------------------------------
{
    "tweet": {
        "type": "string",
        "analyzer": "whitespace",
        "fielddata": {
            "filter": {
                "regex": {
                    "pattern": "^#.*"
                },
                "frequency": {
                    "min": 0.001,
                    "max": 0.1,
                    "min_segment_size": 500
                }
            }
        }
    }
}
--------------------------------------------------