.. _helpers:

Helpers
=======

Collection of simple helper functions that abstract some specifics of the raw API.

Connecting
----------

.. code-block:: python

    from elasticsearch import Elasticsearch

    client = Elasticsearch("https://.../", api_key="YOUR_API_KEY")

Bulk helpers
------------

There are several helpers for the ``bulk`` API since its requirement for
specific formatting and other considerations can make it cumbersome if used directly.

All bulk helpers accept an instance of the ``Elasticsearch`` class and an iterable
``actions`` (any iterable, can also be a generator, which is ideal in most
cases since it allows you to index large datasets without loading them
into memory).

The items in the ``actions`` iterable can take one of several formats. The most
common one is the same as returned by
:meth:`~elasticsearch.Elasticsearch.search`, for example:

.. code:: python

    {
        '_index': 'index-name',
        '_id': 42,
        '_routing': 5,
        'pipeline': 'my-ingest-pipeline',
        '_source': {
            "title": "Hello World!",
            "body": "..."
        }
    }

Alternatively, if ``_source`` is not present, the helpers will pop all metadata
fields from the doc and use the rest as the document data:

.. code:: python

    {
        "_id": 42,
        "_routing": 5,
        "title": "Hello World!",
        "body": "..."
    }
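
For illustration, here is a minimal sketch of how such metadata fields can be
separated from the document body; this is *not* the library's internal code,
and ``META_FIELDS`` is an assumed representative subset of the metadata keys:

.. code:: python

    # Illustrative sketch only -- not the helpers' actual implementation.
    # Split bulk metadata fields from the document body when "_source"
    # is absent. META_FIELDS is an assumed subset of the metadata keys.
    META_FIELDS = {"_id", "_index", "_routing", "pipeline", "_op_type"}

    def split_action(doc):
        meta = {k: v for k, v in doc.items() if k in META_FIELDS}
        source = {k: v for k, v in doc.items() if k not in META_FIELDS}
        return meta, source

    meta, source = split_action(
        {"_id": 42, "_routing": 5, "title": "Hello World!", "body": "..."}
    )
    # meta   -> {'_id': 42, '_routing': 5}
    # source -> {'title': 'Hello World!', 'body': '...'}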

The :meth:`~elasticsearch.Elasticsearch.bulk` API accepts ``index``, ``create``,
``delete``, and ``update`` actions. Use the ``_op_type`` field to specify an
action (``_op_type`` defaults to ``index``):

.. code:: python

    {
        '_op_type': 'delete',
        '_index': 'index-name',
        '_id': 42,
    }
    {
        '_op_type': 'update',
        '_index': 'index-name',
        '_id': 42,
        'doc': {'question': 'The life, universe and everything.'}
    }
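
As a sketch, a single generator can mix these action types; the index name and
ids below are made up, and the final (commented) ``bulk`` call assumes a
connected ``client``:

.. code:: python

    # Sketch: a generator yielding both delete and update actions.
    # "index-name" and the document ids are illustrative.
    def gen_actions(ids_to_delete, updates):
        for doc_id in ids_to_delete:
            yield {"_op_type": "delete", "_index": "index-name", "_id": doc_id}
        for doc_id, fields in updates.items():
            yield {
                "_op_type": "update",
                "_index": "index-name",
                "_id": doc_id,
                "doc": fields,
            }

    actions = list(gen_actions([7, 8], {42: {"question": "..."}}))
    # bulk(client, actions)  # assumes a connected Elasticsearch client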

Example:
~~~~~~~~

Let's say we have an iterable of data, for example a list of words called
``mywords``, and we want to index each word as an individual document with
the structure ``{"word": "<myword>"}``.

.. code:: python

    from elasticsearch.helpers import bulk

    def gendata():
        mywords = ['foo', 'bar', 'baz']
        for word in mywords:
            yield {
                "_index": "mywords",
                "word": word,
            }

    bulk(client, gendata())

For a more complete and complex example please take a look at
https://github.com/elastic/elasticsearch-py/blob/main/examples/bulk-ingest

The :func:`~elasticsearch.helpers.parallel_bulk` helper is a thread-based
wrapper around :func:`~elasticsearch.helpers.bulk`. It returns a generator
which must be consumed to produce results.

To see the results use:

.. code:: python

    for success, info in parallel_bulk(...):
        if not success:
            print('A document failed:', info)

If you don't care about the results, you can use ``deque`` from ``collections``:

.. code:: python

    from collections import deque

    deque(parallel_bulk(...), maxlen=0)

.. note::

    When reading raw JSON strings from a file, you can also pass them in
    directly (without decoding to dicts first). In that case, however, you lose
    the ability to specify anything (index, op_type, and even id) on a
    per-record basis; all documents will just be sent to Elasticsearch to be
    indexed as-is.
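
As a sketch of that pattern, the generator below streams already-encoded JSON
strings (for example, lines read from an NDJSON file) without decoding them;
the file name and the commented ``bulk`` call are illustrative assumptions:

.. code:: python

    # Sketch: stream pre-encoded JSON strings to the bulk helper as-is.
    # The file name and the commented bulk() call are illustrative.
    def raw_lines(lines):
        for line in lines:
            line = line.strip()
            if line:
                yield line  # passed through unchanged, no json.loads()

    docs = ['{"title": "Hello"}', "", '{"title": "World"}']
    assert list(raw_lines(docs)) == ['{"title": "Hello"}', '{"title": "World"}']
    # bulk(client, raw_lines(open("docs.ndjson")), index="mywords")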

.. py:module:: elasticsearch.helpers

.. autofunction:: streaming_bulk

.. autofunction:: parallel_bulk

.. autofunction:: bulk

Scan
----

.. autofunction:: scan

Reindex
-------

.. autofunction:: reindex