File: helpers.rst

.. _helpers:

Helpers
=======

Collection of simple helper functions that abstract some specifics of the raw API.

Connecting
----------

.. code-block:: python

    from elasticsearch import Elasticsearch
    
    client = Elasticsearch("https://.../", api_key="YOUR_API_KEY")


Bulk helpers
------------

There are several helpers for the ``bulk`` API since its specific formatting
requirements and other considerations can make it cumbersome if used directly.

All bulk helpers accept an instance of the ``Elasticsearch`` class and an iterable
of ``actions`` (any iterable; a generator is ideal in most cases since it
allows you to index large datasets without loading them into memory).

The items in the ``actions`` iterable are the documents we wish to index, and
they can take several formats. The most common one is the same as returned by
:meth:`~elasticsearch.Elasticsearch.search`, for example:

.. code:: python

    {
        '_index': 'index-name',
        '_id': 42,
        '_routing': 5,
        'pipeline': 'my-ingest-pipeline',
        '_source': {
            "title": "Hello World!",
            "body": "..."
        }
    }

Alternatively, if ``_source`` is not present, the helper will pop all metadata
fields from the doc and use the rest as the document data:

.. code:: python

    {
        "_id": 42,
        "_routing": 5,
        "title": "Hello World!",
        "body": "..."
    }

The :meth:`~elasticsearch.Elasticsearch.bulk` API accepts ``index``, ``create``,
``delete``, and ``update`` actions. Use the ``_op_type`` field to specify an
action (``_op_type`` defaults to ``index``):

.. code:: python

    {
        '_op_type': 'delete',
        '_index': 'index-name',
        '_id': 42,
    }
    {
        '_op_type': 'update',
        '_index': 'index-name',
        '_id': 42,
        'doc': {'question': 'The life, universe and everything.'}
    }
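
For instance, a minimal sketch (the index name, document IDs, and field values
below are made up for illustration) that deletes one document and partially
updates another in a single bulk call could look like this:

.. code:: python

    from elasticsearch.helpers import bulk

    # Hypothetical actions: delete document 42 and update document 43.
    actions = [
        {
            "_op_type": "delete",
            "_index": "index-name",
            "_id": 42,
        },
        {
            "_op_type": "update",
            "_index": "index-name",
            "_id": 43,
            "doc": {"question": "The life, universe and everything."},
        },
    ]

    # bulk() returns a tuple of (number of successful actions, list of errors).
    success_count, errors = bulk(client, actions)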


Example
~~~~~~~

Let's say we have an iterable of data, for example a list of words called
``mywords``, and we want to index each word as an individual document with the
structure ``{"word": "<myword>"}``.

.. code:: python

    from elasticsearch.helpers import bulk

    def gendata():
        mywords = ['foo', 'bar', 'baz']
        for word in mywords:
            yield {
                "_index": "mywords",
                "word": word,
            }

    bulk(client, gendata())


For a more complete and complex example, please take a look at
https://github.com/elastic/elasticsearch-py/blob/main/examples/bulk-ingest

The :func:`~elasticsearch.helpers.parallel_bulk` helper is a wrapper around the :func:`~elasticsearch.helpers.bulk` helper to provide threading. :func:`~elasticsearch.helpers.parallel_bulk` returns a generator which must be consumed to produce results.

To see the results, use:

.. code:: python

    from elasticsearch.helpers import parallel_bulk

    for success, info in parallel_bulk(...):
        if not success:
            print('A document failed:', info)

If you don't care about the results, you can use ``deque`` from ``collections``:

.. code:: python

    from collections import deque
    deque(parallel_bulk(...), maxlen=0)

.. note::

    When reading raw JSON strings from a file, you can also pass them in
    directly (without decoding to dicts first). In that case, however, you lose
    the ability to specify anything (index, op_type, and even id) on a
    per-record basis; all documents will just be sent to Elasticsearch to be
    indexed as-is.
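
    As a minimal sketch of this pattern (the file name ``docs.jsonl`` and the
    target index are assumptions for illustration), you could stream raw JSON
    lines straight into the ``bulk`` helper:

    .. code:: python

        from elasticsearch.helpers import bulk

        def raw_lines():
            # Each non-empty line in docs.jsonl is assumed to hold one JSON document.
            with open("docs.jsonl") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        yield line

        # The raw strings are sent as-is; the target index is given once for
        # the whole request instead of per document.
        bulk(client, raw_lines(), index="docs")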


.. py:module:: elasticsearch.helpers

.. autofunction:: streaming_bulk
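
As a rough usage sketch (reusing the hypothetical ``gendata()`` generator and
``client`` from the examples above), ``streaming_bulk`` lazily sends chunks of
actions and yields one ``(ok, result)`` tuple per action:

.. code:: python

    from elasticsearch.helpers import streaming_bulk

    for ok, result in streaming_bulk(client, gendata(), chunk_size=500):
        if not ok:
            # result contains the action type and the error returned for it
            print("Failed to index a document:", result)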

.. autofunction:: parallel_bulk

.. autofunction:: bulk


Scan
----

.. autofunction:: scan
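
As a brief usage sketch (the index name reuses the hypothetical ``mywords``
index from the bulk example), ``scan`` pages through every matching document
using the scroll API and yields them one by one:

.. code:: python

    from elasticsearch.helpers import scan

    for doc in scan(client, index="mywords", query={"query": {"match_all": {}}}):
        print(doc["_source"])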


Reindex
-------

.. autofunction:: reindex
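
As a brief usage sketch (the index names are placeholders), ``reindex`` copies
documents matching an optional query from one index to another by combining
``scan`` and ``bulk``:

.. code:: python

    from elasticsearch.helpers import reindex

    reindex(client, source_index="old-index", target_index="new-index")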