.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. highlight:: c

.. _c-stream-interface:

============================
The Arrow C stream interface
============================

The C stream interface builds on the structures defined in the
:ref:`C data interface <c-data-interface>` and combines them into a higher-level
specification so as to ease the communication of streaming data within a single
process.

Semantics
=========

An Arrow C stream exposes a streaming source of data chunks, each with the
same schema.  Chunks are obtained by calling a blocking pull-style iteration
function.

Structure definition
====================

The C stream interface is defined by a single ``struct`` definition::

   #ifndef ARROW_C_STREAM_INTERFACE
   #define ARROW_C_STREAM_INTERFACE

   struct ArrowArrayStream {
     // Callbacks providing stream functionality
     int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
     int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
     const char* (*get_last_error)(struct ArrowArrayStream*);

     // Release callback
     void (*release)(struct ArrowArrayStream*);

     // Opaque producer-specific data
     void* private_data;
   };

   #endif  // ARROW_C_STREAM_INTERFACE

.. note::
   The canonical guard ``ARROW_C_STREAM_INTERFACE`` is meant to avoid
   duplicate definitions if two projects copy the C stream interface
   definitions in their own headers, and a third-party project
   includes from these two projects.  It is therefore important that
   this guard is kept exactly as-is when these definitions are copied.

The ArrowArrayStream structure
------------------------------

The ``ArrowArrayStream`` provides the required callbacks to interact with a
streaming source of Arrow arrays.  It has the following fields:

.. c:member:: int (*ArrowArrayStream.get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out)

   *Mandatory.*  This callback allows the consumer to query the schema of
   the chunks of data in the stream.  The schema is the same for all
   data chunks.

   This callback must NOT be called on a released ``ArrowArrayStream``.

   *Return value:* 0 on success, a non-zero
   :ref:`error code <c-stream-interface-error-codes>` otherwise.

.. c:member:: int (*ArrowArrayStream.get_next)(struct ArrowArrayStream*, struct ArrowArray* out)

   *Mandatory.*  This callback allows the consumer to get the next chunk
   of data in the stream.

   This callback must NOT be called on a released ``ArrowArrayStream``.

   *Return value:* 0 on success, a non-zero
   :ref:`error code <c-stream-interface-error-codes>` otherwise.

   On success, the consumer must check whether the ``ArrowArray`` is
   marked :ref:`released <c-data-interface-released>`.  If the
   ``ArrowArray`` is released, then the end of stream has been reached.
   Otherwise, the ``ArrowArray`` contains a valid data chunk.

.. c:member:: const char* (*ArrowArrayStream.get_last_error)(struct ArrowArrayStream*)

   *Mandatory.*  This callback allows the consumer to get a textual description
   of the last error.

   This callback must ONLY be called if the last operation on the
   ``ArrowArrayStream`` returned an error.  It must NOT be called on a
   released ``ArrowArrayStream``.

   *Return value:* a pointer to a null-terminated character string (UTF-8 encoded).
   NULL can also be returned if no detailed description is available.

   The returned pointer is only guaranteed to be valid until the next call of
   one of the stream's callbacks.  The character string it points to should
   be copied to consumer-managed storage if it is intended to survive longer.

.. c:member:: void (*ArrowArrayStream.release)(struct ArrowArrayStream*)

   *Mandatory.*  A pointer to a producer-provided release callback.

.. c:member:: void* ArrowArrayStream.private_data

   *Optional.*  An opaque pointer to producer-provided private data.

   Consumers MUST NOT process this member.  The lifetime of this member
   is handled by the producer, and especially by the release callback.
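
To illustrate how a producer might fill in these fields, here is a minimal
sketch of a stream that yields a fixed number of four-element ``int32``
chunks.  The names ``CounterStream_Init`` and ``counter_*`` are hypothetical
and not part of any Arrow library; the ``ArrowSchema`` and ``ArrowArray``
layouts are those of the C data interface, written compactly::

   #include <errno.h>
   #include <stdint.h>
   #include <stdlib.h>
   #include <string.h>

   // Struct layouts from the C data interface (normally taken from its header).
   struct ArrowSchema {
     const char *format, *name, *metadata;
     int64_t flags, n_children;
     struct ArrowSchema** children;
     struct ArrowSchema* dictionary;
     void (*release)(struct ArrowSchema*);
     void* private_data;
   };

   struct ArrowArray {
     int64_t length, null_count, offset, n_buffers, n_children;
     const void** buffers;
     struct ArrowArray** children;
     struct ArrowArray* dictionary;
     void (*release)(struct ArrowArray*);
     void* private_data;
   };

   struct ArrowArrayStream {
     int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
     int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
     const char* (*get_last_error)(struct ArrowArrayStream*);
     void (*release)(struct ArrowArrayStream*);
     void* private_data;
   };

   static void counter_schema_release(struct ArrowSchema* schema) {
     schema->release = NULL;  // nothing heap-allocated, only mark as released
   }

   static int counter_get_schema(struct ArrowArrayStream* stream,
                                 struct ArrowSchema* out) {
     (void)stream;
     memset(out, 0, sizeof(*out));
     out->format = "i";  // int32 in the C data interface format strings
     out->name = "";
     out->release = counter_schema_release;
     return 0;
   }

   static void counter_chunk_release(struct ArrowArray* array) {
     free((void*)array->buffers[1]);
     free(array->buffers);
     array->release = NULL;
   }

   static int counter_get_next(struct ArrowArrayStream* stream,
                               struct ArrowArray* out) {
     int* remaining = stream->private_data;
     memset(out, 0, sizeof(*out));  // release == NULL signals end of stream
     if (*remaining == 0) return 0;

     int32_t* values = calloc(4, sizeof(*values));  // four zero-valued int32
     const void** buffers = malloc(2 * sizeof(*buffers));
     if (values == NULL || buffers == NULL) {
       free(values);
       free(buffers);
       return ENOMEM;
     }
     buffers[0] = NULL;  // no validity bitmap, since null_count == 0
     buffers[1] = values;
     out->length = 4;
     out->n_buffers = 2;
     out->buffers = buffers;
     out->release = counter_chunk_release;
     --*remaining;
     return 0;
   }

   static const char* counter_get_last_error(struct ArrowArrayStream* stream) {
     (void)stream;
     return NULL;  // allowed: no detailed description available
   }

   static void counter_release(struct ArrowArrayStream* stream) {
     free(stream->private_data);
     stream->release = NULL;
   }

   // Hypothetical entry point: fill `out` with a stream of `num_chunks` chunks.
   int CounterStream_Init(struct ArrowArrayStream* out, int num_chunks) {
     int* remaining = malloc(sizeof(*remaining));
     if (remaining == NULL) return ENOMEM;
     *remaining = num_chunks;
     out->get_schema = counter_get_schema;
     out->get_next = counter_get_next;
     out->get_last_error = counter_get_last_error;
     out->release = counter_release;
     out->private_data = remaining;
     return 0;
   }

In this sketch each chunk owns its buffers and the stream owns only its
counter, so chunks, schema and stream can each be released on their own.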


.. _c-stream-interface-error-codes:

Error codes
-----------

The ``get_schema`` and ``get_next`` callbacks may return an error in the form
of a non-zero integer code.  Such error codes should be interpreted like
``errno`` numbers (as defined by the local platform).  Note that the symbolic
forms of these constants are stable from platform to platform, but their numeric
values are platform-specific.

In particular, it is recommended to recognize the following values:

* ``EINVAL``: for a parameter or input validation error
* ``ENOMEM``: for a memory allocation failure (out of memory)
* ``EIO``: for a generic input/output error

.. seealso::
   `Standard POSIX error codes <https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/errno.h.html>`__.

   `Error codes recognized by the Windows C runtime library
   <https://docs.microsoft.com/en-us/cpp/c-runtime-library/errno-doserrno-sys-errlist-and-sys-nerr>`__.
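
As a rough sketch of how a consumer might turn such an error into a message,
the helper below prefers the producer's description and falls back to
``strerror`` when ``get_last_error`` returns NULL.  The text is copied
immediately, since the returned pointer is only valid until the next callback
call.  ``describe_stream_error`` and the ``FailingStream_Init`` stub producer
are hypothetical names used only for this illustration::

   #include <errno.h>
   #include <stdio.h>
   #include <string.h>

   struct ArrowSchema;  // defined by the C data interface
   struct ArrowArray;

   struct ArrowArrayStream {
     int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
     int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
     const char* (*get_last_error)(struct ArrowArrayStream*);
     void (*release)(struct ArrowArrayStream*);
     void* private_data;
   };

   // Copy a description of the last failure into consumer-managed storage.
   // Must be called right after the callback that returned `errcode`.
   static void describe_stream_error(struct ArrowArrayStream* stream, int errcode,
                                     char* buf, size_t bufsize) {
     const char* detail = stream->get_last_error(stream);
     if (detail == NULL) {
       detail = strerror(errcode);  // errno-style fallback
     }
     snprintf(buf, bufsize, "%s", detail);
   }

   // Stub producer: get_schema fails without a description, get_next fails
   // with one.  private_data holds the current description (or NULL).
   static int failing_get_schema(struct ArrowArrayStream* stream,
                                 struct ArrowSchema* out) {
     (void)out;
     stream->private_data = NULL;
     return EINVAL;
   }

   static int failing_get_next(struct ArrowArrayStream* stream,
                               struct ArrowArray* out) {
     (void)out;
     stream->private_data = (void*)"disk read failed";
     return EIO;
   }

   static const char* failing_get_last_error(struct ArrowArrayStream* stream) {
     return (const char*)stream->private_data;
   }

   static void failing_release(struct ArrowArrayStream* stream) {
     stream->release = NULL;
   }

   void FailingStream_Init(struct ArrowArrayStream* out) {
     out->get_schema = failing_get_schema;
     out->get_next = failing_get_next;
     out->get_last_error = failing_get_last_error;
     out->release = failing_release;
     out->private_data = NULL;
   }

Comparing the returned code against symbolic constants such as ``EIO`` rather
than literal numbers keeps the consumer portable across platforms.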

Result lifetimes
----------------

The data returned by the ``get_schema`` and ``get_next`` callbacks must be
released independently.  Their lifetimes are not tied to that of the
``ArrowArrayStream``.
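
The sketch below illustrates this independence under stated assumptions: a
hypothetical producer (``DemoStream_Init``) hands out a schema that owns its
own format string, and a hypothetical consumer function releases the stream
first and only then reads and releases the schema::

   #include <errno.h>
   #include <stdint.h>
   #include <stdlib.h>
   #include <string.h>

   // Struct layouts from the C data interface (normally taken from its header).
   struct ArrowSchema {
     const char *format, *name, *metadata;
     int64_t flags, n_children;
     struct ArrowSchema** children;
     struct ArrowSchema* dictionary;
     void (*release)(struct ArrowSchema*);
     void* private_data;
   };

   struct ArrowArray {
     int64_t length, null_count, offset, n_buffers, n_children;
     const void** buffers;
     struct ArrowArray** children;
     struct ArrowArray* dictionary;
     void (*release)(struct ArrowArray*);
     void* private_data;
   };

   struct ArrowArrayStream {
     int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
     int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
     const char* (*get_last_error)(struct ArrowArrayStream*);
     void (*release)(struct ArrowArrayStream*);
     void* private_data;
   };

   // The schema owns a heap copy of its format string, so it does not point
   // into memory owned by the stream.
   static void demo_schema_release(struct ArrowSchema* schema) {
     free(schema->private_data);
     schema->release = NULL;
   }

   static int demo_get_schema(struct ArrowArrayStream* stream,
                              struct ArrowSchema* out) {
     (void)stream;
     char* format = malloc(2);
     if (format == NULL) return ENOMEM;
     memcpy(format, "l", 2);  // "l" is int64, copied with its terminator
     memset(out, 0, sizeof(*out));
     out->format = format;
     out->name = "";
     out->release = demo_schema_release;
     out->private_data = format;
     return 0;
   }

   static int demo_get_next(struct ArrowArrayStream* stream,
                            struct ArrowArray* out) {
     (void)stream;
     memset(out, 0, sizeof(*out));  // always at end of stream
     return 0;
   }

   static const char* demo_get_last_error(struct ArrowArrayStream* stream) {
     (void)stream;
     return NULL;
   }

   static void demo_stream_release(struct ArrowArrayStream* stream) {
     stream->release = NULL;
   }

   void DemoStream_Init(struct ArrowArrayStream* out) {
     out->get_schema = demo_get_schema;
     out->get_next = demo_get_next;
     out->get_last_error = demo_get_last_error;
     out->release = demo_stream_release;
     out->private_data = NULL;
   }

   // Consumer side: returns 1 if the schema still reads as int64 after the
   // stream has been released, 0 otherwise (the stream is left untouched if
   // get_schema fails).
   int format_is_int64_after_release(struct ArrowArrayStream* stream) {
     struct ArrowSchema schema;
     if (stream->get_schema(stream, &schema) != 0) return 0;
     stream->release(stream);                         // the stream is gone
     int is_int64 = strcmp(schema.format, "l") == 0;  // the schema is not
     schema.release(&schema);                         // released on its own
     return is_int64;
   }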

Stream lifetime
---------------

The lifetime of the C stream is managed using a release callback, which is used
in the same way as in the :ref:`C data interface <c-data-interface-released>`.
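
The helpers below sketch what that can look like on the consumer side,
assuming the C data interface conventions carry over unchanged: a NULL
``release`` member marks a released structure, and a structure is moved by a
bitwise copy followed by marking the source released without calling its
callback.  The ``stream_*`` names and the ``counting_release`` stub are
hypothetical::

   #include <stddef.h>
   #include <string.h>

   struct ArrowSchema;  // defined by the C data interface
   struct ArrowArray;

   struct ArrowArrayStream {
     int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
     int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
     const char* (*get_last_error)(struct ArrowArrayStream*);
     void (*release)(struct ArrowArrayStream*);
     void* private_data;
   };

   static int stream_is_released(const struct ArrowArrayStream* stream) {
     return stream->release == NULL;
   }

   // Release the stream unless it is already marked released.
   static void stream_release_if_needed(struct ArrowArrayStream* stream) {
     if (stream->release != NULL) {
       stream->release(stream);  // the callback sets release to NULL itself
     }
   }

   // Transfer ownership from `src` to `dest`; `src` ends up marked released
   // and its callback is not invoked.
   static void stream_move(struct ArrowArrayStream* src,
                           struct ArrowArrayStream* dest) {
     memcpy(dest, src, sizeof(*dest));
     src->release = NULL;
   }

   // Stub release callback for experimentation: counts its invocations in
   // the int that private_data points to, then marks the stream released.
   static void counting_release(struct ArrowArrayStream* stream) {
     ++*(int*)stream->private_data;
     stream->release = NULL;
   }

Routing every release through a helper like ``stream_release_if_needed`` makes
it harder to call the producer's callback twice after a move.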

Thread safety
-------------

The stream source is not assumed to be thread-safe.  Consumers wanting to
call ``get_next`` from several threads should ensure those calls are
serialized.
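
One possible way to serialize those calls is a consumer-side wrapper that
guards the stream with a mutex.  The sketch below assumes POSIX threads are
available; ``LockedStream`` and the ``locked_stream_*`` and ``stub_*`` names
are hypothetical.  The error text is copied while the lock is still held,
because a ``get_next`` call from another thread would invalidate the pointer
returned by ``get_last_error``::

   #include <errno.h>
   #include <pthread.h>
   #include <stdio.h>

   struct ArrowSchema;  // defined by the C data interface
   struct ArrowArray;

   struct ArrowArrayStream {
     int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
     int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
     const char* (*get_last_error)(struct ArrowArrayStream*);
     void (*release)(struct ArrowArrayStream*);
     void* private_data;
   };

   // Consumer-side wrapper pairing a stream with a mutex.
   struct LockedStream {
     struct ArrowArrayStream* stream;
     pthread_mutex_t mutex;
   };

   static int locked_stream_init(struct LockedStream* ls,
                                 struct ArrowArrayStream* stream) {
     ls->stream = stream;
     return pthread_mutex_init(&ls->mutex, NULL);
   }

   // Serialized get_next.  On failure, the producer's description (or an
   // empty string if there is none) is copied into errbuf under the lock.
   static int locked_stream_get_next(struct LockedStream* ls,
                                     struct ArrowArray* out,
                                     char* errbuf, size_t errbuf_size) {
     pthread_mutex_lock(&ls->mutex);
     int errcode = ls->stream->get_next(ls->stream, out);
     if (errcode != 0 && errbuf != NULL && errbuf_size > 0) {
       const char* detail = ls->stream->get_last_error(ls->stream);
       snprintf(errbuf, errbuf_size, "%s", detail != NULL ? detail : "");
     }
     pthread_mutex_unlock(&ls->mutex);
     return errcode;
   }

   static void locked_stream_destroy(struct LockedStream* ls) {
     pthread_mutex_destroy(&ls->mutex);
   }

   // Stub producer for trying the wrapper: the first two calls succeed,
   // later calls fail with EIO.  It never fills `out`, which a real
   // producer must do.  private_data points to a call counter.
   static int stub_get_next(struct ArrowArrayStream* stream,
                            struct ArrowArray* out) {
     (void)out;
     int* calls = stream->private_data;
     return ++*calls > 2 ? EIO : 0;
   }

   static const char* stub_get_last_error(struct ArrowArrayStream* stream) {
     (void)stream;
     return "stub failure";
   }

Each thread then calls ``locked_stream_get_next`` with its own ``ArrowArray``
and error buffer; the order in which threads receive chunks is not determined
by the wrapper.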

C consumer example
==================

Let's say a particular database provides the following C API to execute
a SQL query and return the result set as an Arrow C stream::

   void MyDB_Query(const char* query, struct ArrowArrayStream* result_set);

Then a consumer could use the following code to iterate over the results::

   #include <stdint.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   static void handle_error(int errcode, struct ArrowArrayStream* stream) {
      // Print stream error, falling back to the errno description
      const char* errdesc = stream->get_last_error(stream);
      if (errdesc != NULL) {
         fputs(errdesc, stderr);
      } else {
         fputs(strerror(errcode), stderr);
      }
      // Release stream and abort
      stream->release(stream);
      exit(1);
   }

   void run_query() {
      struct ArrowArrayStream stream;
      struct ArrowSchema schema;
      struct ArrowArray chunk;
      int errcode;

      MyDB_Query("SELECT * FROM my_table", &stream);

      // Query result set schema
      errcode = stream.get_schema(&stream, &schema);
      if (errcode != 0) {
         handle_error(errcode, &stream);
      }

      int64_t num_rows = 0;

      // Iterate over results: loop until error or end of stream
      while ((errcode = stream.get_next(&stream, &chunk)) == 0 &&
             chunk.release != NULL) {
         // Do something with chunk...
         fprintf(stderr, "Result chunk: got %lld rows\n", (long long)chunk.length);
         num_rows += chunk.length;

         // Release chunk
         chunk.release(&chunk);
      }

      // Was it an error?
      if (errcode != 0) {
         handle_error(errcode, &stream);
      }

      fprintf(stderr, "Result stream ended: total %lld rows\n", (long long)num_rows);

      // Release schema and stream
      schema.release(&schema);
      stream.release(&stream);
   }