.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. default-domain:: cpp
.. highlight:: cpp
.. cpp:namespace:: arrow::engine::substrait

.. _acero-substrait:

==========================
Using Acero with Substrait
==========================

To use Acero you need to create an execution plan, a model that describes the
computation you want to apply to your data.  Acero has its own internal
representation for execution plans, but most users should not interact with it
directly because doing so couples their code to Acero.

`Substrait <https://substrait.io>`_ is an open standard for execution plans.
Acero implements the Substrait "consumer" interface.  This means that Acero can
accept a Substrait plan and fulfill the plan, loading the requested data and
applying the desired computation.  By using Substrait plans, users can easily
switch to a different execution engine at a later time.

Substrait Conformance
---------------------

Substrait defines a broad set of operators and functions for many different
situations, and it is unlikely that Acero will ever support all of them.  To
help you understand what is available, the following sections list the features
currently implemented in Acero and any caveats that apply.

Plans
^^^^^

* A plan should have a single top-level relation.
* The consumer is currently based on version 0.20.0 of Substrait.
  Features added in later versions are not supported.
* Due to a breaking change in 0.20.0, any Substrait plan older than 0.20.0
  will be rejected.
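
The version rule above can be expressed as a small pre-flight check that a plan
producer runs before handing a plan to Acero.  The following is a minimal
sketch only: ``SubstraitVersion`` and ``IsVersionAccepted`` are hypothetical
names introduced for illustration, they are not part of the Arrow API, and
Acero's own validation may differ in detail.

.. code-block:: cpp

   #include <cassert>
   #include <tuple>

   // Hypothetical value type for illustration; not part of the Arrow API.
   struct SubstraitVersion {
     int major_number;
     int minor_number;
     int patch_number;
   };

   // Oldest plan version the consumer accepts, taken from the list above.
   constexpr SubstraitVersion kMinimumVersion{0, 20, 0};

   // Returns true if a plan produced at version `v` satisfies the minimum
   // version rule.  Newer plans pass this check but may still use features
   // that the consumer does not support.
   inline bool IsVersionAccepted(const SubstraitVersion& v) {
     assert(v.major_number >= 0 && v.minor_number >= 0 && v.patch_number >= 0);
     return std::make_tuple(v.major_number, v.minor_number, v.patch_number) >=
            std::make_tuple(kMinimumVersion.major_number,
                            kMinimumVersion.minor_number,
                            kMinimumVersion.patch_number);
   }

A plan that passes this check can still be rejected if it relies on features
introduced after 0.20.0, so the check is a necessary condition rather than a
guarantee.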

Extensions
^^^^^^^^^^

* If a plan contains any extension type variations it will be rejected.
* Advanced extensions can be provided by supplying a custom implementation of
  :class:`arrow::engine::ExtensionProvider`.

Relations (in general)
^^^^^^^^^^^^^^^^^^^^^^

* Any relation not explicitly listed below will not be supported
  and will cause the plan to be rejected.

Read Relations
^^^^^^^^^^^^^^

* The ``projection`` property is not supported and plans containing this
  property will be rejected.
* The ``VirtualTable`` and ``ExtensionTable`` read types are not supported.
  Plans containing these types will be rejected.
* Only the Parquet and Arrow file formats are currently supported.
* All URIs must use the ``file`` scheme.
* ``partition_index``, ``start``, and ``length`` are not supported.  Plans containing
  non-default values for these properties will be rejected.
* The Substrait spec requires that a ``filter`` be completely satisfied by a read
  relation.  However, Acero only uses a read filter for pushdown projection and
  it may not be fully satisfied.  Users should generally attach an additional
  filter relation with the same filter expression after the read relation.
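
Some of the read relation caveats above can be checked by a producer before a
plan is sent to Acero.  The sketch below is illustrative only: ``FileEntry`` is
a hypothetical, simplified stand-in for one file entry of a read relation, the
helper names are invented here, and the sketch assumes that zero is the default
value of ``partition_index``, ``start``, and ``length``.

.. code-block:: cpp

   #include <cassert>
   #include <cstdint>
   #include <string>

   // Hypothetical, simplified stand-in for one file entry of a read relation.
   // The field names follow the properties named above; this is not an Arrow
   // or Substrait type.
   struct FileEntry {
     std::string uri;
     std::uint64_t partition_index;
     std::uint64_t start;
     std::uint64_t length;
   };

   // True if the URI starts with the "file" scheme.  The comparison is
   // deliberately strict and only matches the lowercase spelling.
   inline bool UsesFileScheme(const std::string& uri) {
     return uri.compare(0, 5, "file:") == 0;
   }

   // True if the entry avoids the unsupported properties listed above.
   // Zero is assumed to be the default value of the numeric properties.
   inline bool PassesReadChecks(const FileEntry& entry) {
     return UsesFileScheme(entry.uri) && entry.partition_index == 0 &&
            entry.start == 0 && entry.length == 0;
   }

   // Usage sketch with made-up paths.
   inline void CheckExampleEntries() {
     assert(PassesReadChecks(FileEntry{"file:///tmp/data.parquet", 0, 0, 0}));
     assert(!PassesReadChecks(FileEntry{"s3://bucket/data.parquet", 0, 0, 0}));
   }

The check does not cover the file format or the read type, which a producer
would need to verify separately.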

Filter Relations
^^^^^^^^^^^^^^^^

* No known caveats

Project Relations
^^^^^^^^^^^^^^^^^

* No known caveats

Join Relations
^^^^^^^^^^^^^^

* The join type ``JOIN_TYPE_SINGLE`` is not supported and plans containing it
  will be rejected.
* The join expression must be a call to either the ``equal`` or ``is_not_distinct_from``
  functions.  Both arguments to the call must be direct references.  Only a single
  join key is supported.
* The ``post_join_filter`` property is not supported and will be ignored.

Aggregate Relations
^^^^^^^^^^^^^^^^^^^

* At most one grouping set is supported.
* Each grouping expression must be a direct reference.
* Each measure's arguments must be direct references.
* A measure may not have a filter.
* A measure may not have sorts.
* A measure's invocation must be ``AGGREGATION_INVOCATION_ALL`` or
  ``AGGREGATION_INVOCATION_UNSPECIFIED``.
* A measure's phase must be ``AGGREGATION_PHASE_INITIAL_TO_RESULT``.

Expressions (general)
^^^^^^^^^^^^^^^^^^^^^

* Various places in the Substrait spec allow expressions to be used outside
  of a filter or project relation, for example in a join expression or an
  aggregate grouping set.  Acero typically expects these expressions to be
  direct references.  Planners should extract the implicit projection into a
  formal project relation before delivering the plan to Acero.

Literals
^^^^^^^^

* A literal with non-default nullability will cause a plan to be rejected.

Types
^^^^^

* Acero does not have full support for non-nullable types and may allow input
  to have nulls without rejecting it.
* The table below shows the mapping between Arrow types and Substrait type
  classes that are currently supported.

.. list-table:: Substrait / Arrow Type Mapping
   :widths: 25 25 50
   :header-rows: 1

   * - Substrait Type
     - Arrow Type
     - Caveat
   * - boolean
     - boolean
     -
   * - i8
     - int8
     -
   * - i16
     - int16
     -
   * - i32
     - int32
     -
   * - i64
     - int64
     -
   * - fp32
     - float32
     -
   * - fp64
     - float64
     -
   * - string
     - string
     -
   * - binary
     - binary
     -
   * - timestamp
     - timestamp<MICRO,"">
     -
   * - timestamp_tz
     - timestamp<MICRO,"UTC">
     -
   * - date
     - date32<DAY>
     -
   * - time
     - time64<MICRO>
     -
   * - interval_year
     -
     - Not currently supported
   * - interval_day
     -
     - Not currently supported
   * - uuid
     -
     - Not currently supported
   * - FIXEDCHAR<L>
     -
     - Not currently supported
   * - VARCHAR<L>
     -
     - Not currently supported
   * - FIXEDBINARY<L>
     - fixed_size_binary<L>
     -
   * - DECIMAL<P,S>
     - decimal128<P,S>
     -
   * - STRUCT<T1...TN>
     - struct<T1...TN>
     - Arrow struct fields will have no name (empty string)
   * - NSTRUCT<N:T1...N:Tn>
     -
     - Not currently supported
   * - LIST<T>
     - list<T>
     -
   * - MAP<K,V>
     - map<K,V>
     - K must not be nullable

Functions
^^^^^^^^^

* The following functions have caveats or are not supported at all.  Note that
  this is not a comprehensive list.  Functions are being added to Substrait at
  a rapid pace and new functions may be missing.

  * Acero does not support the ``SATURATE`` option for overflow.
  * Acero does not support kernels that take more than two arguments
    for the functions ``and``, ``or``, ``xor``.

* Substrait has not yet clearly identified the form that URIs should take for
  standard functions.  Acero expects URIs that point to the ``main`` branch of
  the Substrait GitHub repository.  For example, for the file
  ``functions_arithmetic.yaml`` Acero expects
  ``https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml``

  * Acero has functions that are not yet a part of Substrait (or may never be added
    as official functions).  To invoke these functions you can use the special URI
    ``urn:arrow:substrait_simple_extension_function``.  If this URI is encountered
    then Acero will match only on function name and will ignore any function options.

  * Alternatively, the URI can be left completely empty and Acero will match
    based only on function name.  This fallback mechanism is non-standard and should
    be considered deprecated in favor of the special URI above.
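
The URI conventions above can be captured in a few constants on the producer
side.  This is a minimal sketch: the two URI strings are taken from the text
above, while the constant names and the ``StandardExtensionUri`` helper are
hypothetical and not part of the Arrow API.

.. code-block:: cpp

   #include <cassert>
   #include <string>

   // Location Acero expects for standard Substrait extension files.
   const std::string kStandardExtensionBase =
       "https://github.com/substrait-io/substrait/blob/main/extensions/";

   // Special URI for Acero functions that are not part of Substrait.
   const std::string kArrowSimpleExtensionUri =
       "urn:arrow:substrait_simple_extension_function";

   // Hypothetical helper: builds the URI for a standard extension file name
   // such as "functions_arithmetic.yaml".
   inline std::string StandardExtensionUri(const std::string& yaml_file_name) {
     assert(!yaml_file_name.empty());
     return kStandardExtensionBase + yaml_file_name;
   }

A producer would use ``StandardExtensionUri`` for functions defined by
Substrait and ``kArrowSimpleExtensionUri`` for Acero-only functions, rather
than relying on the deprecated empty-URI fallback.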