File: CanonicalExtensions.rst

package info (click to toggle)
apache-arrow 23.0.1-1
  • links: PTS
  • area: main
  • in suites: sid
  • size: 76,220 kB
  • sloc: cpp: 654,608; python: 70,522; ruby: 45,964; ansic: 18,742; sh: 7,365; makefile: 669; javascript: 125; xml: 41
file content (591 lines) | stat: -rw-r--r-- 24,558 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _format_canonical_extensions:

*************************
Canonical Extension Types
*************************

============
Introduction
============

The Arrow columnar format allows defining
:ref:`extension types <format_metadata_extension_types>` so as to extend
standard Arrow data types with custom semantics.  Often these semantics
will be specific to a system or application.  However, it is beneficial
to share the definitions of well-known extension types so as to improve
interoperability between different systems integrating Arrow columnar data.

Standardization
===============

These rules must be followed for the standardization of canonical extension
types:

* Canonical extension types are described and maintained below in this document.

* Each canonical extension type requires a distinct discussion and vote
  on the `Arrow development mailing-list <https://arrow.apache.org/community/>`__.

* The specification text to be added *must* follow these requirements:

  1) It *must* define a well-defined extension name starting with "``arrow.``".

  2) Its parameters, if any, *must* be described in the proposal.

  3) Its serialization *must* be described in the proposal and should
     not require unduly implementation work or unusual software dependencies
     (for example, a trivial custom text format or a JSON-based format would be acceptable).

  4) Its expected semantics *should* be described as well and any
     potential ambiguities or pain points addressed or at least mentioned.

* The extension type *should* have one implementation submitted;
  preferably two if non-trivial (for example if parameterized).

Making Modifications
====================

Like standard Arrow data types, canonical extension types should be considered
stable once standardized.  Modifying a canonical extension type (for example
to expand the set of parameters) should be an exceptional event, follow the
same rules as laid out above, and provide backwards compatibility guarantees.


=============
Official List
=============

.. _fixed_shape_tensor_extension:

Fixed shape tensor
==================

* Extension name: ``arrow.fixed_shape_tensor``.

* The storage type of the extension: ``FixedSizeList`` where:

  * **value_type** is the data type of individual tensor elements.
  * **list_size** is the product of all the elements in tensor shape.

* Extension type parameters:

  * **value_type** = the Arrow data type of individual tensor elements.
  * **shape** = the physical shape of the contained tensors
    as an array.

  Optional parameters describing the logical layout:

  * **dim_names** = explicit names to tensor dimensions
    as an array. The length of it should be equal to the shape
    length and equal to the number of dimensions.

    ``dim_names`` can be used if the dimensions have well-known
    names and they map to the physical layout (row-major).

  * **permutation**  = indices of the desired ordering of the
    original dimensions, defined as an array.

    The indices contain a permutation of the values [0, 1, .., N-1] where
    N is the number of dimensions. The permutation indicates which
    dimension of the logical layout corresponds to which dimension of the
    physical tensor (the i-th dimension of the logical view corresponds
    to the dimension with number ``permutations[i]`` of the physical tensor).

    Permutation can be useful in case the logical order of
    the tensor is a permutation of the physical order (row-major).

    When logical and physical layout are equal, the permutation will always
    be ([0, 1, .., N-1]) and can therefore be left out.

* Description of the serialization:

  The metadata must be a valid JSON object including shape of
  the contained tensors as an array with key **"shape"** plus optional
  dimension names with keys **"dim_names"** and ordering of the
  dimensions with key **"permutation"**.

  - Example: ``{ "shape": [2, 5]}``
  - Example with ``dim_names`` metadata for NCHW ordered data:

    ``{ "shape": [100, 200, 500], "dim_names": ["C", "H", "W"]}``

  - Example of permuted 3-dimensional tensor:

    ``{ "shape": [100, 200, 500], "permutation": [2, 0, 1]}``

    This is the physical layout shape and the shape of the logical
    layout would in this case be ``[500, 100, 200]``.

.. note::

  Elements in a fixed shape tensor extension array are stored
  in row-major/C-contiguous order.

.. note::

  Other Data Structures in Arrow include a
  `Tensor (Multi-dimensional Array) <https://arrow.apache.org/docs/format/Other.html>`_
  to be used as a message in the interprocess communication machinery (IPC).

  This structure has no relationship with the Fixed shape tensor extension type defined
  by this specification. Instead, this extension type lets one use fixed shape tensors
  as elements in a field of a RecordBatch or a Table.

.. _variable_shape_tensor_extension:

Variable shape tensor
=====================

* Extension name: ``arrow.variable_shape_tensor``.

* The storage type of the extension is: ``StructArray`` where struct
  is composed of **data** and **shape** fields describing a single
  tensor per row:

  * **data** is a ``List`` holding tensor elements (each list element is
    a single tensor). The List's value type is the value type of the tensor,
    such as an integer or floating-point type.
  * **shape** is a ``FixedSizeList<int32>[ndim]`` of the tensor shape where
    the size of the list ``ndim`` is equal to the number of dimensions of the
    tensor.

* Extension type parameters:

  * **value_type** = the Arrow data type of individual tensor elements.

  Optional parameters describing the logical layout:

  * **dim_names** = explicit names to tensor dimensions
    as an array. The length of it should be equal to the shape
    length and equal to the number of dimensions.

    ``dim_names`` can be used if the dimensions have well-known
    names and they map to the physical layout (row-major).

  * **permutation**  = indices of the desired ordering of the
    original dimensions, defined as an array.

    The indices contain a permutation of the values [0, 1, .., N-1] where
    N is the number of dimensions. The permutation indicates which
    dimension of the logical layout corresponds to which dimension of the
    physical tensor (the i-th dimension of the logical view corresponds
    to the dimension with number ``permutations[i]`` of the physical tensor).

    Permutation can be useful in case the logical order of
    the tensor is a permutation of the physical order (row-major).

    When logical and physical layout are equal, the permutation will always
    be ([0, 1, .., N-1]) and can therefore be left out.

  * **uniform_shape** = sizes of individual tensor's dimensions which are
    guaranteed to stay constant in uniform dimensions and can vary in
    non-uniform dimensions. This holds over all tensors in the array.
    Sizes in uniform dimensions are represented with int32 values, while
    sizes of the non-uniform dimensions are not known in advance and are
    represented with null. If ``uniform_shape`` is not provided it is assumed
    that all dimensions are non-uniform.
    An array containing a tensor with shape (2, 3, 4) and whose first and
    last dimensions are uniform would have ``uniform_shape`` (2, null, 4).
    This allows for interpreting the tensor correctly without accounting for
    uniform dimensions while still permitting optional optimizations that
    take advantage of the uniformity.

* Description of the serialization:

  The metadata must be a valid JSON object that optionally includes
  dimension names with keys **"dim_names"** and  ordering of dimensions
  with key **"permutation"**.
  Shapes of tensors can be defined in a subset of dimensions by providing
  key **"uniform_shape"**.
  Minimal metadata is an empty string.

  - Example with ``dim_names`` metadata for NCHW ordered data (note that the first
    logical dimension, ``N``, is mapped to the **data** List array: each element in the List
    is a CHW tensor and the List of tensors implicitly constitutes a single NCHW tensor):

    ``{ "dim_names": ["C", "H", "W"] }``

  - Example with ``uniform_shape`` metadata for a set of color images
    with fixed height, variable width and three color channels:

    ``{ "dim_names": ["H", "W", "C"], "uniform_shape": [400, null, 3] }``

  - Example of permuted 3-dimensional tensor:

    ``{ "permutation": [2, 0, 1] }``

    For example, if the physical **shape** of an individual tensor
    is ``[100, 200, 500]``, this permutation would denote a logical shape
    of ``[500, 100, 200]``.

.. note::

  With the exception of ``permutation``, the parameters and storage
  of VariableShapeTensor relate to the *physical* storage of the tensor.

  For example, consider a tensor with::
    shape = [10, 20, 30]
    dim_names = [x, y, z]
    permutations = [2, 0, 1]

  This means the logical tensor has names [z, x, y] and shape [30, 10, 20].

.. note::
   Values inside each **data** tensor element are stored in row-major/C-contiguous
   order according to the corresponding **shape**.

.. _json_extension:

JSON
====

* Extension name: ``arrow.json``.

* The storage type of this extension is ``String`` or
  ``LargeString`` or ``StringView``.
  Only UTF-8 encoded JSON as specified in `rfc8259`_ is supported.

* Extension type parameters:

  This type does not have any parameters.

* Description of the serialization:

  Metadata is either an empty string or a JSON string with an empty object.
  In the future, additional fields may be added, but they are not required
  to interpret the array.

.. _uuid_extension:

UUID
====

* Extension name: ``arrow.uuid``.

* The storage type of the extension is ``FixedSizeBinary`` with a length of 16 bytes.

.. note::
   A specific UUID version is not required or guaranteed. This extension represents
   UUIDs as FixedSizeBinary(16) with big-endian notation and does not interpret the bytes in any way.

Opaque
======

Opaque represents a type that an Arrow-based system received from an external
(often non-Arrow) system, but that it cannot interpret.  In this case, it can
pass on Opaque to its clients to at least show that a field exists and
preserve metadata about the type from the other system.

Extension parameters:

* Extension name: ``arrow.opaque``.

* The storage type of this extension is any type.  If there is no underlying
  data, the storage type should be Null.

* Extension type parameters:

  * **type_name** = the name of the unknown type in the external system.
  * **vendor_name** = the name of the external system.

* Description of the serialization:

  A valid JSON object containing the parameters as fields.  In the future,
  additional fields may be added, but all fields current and future are never
  required to interpret the array.

  Developers **should not** attempt to enable public semantic interoperability
  of Opaque by canonicalizing specific values of these parameters.

Rationale
---------

Interfacing with non-Arrow systems requires a way to handle data that doesn't
have an equivalent Arrow type.  In this case, use the Opaque type, which
explicitly represents an unsupported field.  Other solutions are inadequate:

* Raising an error means even one unsupported field makes all operations
  impossible, even if (for instance) the user is just trying to view a schema.
* Dropping unsupported columns misleads the user as to the actual schema.
* An extension type may not exist for the unsupported type.
* Generating an extension type on the fly would falsely imply support.

Applications **should not** make conventions around vendor_name and type_name.
These parameters are meant for human end users to understand what type wasn't
supported.  Applications may try to interpret these fields, but must be
prepared for breakage (e.g., when the type becomes supported with a custom
extension type later on).  Similarly, **Opaque is not a generic container for
file formats**.  Considerations such as MIME types are irrelevant.  In both of
these cases, create a custom extension type instead.

Examples:

* A Flight SQL service that supports connecting external databases may
  encounter columns with unsupported types in external tables.  In this case,
  it can use the Opaque[Null] type to at least report that a column exists
  with a particular name and type name.  This lets clients know that a column
  exists, but is not supported.  Null is used as the storage type here because
  only schemas are involved.

  An example of the extension metadata would be::

    {"type_name": "varray", "vendor_name": "Oracle"}

* The ADBC PostgreSQL driver gets results as a series of length-prefixed byte
  fields.  But the driver will not always know how to parse the bytes, as
  there may be extensions (e.g. PostGIS).  It can use Opaque[Binary] to still
  return those bytes to the application, which may be able to parse the data
  itself.  Opaque differentiates the column from an actual binary column and
  makes it clear that the value is directly from PostgreSQL.  (A custom
  extension type is preferred, but there will always be extensions that the
  driver does not know about.)

  An example of the extension metadata would be::

    {"type_name": "geometry", "vendor_name": "PostGIS"}

* The ADBC PostgreSQL driver may also know how to parse the bytes, but not
  know the intended semantics.  For example, `composite types
  <https://www.postgresql.org/docs/current/rowtypes.html>`_ can add new
  semantics to existing types, somewhat like Arrow extension types.  The
  driver would be able to parse the underlying bytes in this case, but would
  still use the Opaque type.

  Consider the example in the PostgreSQL documentation of a ``complex`` type.
  Mapping the type to a plain Arrow ``struct`` type would lose meaning, just
  like how an Arrow system deciding to treat all extension types by dropping
  the extension metadata would be undesirable.  Instead, the driver can use
  Opaque[Struct] to pass on the composite type info.  (It would be wrong to
  try to map this to an Arrow-defined complex type: it does not know the
  proper semantics of a user-defined type, which cannot and should not be
  hardcoded into the driver in the first place.)

  An example of the extension metadata would be::

    {"type_name": "database_name.schema_name.complex", "vendor_name": "PostgreSQL"}

* The JDBC adapter in the Arrow Java libraries converts JDBC result sets into
  Arrow arrays, and can get Arrow schemas from result sets.  JDBC, however,
  allows drivers to return `arbitrary Java objects
  <https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#OTHER>`_.

  The driver can use Opaque[Null] as a placeholder during schema conversion,
  only erroring if the application tries to fetch the actual data.  That way,
  clients can at least introspect result schemas to decide whether it can
  proceed to fetch the data, or only query certain columns.

  An example of the extension metadata would be::

    {"type_name": "OTHER", "vendor_name": "JDBC driver name"}

8-bit Boolean
=============

Bool8 represents a boolean value using 1 byte (8 bits) to store each value instead of only 1 bit as in
the original Arrow Boolean type. Although less compact than the original representation, Bool8 may have
better zero-copy compatibility with various systems that also store booleans using 1 byte.

* Extension name: ``arrow.bool8``.

* The storage type of this extension is ``Int8`` where:

  * **false** is denoted by the value ``0``.
  * **true** can be specified using any non-zero value. Preferably ``1``.

* Extension type parameters:

  This type does not have any parameters.

* Description of the serialization:

  Metadata is an empty string.

.. _parquet_variant_extension:

Parquet Variant
===============

Variant represents a value that may be one of:

* Primitive: a type and corresponding value (e.g. ``INT``, ``STRING``)

* Array: An ordered list of Variant values

* Object: An unordered collection of string/Variant pairs (i.e. key/value pairs). An object may not contain duplicate keys

Particularly, this provides a way to represent semi-structured data which is stored as a
`Parquet Variant <https://github.com/apache/parquet-format/blob/master/VariantEncoding.md>`__ value within Arrow columns in
a lossless fashion. This also provides the ability to represent `shredded <https://github.com/apache/parquet-format/blob/master/VariantShredding.md>`__
variant values. The canonical extension type allows systems to pass Variant encoded data around without special handling unless
they want to directly interact with the encoded variant data. See the Parquet format specification for details on what the actual
binary values look like.

* Extension name: ``arrow.parquet.variant``.

* The storage type of this extension is a ``Struct`` that obeys the following rules:

  * A *non-nullable* field named ``metadata`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.

  * At least one (or both) of the following:

    * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.
      (unshredded variants consist of just the ``metadata`` and ``value`` fields only)

    * A field named ``typed_value`` which can be a :ref:`variant_primitive_type_mapping` or a ``List``, ``LargeList``, ``ListView`` or ``Struct``

      * If the ``typed_value`` field is a ``List``, ``LargeList`` or ``ListView`` its elements **must** be *non-nullable* and **must**
        be a ``Struct`` consisting of at least one (or both) of the following:

        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.

        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).

      * If the ``typed_value`` field is a ``Struct``, then its fields **must** be *non-nullable*, representing the fields being shredded
        from the objects, and **must** be a ``Struct`` consisting of at least one (or both) of the following:

        * A field named ``value`` which is of type ``Binary``, ``LargeBinary``, or ``BinaryView``.

        * A field named ``typed_value`` which follows the rules outlined above (this allows for arbitrarily nested data).

* Extension type parameters:

  This type does not have any parameters.

* Description of the serialization:

  Extension metadata is an empty string.

.. note::

   It is also *permissible* for the ``metadata`` field to be dictionary-encoded with a preferred (*but not required*) index type of ``int8``,
   or run-end-encoded with a preferred (*but not required*) runs type of ``int8``.

.. note::

   The fields may be in any order, and thus must be accessed by **name** not by *position*. The field names are case sensitive.

.. _variant_primitive_type_mapping:

Primitive Type Mappings
-----------------------

+----------------------+------------------------+
| Arrow Primitive Type | Variant Primitive Type |
+======================+========================+
| Null                 | Null                   |
+----------------------+------------------------+
| Boolean              | Boolean (true/false)   |
+----------------------+------------------------+
| Int8                 | Int8                   |
+----------------------+------------------------+
| Uint8                | Int16                  |
+----------------------+------------------------+
| Int16                | Int16                  |
+----------------------+------------------------+
| Uint16               | Int32                  |
+----------------------+------------------------+
| Int32                | Int32                  |
+----------------------+------------------------+
| Uint32               | Int64                  |
+----------------------+------------------------+
| Int64                | Int64                  |
+----------------------+------------------------+
| Float                | Float                  |
+----------------------+------------------------+
| Double               | Double                 |
+----------------------+------------------------+
| Decimal32            | decimal4               |
+----------------------+------------------------+
| Decimal64            | decimal8               |
+----------------------+------------------------+
| Decimal128           | decimal16              |
+----------------------+------------------------+
| Date32               | Date                   |
+----------------------+------------------------+
| Time64               | TimeNTZ                |
+----------------------+------------------------+
| Timestamp(us, UTC)   | Timestamp (micro)      |
+----------------------+------------------------+
| Timestamp(us)        | TimestampNTZ (micro)   |
+----------------------+------------------------+
| Timestamp(ns, UTC)   | Timestamp (nano)       |
+----------------------+------------------------+
| Timestamp(ns)        | TimestampNTZ (nano)    |
+----------------------+------------------------+
| Binary               | Binary                 |
+----------------------+------------------------+
| LargeBinary          | Binary                 |
+----------------------+------------------------+
| BinaryView           | Binary                 |
+----------------------+------------------------+
| String               | String                 |
+----------------------+------------------------+
| LargeString          | String                 |
+----------------------+------------------------+
| StringView           | String                 |
+----------------------+------------------------+
| UUID extension type  | UUID                   |
+----------------------+------------------------+

.. _timestamp_with_offset_extension:

Timestamp With Offset
=====================
This type represents a timestamp column that stores potentially different timezone offsets per value. The timestamp is stored in UTC alongside the original timezone offset in minutes.
This extension type is intended to be compatible with ANSI SQL's ``TIMESTAMP WITH TIME ZONE``, which is supported by multiple database engines.

* Extension name: ``arrow.timestamp_with_offset``.

* The storage type of the extension is a ``Struct`` with 2 fields, in order:

  * ``timestamp``: a non-nullable ``Timestamp(time_unit, "UTC")``, where ``time_unit`` is any Arrow ``TimeUnit`` (s, ms, us or ns).

  * ``offset_minutes``: a non-nullable signed 16-bit integer (``Int16``) representing the offset in minutes from the UTC timezone. Negative offsets represent time zones west of UTC, while positive offsets represent east. Offsets normally range from -779 (-12:59) to +780 (+13:00).

* Extension type parameters:

  This type does not have any parameters.

* Description of the serialization:

  Extension metadata is an empty string.

.. note::

   It is also *permissible* for the ``offset_minutes`` field to be dictionary-encoded or run-end-encoded.

Community Extension Types
=========================

In addition to the canonical extension types listed above, there exist Arrow
extension types that have been established as standards within specific domain
areas. These have not been officially designated as canonical through a
discussion and vote on the Arrow development mailing list but are well known
within subcommunities of Arrow developers.

GeoArrow
========

`GeoArrow <https://github.com/geoarrow/geoarrow>`_ defines a collection of
Arrow extension types for representing vector geometries. It is well known
within the Arrow geospatial subcommunity. The GeoArrow specification is not yet
finalized.

.. _rfc8259: https://datatracker.ietf.org/doc/html/rfc8259