File: CDeviceDataInterface.rst

package info (click to toggle)
apache-arrow 23.0.1-1
  • links: PTS
  • area: main
  • in suites: sid
  • size: 76,220 kB
  • sloc: cpp: 654,608; python: 70,522; ruby: 45,964; ansic: 18,742; sh: 7,365; makefile: 669; javascript: 125; xml: 41
file content (1068 lines) | stat: -rw-r--r-- 46,499 bytes parent folder | download | duplicates (5)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. highlight:: c

.. _c-device-data-interface:

=================================
The Arrow C Device data interface
=================================

.. warning:: The Arrow C Device Data Interface should be considered experimental

Rationale
=========

The current :ref:`C Data Interface <c-data-interface>`, and most
implementations of it, make the assumption that all data buffers provided
are CPU buffers. Since Apache Arrow is designed to be a universal in-memory
format for representing tabular ("columnar") data, there will be the desire
to leverage this data on non-CPU hardware such as GPUs. One example of such
a case is the `RAPIDS cuDF library`_ which uses the Arrow memory format with
CUDA for NVIDIA GPUs. Since copying data from host to device and back is
expensive, the ideal would be to be able to leave the data on the device
for as long as possible, even when passing it between runtimes and
libraries.

The Arrow C Device data interface builds on the existing C data interface
by adding a very small, stable set of C definitions to it. These definitions
are equivalents to the ``ArrowArray`` and ``ArrowArrayStream`` structures
from the C Data Interface which add members to allow specifying the device
type and pass necessary information to synchronize with the producer.
For non-C/C++ languages and runtimes, translating the C definitions to
corresponding C FFI declarations should be just as simple as with the
current C data interface.

Applications and libraries can then use Arrow schemas and Arrow formatted
memory on non-CPU devices to exchange data just as easily as they do
now with CPU data. This will enable leaving data on those devices longer
and avoiding costly copies back and forth between the host and device
just to leverage new libraries and runtimes.

Goals
-----

* Expose an ABI-stable interface built on the existing C data interface.
* Make it easy for third-party projects to implement support with little
  initial investment.
* Allow zero-copy sharing of Arrow formatted device memory between
  independent runtimes and components running in the same process.
* Avoid the need for one-to-one adaptation layers such as the
  `CUDA Array Interface`_ for Python processes to pass CUDA data.
* Enable integration without explicit dependencies (either at compile-time
  or runtime) on the Arrow software project itself.

The intent is for the Arrow C Device data interface to expand the reach
of the current C data interface, allowing it to also become the standard
low-level building block for columnar processing on devices like GPUs or
FPGAs.

Structure definitions
=====================

Because this is built on the C data interface, the C Device data interface
uses the ``ArrowSchema`` and ``ArrowArray`` structures as defined in the
:ref:`C data interface spec <c-data-interface-struct-defs>`. It then adds the
following free-standing definitions. Like the rest of the Arrow project,
they are available under the Apache License 2.0.

.. code-block:: c

    #ifndef ARROW_C_DEVICE_DATA_INTERFACE
    #define ARROW_C_DEVICE_DATA_INTERFACE

    // Device type for the allocated memory
    typedef int32_t ArrowDeviceType;

    // CPU device, same as using ArrowArray directly
    #define ARROW_DEVICE_CPU 1
    // CUDA GPU Device
    #define ARROW_DEVICE_CUDA 2
    // Pinned CUDA CPU memory by cudaMallocHost
    #define ARROW_DEVICE_CUDA_HOST 3
    // OpenCL Device
    #define ARROW_DEVICE_OPENCL 4
    // Vulkan buffer for next-gen graphics
    #define ARROW_DEVICE_VULKAN 7
    // Metal for Apple GPU
    #define ARROW_DEVICE_METAL 8
    // Verilog simulator buffer
    #define ARROW_DEVICE_VPI 9
    // ROCm GPUs for AMD GPUs
    #define ARROW_DEVICE_ROCM 10
    // Pinned ROCm CPU memory allocated by hipMallocHost
    #define ARROW_DEVICE_ROCM_HOST 11
    // Reserved for extension
    //
    // used to quickly test extension devices, semantics
    // can differ based on implementation
    #define ARROW_DEVICE_EXT_DEV 12
    // CUDA managed/unified memory allocated by cudaMallocManaged
    #define ARROW_DEVICE_CUDA_MANAGED 13
    // Unified shared memory allocated on a oneAPI
    // non-partitioned device.
    //
    // A call to the oneAPI runtime is required to determine the
    // device type, the USM allocation type and the sycl context
    // that it is bound to.
    #define ARROW_DEVICE_ONEAPI 14
    // GPU support for next-gen WebGPU standard
    #define ARROW_DEVICE_WEBGPU 15
    // Qualcomm Hexagon DSP
    #define ARROW_DEVICE_HEXAGON 16

    struct ArrowDeviceArray {
      struct ArrowArray array;
      int64_t device_id;
      ArrowDeviceType device_type;
      void* sync_event;

      // reserved bytes for future expansion
      int64_t reserved[3];
    };

    #endif  // ARROW_C_DEVICE_DATA_INTERFACE

.. note::
   The canonical guard ``ARROW_C_DEVICE_DATA_INTERFACE`` is meant to avoid
   duplicate definitions if two projects copy the definitions in their own
   headers, and a third-party project includes from these two projects. It
   is therefore important that this guard is kept exactly as-is when these
   definitions are copied.

ArrowDeviceType
---------------

The ``ArrowDeviceType`` typedef is used to indicate what type of device the
provided memory buffers were allocated on. This, in conjunction with the
``device_id``, should be sufficient to reference the correct data buffers.

We then use macros to define values for different device types. The provided
macro values are compatible with the widely used `dlpack`_ ``DLDeviceType``
definition values, using the same value for each as the equivalent
``kDL<type>`` enum from ``dlpack.h``. The list will be kept in sync with those
equivalent enum values over time to ensure compatibility, rather than
potentially diverging. To avoid the Arrow project having to be in the
position of vetting new hardware devices, new additions should first be
added to dlpack before we add a corresponding macro here.

To ensure predictability with the ABI, we use macros instead of an ``enum``
so the storage type is not compiler dependent.

.. c:macro:: ARROW_DEVICE_CPU

    CPU Device, equivalent to just using ``ArrowArray`` directly instead of
    using ``ArrowDeviceArray``.

.. c:macro:: ARROW_DEVICE_CUDA

    A `CUDA`_ GPU Device. This could represent data allocated either with the
    runtime library (``cudaMalloc``) or the device driver (``cuMemAlloc``).

.. c:macro:: ARROW_DEVICE_CUDA_HOST

    CPU memory that was pinned and page-locked by CUDA by using
    ``cudaMallocHost`` or ``cuMemAllocHost``.

.. c:macro:: ARROW_DEVICE_OPENCL

    Data allocated on the device by using the `OpenCL (Open Computing Language)`_
    framework.

.. c:macro:: ARROW_DEVICE_VULKAN

    Data allocated by the `Vulkan`_ framework and libraries.

.. c:macro:: ARROW_DEVICE_METAL

    Data on Apple GPU devices using the `Metal`_ framework and libraries.

.. c:macro:: ARROW_DEVICE_VPI

    Indicates usage of a Verilog simulator buffer.

.. c:macro:: ARROW_DEVICE_ROCM

    An AMD device using the `ROCm`_ stack.

.. c:macro:: ARROW_DEVICE_ROCM_HOST

    CPU memory that was pinned and page-locked by ROCm by using ``hipMallocHost``.

.. c:macro:: ARROW_DEVICE_EXT_DEV

    This value is an escape-hatch for devices to extend which aren't
    currently represented otherwise. Producers would need to provide
    additional information/context specific to the device if using
    this device type. This is used to quickly test extension devices
    and semantics can differ based on the implementation.

.. c:macro:: ARROW_DEVICE_CUDA_MANAGED

    CUDA managed/unified memory which is allocated by ``cudaMallocManaged``.

.. c:macro:: ARROW_DEVICE_ONEAPI

    Unified shared memory allocated on an Intel `oneAPI`_ non-partitioned
    device. A call to the ``oneAPI`` runtime is required to determine
    the specific device type, the USM allocation type and the sycl context
    that it is bound to.

.. c:macro:: ARROW_DEVICE_WEBGPU

    GPU support for next-gen WebGPU standards

.. c:macro:: ARROW_DEVICE_HEXAGON

    Data allocated on a Qualcomm Hexagon DSP device.

The ArrowDeviceArray structure
------------------------------

The ``ArrowDeviceArray`` structure embeds the C data ``ArrowArray`` structure
and adds additional information necessary for consumers to use the data. It
has the following fields:

.. c:member:: struct ArrowArray ArrowDeviceArray.array

    *Mandatory.* The allocated array data. The values in the ``void**`` buffers (along
    with the buffers of any children) are what is allocated on the device.
    The buffer values should be device pointers. The rest of the structure
    should be accessible to the CPU.

    The ``private_data`` and ``release`` callback of this structure should
    contain any necessary information and structures related to freeing
    the array according to the device it is allocated on, rather than
    having a separate release callback and ``private_data`` pointer here.

.. c:member:: int64_t ArrowDeviceArray.device_id

    *Mandatory.* The device id to identify a specific device if multiple devices of this
    type are on the system. The semantics of the id will be hardware dependent,
    but we use an ``int64_t`` to future-proof the id as devices change over time.

    For device types that do not have an intrinsic notion of a device identifier (e.g.,
    ``ARROW_DEVICE_CPU``), it is recommended to use a ``device_id`` of -1 as a
    convention.

.. c:member:: ArrowDeviceType ArrowDeviceArray.device_type

    *Mandatory.* The type of the device which can access the buffers in the array.

.. c:member:: void* ArrowDeviceArray.sync_event

    *Optional.* An event-like object to synchronize on if needed.

    Many devices, like GPUs, are primarily asynchronous with respect to
    CPU processing. As such, in order to safely access device memory, it is often
    necessary to have an object to synchronize processing with. Since
    different devices will use different types to specify this, we use a
    ``void*`` which can be coerced into a pointer to whatever the device
    appropriate type is.

    If synchronization is not needed, this can be null. If this is non-null
    then it MUST be used to call the appropriate sync method for the device
    (e.g. ``cudaStreamWaitEvent`` or ``hipStreamWaitEvent``) before attempting
    to access the memory in the buffers.

    If an event is provided, then the producer MUST ensure that the exported
    data is available on the device before the event is triggered. The
    consumer SHOULD wait on the event before trying to access the exported
    data.

.. seealso::
    The :ref:`synchronization event types <c-device-data-interface-event-types>`
    section below.

.. c:member:: int64_t ArrowDeviceArray.reserved[3]

    As non-CPU development expands, there may be a need to expand this
    structure. In order to do so without potentially breaking ABI changes,
    we reserve 24 bytes at the end of the object. These bytes MUST be zero'd
    out after initialization by the producer in order to ensure safe
    evolution of the ABI in the future.

.. _c-device-data-interface-event-types:

Synchronization event types
---------------------------

The table below lists the expected event types for each device type.
If no event type is supported ("N/A"), then the ``sync_event`` member
should always be null.

Remember that the event *CAN* be null if synchronization is not needed
to access the data.

+---------------------------+--------------------+---------+
| Device Type               | Actual Event Type  | Notes   |
+===========================+====================+=========+
| ARROW_DEVICE_CPU          | N/A                |         |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_CUDA         | ``cudaEvent_t*``   |         |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_CUDA_HOST    | ``cudaEvent_t*``   |         |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_OPENCL       | ``cl_event*``      |         |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_VULKAN       | ``VkEvent*``       |         |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_METAL        | ``MTLEvent*``      |         |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_VPI          | N/A                | (1)     |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_ROCM         | ``hipEvent_t*``    |         |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_ROCM_HOST    | ``hipEvent_t*``    |         |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_EXT_DEV      |                    | (2)     |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_CUDA_MANAGED | ``cudaEvent_t*``   |         |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_ONEAPI       | ``sycl::event*``   |         |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_WEBGPU       | N/A                | (1)     |
+---------------------------+--------------------+---------+
| ARROW_DEVICE_HEXAGON      | N/A                | (1)     |
+---------------------------+--------------------+---------+

Notes:

* \(1) Currently unknown if framework has an event type to support.
* \(2) Extension Device has producer defined semantics and thus if
  synchronization is needed for an extension device, the producer
  should document the type.

.. _c-device-data-interface-semantics:

Semantics
=========

Memory management
-----------------

First and foremost: Out of everything in this interface, it is *only* the
data buffers themselves which reside in device memory (i.e. the ``buffers``
member of the ``ArrowArray`` struct). Everything else should be in CPU
memory.

The ``ArrowDeviceArray`` structure contains an ``ArrowArray`` object which
itself has :ref:`specific semantics <c-data-interface-semantics>` for releasing
memory. The term *"base structure"* below refers to the ``ArrowDeviceArray``
object that is passed directly between the producer and consumer -- not any
child structure thereof.

It is intended for the base structure to be stack- or heap-allocated by the
*consumer*. In this case, the producer API should take a pointer to the
consumer-allocated structure.

However, any data pointed to by the struct MUST be allocated and maintained
by the producer. This includes the ``sync_event`` member if it is not null,
along with any pointers in the ``ArrowArray`` object as usual. Data lifetime
is managed through the ``release`` callback of the ``ArrowArray`` member.

For an ``ArrowDeviceArray``, the semantics of a released structure and the
callback semantics are identical to those for
:ref:`ArrowArray itself <c-data-interface-released>`. Any producer specific context
information necessary for releasing the device data buffers, in addition to
any allocated event, should be stored in the ``private_data`` member of
the ``ArrowArray`` and managed by the ``release`` callback.

Moving an array
'''''''''''''''

The consumer can *move* the ``ArrowDeviceArray`` structure by bitwise copying
or shallow member-wise copying. Then it MUST mark the source structure released
by setting the ``release`` member of the embedded ``ArrowArray`` structure to
``NULL``, but *without* calling that release callback. This ensures that only
one live copy of the struct is active at any given time and that lifetime is
correctly communicated to the producer.

As usual, the release callback will be called on the destination structure
when it is not needed anymore.

Record batches
--------------
As with the C data interface itself, a record batch can be trivially considered
as an equivalent struct array. In this case the metadata of the top-level
``ArrowSchema`` can be used for schema-level metadata of the record batch.

Mutability
----------

Both the producer and the consumer SHOULD consider the exported data (that
is, the data reachable on the device through the ``buffers`` member of
the embedded ``ArrowArray``) to be immutable, as either party could otherwise
see inconsistent data while the other is mutating it.

Synchronization
---------------

If the ``sync_event`` member is non-NULL, the consumer should not attempt
to access or read the data until they have synchronized on that event. If
the ``sync_event`` member is NULL, then it MUST be safe to access the data
without any synchronization necessary on the part of the consumer.

C producer example
====================

Exporting a simple ``int32`` device array
-----------------------------------------

Export a non-nullable ``int32`` type with empty metadata. An example of this
can be seen in the :ref:`C data interface docs directly <c-data-interface-export-int32-schema>`.

To export the data itself, we transfer ownership to the consumer through
the release callback. This example will use CUDA, but the equivalent calls
could be used for any device:

.. code-block:: c

    static void release_int32_device_array(struct ArrowArray* array) {
        assert(array->n_buffers == 2);
        // destroy the event
        cudaEvent_t* ev_ptr = (cudaEvent_t*)(array->private_data);
        cudaError_t status = cudaEventDestroy(*ev_ptr);
        assert(status == cudaSuccess);
        free(ev_ptr);

        // free the buffers and the buffers array
        status = cudaFree(array->buffers[1]);
        assert(status == cudaSuccess);
        free(array->buffers);

        // mark released
        array->release = NULL;
    }

    void export_int32_device_array(void* cudaAllocedPtr,
                                   cudaStream_t stream,
                                   int64_t length,
                                   struct ArrowDeviceArray* array) {
        // get device id
        int device;
        cudaError_t status;
        status = cudaGetDevice(&device);
        assert(status == cudaSuccess);

        cudaEvent_t* ev_ptr = (cudaEvent_t*)malloc(sizeof(cudaEvent_t));
        assert(ev_ptr != NULL);
        status = cudaEventCreate(ev_ptr);
        assert(status == cudaSuccess);

        // record event on the stream, assuming that the passed in
        // stream is where the work to produce the data will be processing.
        status = cudaEventRecord(*ev_ptr, stream);
        assert(status == cudaSuccess);

        memset(array, 0, sizeof(struct ArrowDeviceArray));
        // initialize fields
        *array = (struct ArrowDeviceArray) {
            .array = (struct ArrowArray) {
                .length = length,
                .null_count = 0,
                .offset = 0,
                .n_buffers = 2,
                .n_children = 0,
                .children = NULL,
                .dictionary = NULL,
                // bookkeeping
                .release = &release_int32_device_array,
                // store the event pointer as private data in the array
                // so that we can access it in the release callback.
                .private_data = (void*)(ev_ptr),
            },
            .device_id = (int64_t)(device),
            .device_type = ARROW_DEVICE_CUDA,
            // pass the event pointer to the consumer
            .sync_event = (void*)(ev_ptr),
        };

        // allocate list of buffers
        array->array.buffers = (const void**)malloc(sizeof(void*) * array->array.n_buffers);
        assert(array->array.buffers != NULL);
        array->array.buffers[0] = NULL;
        array->array.buffers[1] = cudaAllocedPtr;
    }

    // calling the release callback should be done using the array member
    // of the device array.
    static void release_device_array_helper(struct ArrowDeviceArray* arr) {
        arr->array.release(&arr->array);
    }

.. _c-device-stream-interface:

Device Stream Interface
=======================

Like the :ref:`C stream interface <c-stream-interface>`, the C Device data
interface also specifies a higher-level structure for easing communication
of streaming data within a single process.

Semantics
---------

An Arrow C device stream exposes a streaming source of data chunks, each with
the same schema. Chunks are obtained by calling a blocking pull-style iteration
function. It is expected that all chunks should be providing data on the same
device type (but not necessarily the same device id). If it is necessary
to provide a stream of data on multiple device types, a producer should
provide a separate stream object for each device type.

Structure definition
--------------------

The C device stream interface is defined by a single ``struct`` definition:

.. code-block:: c

    #ifndef ARROW_C_DEVICE_STREAM_INTERFACE
    #define ARROW_C_DEVICE_STREAM_INTERFACE

    struct ArrowDeviceArrayStream {
        // device type that all arrays will be accessible from
        ArrowDeviceType device_type;
        // callbacks
        int (*get_schema)(struct ArrowDeviceArrayStream*, struct ArrowSchema*);
        int (*get_next)(struct ArrowDeviceArrayStream*, struct ArrowDeviceArray*);
        const char* (*get_last_error)(struct ArrowDeviceArrayStream*);

        // release callback
        void (*release)(struct ArrowDeviceArrayStream*);

        // opaque producer-specific data
        void* private_data;
    };

    #endif  // ARROW_C_DEVICE_STREAM_INTERFACE

.. note::
    The canonical guard ``ARROW_C_DEVICE_STREAM_INTERFACE`` is meant to avoid
    duplicate definitions if two projects copy the C device stream interface
    definitions into their own headers, and a third-party project includes
    from these two projects. It is therefore important that this guard is
    kept exactly as-is when these definitions are copied.

The ArrowDeviceArrayStream structure
''''''''''''''''''''''''''''''''''''

The ``ArrowDeviceArrayStream`` provides a device type that can access the
resulting data along with the required callbacks to interact with a
streaming source of Arrow arrays. It has the following fields:

.. c:member:: ArrowDeviceType device_type

    *Mandatory.* The device type that this stream produces data on. All
    ``ArrowDeviceArray`` s that are produced by this stream should have the
    same device type as is set here. This is a convenience for the consumer
    to not have to check every array that is retrieved and instead allows
    higher-level coding constructs for streams.

.. c:member:: int (*ArrowDeviceArrayStream.get_schema)(struct ArrowDeviceArrayStream*, struct ArrowSchema* out)

    *Mandatory.* This callback allows the consumer to query the schema of
    the chunks of data in the stream. The schema is the same for all data
    chunks.

    This callback must NOT be called on a released ``ArrowDeviceArrayStream``.

    *Return value:* 0 on success, a non-zero
    :ref:`error code <c-stream-interface-error-codes>` otherwise.

.. c:member:: int (*ArrowDeviceArrayStream.get_next)(struct ArrowDeviceArrayStream*, struct ArrowDeviceArray* out)

    *Mandatory.* This callback allows the consumer to get the next chunk of
    data in the stream.

    This callback must NOT be called on a released ``ArrowDeviceArrayStream``.

    The next chunk of data MUST be accessible from a device type matching the
    :c:member:`ArrowDeviceArrayStream.device_type`.

    *Return value:* 0 on success, a non-zero
    :ref:`error code <c-stream-interface-error-codes>` otherwise.

    On success, the consumer must check whether the ``ArrowDeviceArray``'s
    embedded ``ArrowArray`` is marked :ref:`released <c-data-interface-released>`.
    If the embedded ``ArrowDeviceArray.array`` is released, then the end of the
    stream has been reached. Otherwise, the ``ArrowDeviceArray`` contains a
    valid data chunk.

.. c:member:: const char* (*ArrowDeviceArrayStream.get_last_error)(struct ArrowDeviceArrayStream*)

    *Mandatory.* This callback allows the consumer to get a textual description
    of the last error.

    This callback must ONLY be called if the last operation on the
    ``ArrowDeviceArrayStream`` returned an error. It must NOT be called on a
    released ``ArrowDeviceArrayStream``.

    *Return value:* a pointer to a NULL-terminated character string
    (UTF8-encoded). NULL can also be returned if no detailed description is
    available.

    The returned pointer is only guaranteed to be valid until the next call
    of one of the stream's callbacks. The character string it points to should
    be copied to consumer-managed storage if it is intended to survive longer.

.. c:member:: void (*ArrowDeviceArrayStream.release)(struct ArrowDeviceArrayStream*)

    *Mandatory.* A pointer to a producer-provided release callback.

.. c:member:: void* ArrowDeviceArrayStream.private_data

    *Optional.* An opaque pointer to producer-provided private data.

    Consumers MUST NOT process this member. Lifetime of this member is
    handled by the producer, and especially by the release callback.

Result lifetimes
''''''''''''''''

The data returned by the ``get_schema`` and ``get_next`` callbacks must be
released independently. Their lifetimes are not tied to that of
``ArrowDeviceArrayStream``.

Stream lifetime
'''''''''''''''

Lifetime of the C stream is managed using a release callback with similar
usage as in :ref:`C data interface <c-data-interface-released>`.

Thread safety
'''''''''''''

The stream source is not assumed to be thread-safe. Consumers wanting to
call ``get_next`` from several threads should ensure those calls are
serialized.

Async Device Stream Interface
=============================

.. warning::

    Experimental: The Async C Device Stream interface is experimental in its current
    form. Based on feedback and usage the protocol definition may change until
    it is fully standardized.

The :ref:`C stream interface <c-device-stream-interface>` provides a synchronous
API centered around the consumer calling the producer functions to retrieve
the next record batch. For concurrent communication between producer and consumer,
the ``ArrowAsyncDeviceStreamHandler`` can be used. This interface is non-opinionated
and may fit into different concurrent communication models.

Semantics
---------

Rather than the producer providing a structure of callbacks for a consumer to
call and retrieve records, the Async interface is a structure allocated and populated by the consumer.
The consumer allocated struct provides handler callbacks for the producer to call
when the schema and chunks of data are available.

In addition to the ``ArrowAsyncDeviceStreamHandler``, there are also two additional
structs used for the full data flow: ``ArrowAsyncTask`` and ``ArrowAsyncProducer``.

Structure Definition
--------------------

The C device async stream interface consists of three ``struct`` definitions:

.. code-block:: c

    #ifndef ARROW_C_ASYNC_STREAM_INTERFACE
    #define ARROW_C_ASYNC_STREAM_INTERFACE

    struct ArrowAsyncTask {
      int (*extract_data)(struct ArrowArrayTask* self, struct ArrowDeviceArray* out);

      void* private_data;
    };

    struct ArrowAsyncProducer {
      ArrowDeviceType device_type;

      void (*request)(struct ArrowAsyncProducer* self, int64_t n);
      void (*cancel)(struct ArrowAsyncProducer* self);

      void (*release)(struct ArrowAsyncProducer* self);
      const char* additional_metadata;
      void* private_data;
    };

    struct ArrowAsyncDeviceStreamHandler {
      // consumer-specific handlers
      int (*on_schema)(struct ArrowAsyncDeviceStreamHandler* self,
                       struct ArrowSchema* stream_schema);
      int (*on_next_task)(struct ArrowAsyncDeviceStreamHandler* self,
                          struct ArrowAsyncTask* task, const char* metadata);
      void (*on_error)(struct ArrowAsyncDeviceStreamHandler* self,
                       int code, const char* message, const char* metadata);

      // release callback
      void (*release)(struct ArrowAsyncDeviceStreamHandler* self);

      // must be populated before calling any callbacks
      struct ArrowAsyncProducer* producer;

      // opaque handler-specific data
      void* private_data;
    };

    #endif  // ARROW_C_ASYNC_STREAM_INTERFACE

.. note::
    The canonical guard ``ARROW_C_ASYNC_STREAM_INTERFACE`` is meant to avoid
    duplicate definitions if two projects copy the C async stream interface
    definitions into their own headers, and a third-party project includes
    from these two projects. It is therefore important that this guard is kept
    exactly as-is when these definitions are copied.

The ArrowAsyncDeviceStreamHandler structure
'''''''''''''''''''''''''''''''''''''''''''

The structure has the following fields:

.. c:member:: int (*ArrowAsyncDeviceStreamHandler.on_schema)(struct ArrowAsyncDeviceStreamHandler*, struct ArrowSchema*)

    *Mandatory.* Handler for receiving the schema of the stream. All incoming records should
    match the provided schema. If successful, the function should return 0, otherwise
    it should return an ``errno``-compatible error code.

    If there is any extra contextual information that the producer wants to provide, it can set
    :c:member:`ArrowAsyncProducer.additional_metadata` to a non-NULL value. This is encoded in the
    same format as :c:member:`ArrowSchema.metadata`. The lifetime of this metadata, if not ``NULL``,
    should be tied to the lifetime of the ``ArrowAsyncProducer`` object.

    Unless the ``on_error`` handler is called, this will always get called exactly once and will be
    the first method called on this object. As such the producer *MUST* populate the ``ArrowAsyncProducer``
    member before calling this function to allow the consumer to apply back-pressure and control the flow of data.
    The producer maintains ownership of the ``ArrowAsyncProducer`` and must clean it up *after*
    calling the release callback on the ``ArrowAsyncDeviceStreamHandler``.

    A producer that receives a non-zero result here must not subsequently call anything other than
    the release callback on this object.

.. c:member:: int (*ArrowAsyncDeviceStreamHandler.on_next_task)(struct ArrowAsyncDeviceStreamHandler*, struct ArrowAsyncTask*, const char*)

    *Mandatory.* Handler to be called when a new record is available for processing. The
    schema for each record should be the same as the schema that ``on_schema`` was called with.
    If successfully handled, the function should return 0, otherwise it should return an
    ``errno``-compatible error code.

    Rather than passing the record itself it receives an ``ArrowAsyncTask`` instead to facilitate
    better consumer-focused thread control as far as receiving the data. A call to this function
    simply indicates that data is available via the provided task.

    The producer signals the end of the stream by passing ``NULL`` for the ``ArrowAsyncTask``
    pointer instead of a valid address. This task object is only valid during the lifetime of
    this function call. If the consumer wants to use the task beyond the scope of this method, it
    must copy or move its contents to a new ArrowAsyncTask object.

    The ``const char*`` parameter exists for producers to provide any extra contextual information
    they want. This is encoded in the same format as :c:member:`ArrowSchema.metadata`. If not ``NULL``,
    the lifetime is only the scope of the call to this function. A consumer who wants to maintain
    the additional metadata beyond the lifetime of this call *MUST* copy the value themselves.

    A producer *MUST NOT* call this concurrently from multiple threads.

    The :c:member:`ArrowAsyncProducer.request` callback must be called to start receiving calls to this
    handler.

.. c:member:: void (*ArrowAsyncDeviceStreamHandler.on_error)(struct ArrowAsyncDeviceStreamHandler, int, const char*, const char*)

    *Mandatory.* Handler to be called when an error is encountered by the producer. After calling
    this, the ``release`` callback will be called as the last call on this struct. The parameters
    are an ``errno``-compatible error code and an optional error message and metadata.

    If the message and metadata are not ``NULL``, their lifetime is only valid during the scope
    of this call. A consumer who wants to maintain these values past the return of this function
    *MUST* copy the values themselves.

    If the metadata parameter is not ``NULL``, to provide key-value error metadata, then it should
    be encoded identically to the way that metadata is encoded in :c:member:`ArrowSchema.metadata`.

    It is valid for this to be called by a producer with or without a preceding call to
    :c:member:`ArrowAsyncProducer.request`. This callback *MUST NOT* call any methods of an
    ``ArrowAsyncProducer`` object.

.. c:member:: void (*ArrowAsyncDeviceStreamHandler.release)(struct ArrowAsyncDeviceStreamHandler*)

    *Mandatory.* A pointer to a consumer-provided release callback for the handler.

    It is valid for this to be called by a producer with or without a preceding call to
    :c:member:`ArrowAsyncProducer.request`. This must not call any methods of an ``ArrowAsyncProducer``
    object.

.. c:member:: struct ArrowAsyncProducer ArrowAsyncDeviceStreamHandler.producer

    *Mandatory.* The producer object that the consumer will use to request additional data or cancel.

    This object *MUST* be populated by the producer before calling the :c:member:`ArrowAsyncDeviceStreamHandler.on_schema`
    callback. The producer maintains ownership of this object and must clean it up *after* calling
    the release callback on the ``ArrowAsyncDeviceStreamHandler``.

    The consumer *CANNOT* assume that this is valid until the ``on_schema`` callback is called.

.. c:member:: void* ArrowAsyncDeviceStreamHandler.private_data

    *Optional.* An opaque pointer to consumer-provided private data.

    Producers *MUST NOT* process this member. Lifetime of this member is handled by
    the consumer, and especially by the release callback.

The ArrowAsyncTask structure
''''''''''''''''''''''''''''

The purpose of using a Task object rather than passing the array directly to the ``on_next``
callback is to allow for more complex and efficient thread handling. Utilizing a Task
object allows for a producer to separate the "decoding" logic from the I/O, enabling a
consumer to avoid transferring data between CPU cores (e.g. from one L1/L2 cache to another).

This producer-provided structure has the following fields:

.. c:member:: int (*ArrowArrayTask.extract_data)(struct ArrowArrayTask*, struct ArrowDeviceArray*)

  *Mandatory.* A callback to populate the provided ``ArrowDeviceArray`` with the available data.
  The order of ``ArrowAsyncTasks`` provided by the producer enables a consumer to know the order of
  the data to process. If the consumer does not care about the data that is owned by this task,
  it must still call ``extract_data`` so that the producer can perform any required cleanup. ``NULL``
  should be passed as the device array pointer to indicate that the consumer doesn't want the
  actual data, letting the task perform necessary cleanup.

  If a non-zero value is returned from this, it should be followed only by the producer calling
  the ``on_error`` callback of the ``ArrowAsyncDeviceStreamHandler``. Because calling this method
  is likely to be separate from the current control flow, returning a non-zero value to signal
  an error occurring allows the current thread to decide handle the case accordingly, while still
  allowing all error logging and handling to be centralized in the
  :c:member:`ArrowAsyncDeviceStreamHandler.on_error` callback.

  Rather than having a separate release callback, any required cleanup should be performed as part
  of the invocation of this callback. Ownership of the Array is given to the pointer passed in as
  a parameter, and this array must be released separately.

  It is only valid to call this method exactly once.

.. c:member:: void* ArrowArrayTask.private_data

  *Optional.* An opaque pointer to producer-provided private data.

  Consumers *MUST NOT* process this member. Lifetime of this member is handled by
  the producer who created this object, and should be cleaned up if necessary during
  the call to :c:member:`ArrowArrayTask.extract_data`.

The ArrowAsyncProducer structure
''''''''''''''''''''''''''''''''

This producer-provided and managed object has the following fields:

.. c:member:: ArrowDeviceType ArrowAsyncProducer.device_type

  *Mandatory.* The device type that this producer will provide data on. All
  ``ArrowDeviceArray`` structs that are produced by this producer should have the
  same device type as is set here.

.. c:member:: void (*ArrowAsyncProducer.request)(struct ArrowAsyncProducer*, uint64_t)

  *Mandatory.* This function must be called by a consumer to start receiving calls to
  :c:member:`ArrowAsyncDeviceStreamHandler.on_next_task`. It *MUST* be valid to call
  this synchronously from within :c:member:`ArrowAsyncDeviceStreamHandler.on_next_task`
  or :c:member:`ArrowAsyncDeviceStreamHandler.on_schema`. As a result, this function
  *MUST NOT* synchronously call ``on_next_task`` or ``on_error`` to avoid recursive
  and reentrant callbacks.

  After ``cancel`` is called, additional calls to this function must be a NOP, but allowed.

  While not cancelled, calling this function registers the given number of additional
  arrays/batches to be produced by the producer. A producer should only call
  the appropriate ``on_next_task`` callback up to a maximum of the total sum of calls to
  this method before propagating back-pressure / waiting.

  Any error encountered by calling request must be propagated by calling the ``on_error``
  callback of the ``ArrowAsyncDeviceStreamHandler``.

  It is invalid to call this function with a value of ``n`` that is ``<= 0``. Producers should
  error (e.g. call ``on_error``) if receiving such a value for ``n``.

.. c:member:: void (*ArrowAsyncProducer.cancel)(struct ArrowAsyncProducer*)

  *Mandatory.* This function signals to the producer that it must *eventually* stop calling
  ``on_next_task``. Calls to ``cancel`` must be idempotent and thread-safe. After calling
  it once, subsequent calls *MUST* be a NOP. This *MUST NOT* call any consumer-side handlers
  other than ``on_error``.

  It is not required that calling ``cancel`` affect the producer *immediately*, only that it
  must eventually stop calling ``on_next_task`` and then subsequently call ``release``
  on the async handler object. As such, a consumer *MUST* be prepared to receive one or more
  calls to ``on_next_task`` or ``on_error`` even after calling ``cancel`` if there are still
  requested arrays pending.

  Successful cancelling *MUST NOT* result in a producer calling
  :c:member:`ArrowAsyncDeviceStreamHandler.on_error`, instead it should finish out any remaining
  tasks (calling ``on_next_task`` accordingly) and eventually just call ``release``.

  Any error encountered during handling a call to cancel must be reported via the ``on_error``
  callback on the async stream handler.

.. c:member:: const char* ArrowAsyncProducer.additional_metadata

    *Optional.* An additional metadata string to provide any extra context to the consumer. This *MUST*
    either be ``NULL`` or a valid string that is encoded in the same way as :c:member:`ArrowSchema.metadata`.
    As an example, a producer could utilize this metadata to provide the total number of rows and/or batches
    in the stream if known.

    If not ``NULL`` it *MUST* be valid for at least the lifetime of this object.

.. c:member:: void* ArrowAsyncProducer.private_data

  *Optional.* An opaque pointer to producer-provided specific data.

  Consumers *MUST NOT* process this member, the lifetime is owned by the producer
  that constructed this object.

Error Handling
''''''''''''''

Unlike the regular C Stream interface, the Async interface allows for errors to flow in
both directions. As a result, error handling can be slightly more complex. Thus this spec
designates the following rules:

* If the producer encounters an error during processing, it should call the ``on_error``
  callback, and then call ``release`` after it returns.

* If ``on_schema`` or ``on_next_task`` returns a non-zero integer value, the producer *should not*
  call the ``on_error`` callback, but instead should eventually call ``release`` at some point
  before or after any logging or processing of the error code.

Result lifetimes
''''''''''''''''

The ``ArrowSchema`` passed to the ``on_schema`` callback must be released independently,
with the object itself needing to be moved to a consumer owned ``ArrowSchema`` object. The
``ArrowSchema*`` passed as a parameter to the callback *MUST NOT* be stored and kept.

The ``ArrowAsyncTask`` object provided to ``on_next_task`` is owned by the producer and
will be cleaned up during the invocation of calling ``extract_data`` on it. If the consumer
doesn't care about the data, it should pass ``NULL`` instead of a valid ``ArrowDeviceArray*``.

The ``const char*`` error ``message`` and ``metadata`` which are passed to ``on_error``
are only valid within the scope of the ``on_error`` function itself. They must be copied
if it is necessary for them to exist after it returns.

Stream Handler Lifetime
'''''''''''''''''''''''

Lifetime of the async stream handler is managed using a release callback with similar
usage as in :ref:`C data interface <c-data-interface-released>`.

ArrowAsyncProducer Lifetime
'''''''''''''''''''''''''''

The lifetime of the ``ArrowAsyncProducer`` is owned by the producer itself and should
be managed by it. It *MUST* be populated before calling any methods other than ``release``
and *MUST* remain valid at least until just before calling ``release`` on the stream handler object.

Thread safety
'''''''''''''

All handler functions on the ``ArrowAsyncDeviceStreamHandler`` should only be called in a
serialized manner, but are not guaranteed to be called from the same thread every time. A
producer should wait for handler callbacks to return before calling the next handler callback,
and before calling the ``release`` callback.

Back-pressure is managed by the consumer making calls to :c:member:`ArrowAsyncProducer.request`
to indicate how many arrays it is ready to receive.

The ``ArrowAsyncDeviceStreamHandler`` object should be able to handle callbacks as soon as
it is passed to the producer, any initialization should be performed before it is provided.

Possible Sequence Diagram
-------------------------

.. mermaid::

  sequenceDiagram
    Consumer->>+Producer: ArrowAsyncDeviceStreamHandler*
    Producer-->>+Consumer: on_schema(ArrowAsyncProducer*, ArrowSchema*)
    Consumer->>Producer: ArrowAsyncProducer->request(n)

    par
        loop up to n times
            Producer-->>Consumer: on_next_task(ArrowAsyncTask*)
        end
    and for each task
        Consumer-->>Producer: ArrowAsyncTask.extract_data(...)
        Consumer-->>Producer: ArrowAsyncProducer->request(1)
    end

    break Optionally
        Consumer->>-Producer: ArrowAsyncProducer->cancel()
    end

    loop possible remaining
        Producer-->>Consumer: on_next_task(ArrowAsyncTask*)
    end

    Producer->>-Consumer: ArrowAsyncDeviceStreamHandler->release()


Interoperability with other interchange formats
===============================================

Other interchange APIs, such as the `CUDA Array Interface`_, include
members to pass the shape and the data types of the data buffers being
exported. This information is necessary to interpret the raw bytes in the
device data buffers that are being shared. Rather than store the
shape / types of the data alongside the ``ArrowDeviceArray``, users
should utilize the existing ``ArrowSchema`` structure to pass any data
type and shape information.

Updating this specification
===========================

.. note::
    Since this specification is still considered experimental, there is the
    (still very low) possibility it might change slightly. The reason for
    tagging this as "experimental" is because we don't know what we don't know.
    Work and research was done to ensure a generic ABI compatible with many
    different frameworks, but it is always possible something was missed.
    Once this is supported in an official Arrow release and usage is observed
    to confirm there aren't any modifications necessary, the "experimental"
    tag will be removed and the ABI frozen.

Once this specification is supported in an official Arrow release, the C ABI
is frozen. This means that the ``ArrowDeviceArray`` structure definition
should not change in any way -- including adding new members.

Backwards-compatible changes are allowed, for example new macro values for
:c:type:`ArrowDeviceType` or converting the reserved 24 bytes into a
different type/member without changing the size of the structure.

Any incompatible changes should be part of a new specification, for example
``ArrowDeviceArrayV2``.


.. _RAPIDS cuDF library: https://docs.rapids.ai/api/cudf/stable/
.. _CUDA Array Interface: https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html
.. _dlpack: https://dmlc.github.io/dlpack/latest/c_api.html#c-api
.. _CUDA: https://developer.nvidia.com/cuda-toolkit
.. _OpenCL (Open Computing Language): https://www.khronos.org/opencl/
.. _Vulkan: https://www.vulkan.org/
.. _Metal: https://developer.apple.com/metal/
.. _ROCm: https://www.amd.com/en/graphics/servers-solutions-rocm
.. _oneAPI: https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html