.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _glossary:

========
Glossary
========

.. glossary::
   :sorted:

   array
   vector
       A *contiguous*, *one-dimensional* sequence of values with known
       length where all values have the same type.  An array consists
       of zero or more :term:`buffers <buffer>`, a non-negative
       length, and a :term:`data type`.  The buffers of an array are
       laid out according to the data type as defined by the columnar
       format.

       Arrays are contiguous in the sense that iterating the values of
       an array will iterate through a single set of buffers, even
       though an array may consist of multiple disjoint buffers, or
       may consist of child arrays that themselves span multiple
       buffers.

       Arrays are one-dimensional in that they are a sequence of
       :term:`slots <slot>` or singular values, even though for some
       data types (like structs or unions), a slot may represent
       multiple values.

       Defined by the :doc:`./Columnar`.

   buffer
       A *contiguous* region of memory with a given length.  Buffers
       are used to store data for arrays.

       Buffers may be in CPU memory, memory-mapped from a file, in
       device (e.g. GPU) memory, etc., though not all Arrow
       implementations support all of these possibilities.

   canonical extension type
       An :term:`extension type` that has been standardized by the
       Arrow community so as to improve interoperability between
       implementations.

       .. seealso::
          :ref:`format_canonical_extensions`.

   child array
   parent array
       In an array of a :term:`nested type`, the parent array
       corresponds to the :term:`parent type` and the child array(s)
       correspond to the :term:`child type(s) <child type>`.  For
       example, a ``List[Int32]``-type parent array has an
       ``Int32``-type child array.

   child type
   parent type
       In a :term:`nested type`, the nested type is the parent type,
       and the child type(s) are its parameters.  For example, in
       ``List[Int32]``, ``List`` is the parent type and ``Int32`` is
       the child type.

   chunked array
       A *discontiguous*, *one-dimensional* sequence of values with
       known length where all values have the same type.  Consists of
       zero or more :term:`arrays <array>`, the "chunks".

       Chunked arrays are discontiguous in the sense that iterating
       the values of a chunked array may require iterating through
       different buffers for different indices.

       Not part of the columnar format; this term is specific to
       certain language implementations of Arrow (primarily C++ and
       its bindings).

       .. seealso:: :term:`record batch`, :term:`table`

   complex type
   nested type
       A :term:`data type` whose structure depends on one or more
       other :term:`child data types <child type>`. For instance,
       ``List`` is a nested type that has one child.

       Two nested types are equal if and only if their child types are
       also equal.

   data type
   type
       A type that a value can take, such as ``Int8`` or
       ``List[Utf8]``. The type of an array determines how its values
       are laid out in memory according to :doc:`./Columnar`.

       .. seealso:: :term:`nested type`, :term:`primitive type`

   dictionary
       An array of values that accompanies a :term:`dictionary-encoded
       <dictionary-encoding>` array.

   dictionary-encoding
       An encoding in which an array stores its values as indices into
       a :term:`dictionary` array instead of storing the values
       directly.

       .. seealso:: :ref:`dictionary-encoded-layout`

   extension type
   storage type
       An extension type is a user-defined :term:`data type` that adds
       semantics to an existing data type.  This allows
       implementations that do not support a particular extension type to
       still handle the underlying data type (the "storage type").

       For example, a UUID extension type can use a 16-byte fixed-size
       binary type as its storage type.

       .. seealso:: :ref:`format_metadata_extension_types`

   field
       A column in a :term:`schema`.  Consists of a field name, a
       :term:`data type`, a flag indicating whether the field is
       nullable or not, and optional key-value metadata.

   IPC format
       A specification for how to serialize Arrow data, so it can be
       sent between processes/machines, or persisted on disk.

       .. seealso:: :term:`IPC file format`,
                    :term:`IPC streaming format`

   IPC file format
   file format
   random-access format
       An extension of the :term:`IPC streaming format` that can be
       used to serialize Arrow data to disk, then read it back with
       random access to individual record batches.

   IPC message
   message
       The IPC representation of a particular in-memory structure,
       like a :term:`record batch` or :term:`schema`.  An IPC message
       is always one of the members of ``MessageHeader`` in the
       `Flatbuffers protocol file
       <https://github.com/apache/arrow/blob/main/format/Message.fbs>`_.

   IPC streaming format
   streaming format
       A protocol for streaming Arrow data or for serializing data to
       a file, consisting of a stream of :term:`IPC messages <IPC
       message>`.

   physical layout
       A specification for how to arrange values in memory.

       .. seealso:: :ref:`format_layout`

   primitive type
       A data type that does not have any child types.

       .. seealso:: :term:`data type`

   record batch
       **In the** :ref:`IPC format <format-ipc>`: the primitive unit
       of data.  A record batch consists of an ordered list of
       :term:`buffers <buffer>` corresponding to a :term:`schema`.

       **In some implementations** (primarily C++ and its bindings): a
       *contiguous*, *two-dimensional* chunk of data.  A record batch
       consists of an ordered collection of :term:`arrays <array>` of
       the same length.

       Like arrays, record batches are contiguous in the sense that
       iterating the rows of a record batch will iterate through a
       single set of buffers.

   schema
       A collection of :term:`fields <field>` with optional metadata
       that determines all the :term:`data types <data type>` of an
       object like a :term:`record batch` or :term:`table`.

   slot
       A single logical value within an array, i.e. a "row".

   table
       A *discontiguous*, *two-dimensional* chunk of data consisting
       of an ordered collection of :term:`chunked arrays <chunked
       array>`.  All chunked arrays have the same length, but may have
       different types.  Different columns may be chunked
       differently.

       Like chunked arrays, tables are discontiguous in the sense that
       iterating the rows of a table may require iterating through
       different buffers for different indices.

       Not part of the columnar format; this term is specific to
       certain language implementations of Arrow (for example C++ and
       its bindings, and Go).

       .. image:: ../cpp/tables-versus-record-batches.svg
          :alt: A graphical representation of an Arrow Table and a
                Record Batch, with structure as described in text above.

       .. seealso:: :term:`chunked array`, :term:`record batch`
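
The relationship between chunked arrays and tables described above can be illustrated with a minimal sketch in plain Python. It is a conceptual model only, not the API of any Arrow implementation: ``ChunkedColumn`` and ``table_rows`` are hypothetical names introduced here, and plain Python lists stand in for contiguous arrays. The sketch shows how a logical row index resolves to a different chunk position in each column when the columns are chunked differently.

```python
from bisect import bisect_right
from itertools import accumulate


class ChunkedColumn:
    """Toy model of a chunked array: an ordered list of contiguous chunks."""

    def __init__(self, chunks):
        self.chunks = [list(c) for c in chunks]
        # offsets[i] is the logical index of the first value in chunk i;
        # the final entry is the total length of the column.
        self.offsets = [0, *accumulate(len(c) for c in self.chunks)]

    def __len__(self):
        return self.offsets[-1]

    def locate(self, index):
        """Map a logical index to (chunk number, position within that chunk)."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        # bisect_right skips over empty chunks that share the same offset.
        chunk_no = bisect_right(self.offsets, index) - 1
        return chunk_no, index - self.offsets[chunk_no]

    def __getitem__(self, index):
        chunk_no, pos = self.locate(index)
        return self.chunks[chunk_no][pos]


def table_rows(columns):
    """Yield rows from a dict mapping column names to ChunkedColumn objects."""
    lengths = {len(col) for col in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns of a table must have the same length")
    n_rows = lengths.pop() if lengths else 0
    for i in range(n_rows):
        # Each column resolves the same logical index independently.
        yield tuple(col[i] for col in columns.values())


# Two columns of equal length (5) but with different chunk boundaries.
ids = ChunkedColumn([[1, 2], [3, 4, 5]])                # chunked as 2 + 3
names = ChunkedColumn([["a"], ["b", "c", "d"], ["e"]])  # chunked as 1 + 3 + 1

rows = list(table_rows({"id": ids, "name": names}))
```

In this sketch, row 2 lives at position 0 of the second chunk of ``ids`` but at position 1 of the second chunk of ``names``, which is the sense in which a table is discontiguous: reading one row may touch different buffers in different columns.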