File: env_vars.rst

package info (click to toggle)
apache-arrow 23.0.1-1
  • links: PTS
  • area: main
  • in suites: sid
  • size: 76,220 kB
  • sloc: cpp: 654,608; python: 70,522; ruby: 45,964; ansic: 18,742; sh: 7,365; makefile: 669; javascript: 125; xml: 41
file content (217 lines) | stat: -rw-r--r-- 8,903 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

.. _cpp_env_vars:

=====================
Environment Variables
=====================

The following environment variables can be used to affect the behavior of
Arrow C++ at runtime.  Many of these variables are inspected only once per
process (for example, when the Arrow C++ DLL is loaded), so you cannot assume
that changing their value later will have an effect.

.. envvar:: ACERO_ALIGNMENT_HANDLING

   Arrow C++'s Acero module performs computation on streams of data.  This
   computation may involve a form of "type punning" that is technically
   undefined behavior if the underlying array is not properly aligned.  On
   most modern CPUs this is not an issue, but some older CPUs may crash or
   suffer poor performance.  For this reason it is recommended that all
   incoming array buffers are properly aligned, but some data sources
   such as :ref:`Flight <flight-rpc>` may produce unaligned buffers.

   The value of this environment variable controls what will happen when
   Acero detects an unaligned buffer:

   - ``warn``: a warning is emitted
   - ``ignore``: nothing, alignment checking is disabled
   - ``reallocate``: the buffer is reallocated to a properly aligned address
   - ``error``: the operation fails with an error

   The default behavior is ``warn``.  On modern hardware it is usually safe
   to change this to ``ignore``.  Changing to ``reallocate`` is the safest
   option but this will have a significant performance impact as the buffer
   will need to be copied.

.. envvar:: ARROW_DEBUG_MEMORY_POOL

   Enable rudimentary memory checks to guard against buffer overflows.
   The value of this environment variable selects the behavior when a
   buffer overflow is detected:

   - ``abort`` exits the processus with a non-zero return value;
   - ``trap`` issues a platform-specific debugger breakpoint / trap instruction;
   - ``warn`` prints a warning on stderr and continues execution;
   - ``none`` disables memory checks;

   If this variable is not set, or has an empty value, it has the same effect
   as the value ``none`` - memory checks are disabled.

   .. note::
      While this functionality can be useful and has little overhead, it
      is not a replacement for more sophisticated memory checking utilities
      such as `Valgrind <https://valgrind.org/>`_ or
      `Address Sanitizer <https://clang.llvm.org/docs/AddressSanitizer.html>`_.

.. envvar:: ARROW_DEFAULT_MEMORY_POOL

   Override the backend to be used for the default
   :ref:`memory pool <cpp_memory_pool>`. Possible values are among ``jemalloc``,
   ``mimalloc`` and ``system``, depending on which backends were enabled when
   :ref:`building Arrow C++ <building-arrow-cpp>`.

.. envvar:: ARROW_IO_THREADS

   Override the default number of threads for the global IO thread pool.
   The value of this environment variable should be a positive integer.

.. envvar:: ARROW_LIBHDFS_DIR

   The directory containing the C HDFS library (``hdfs.dll`` on Windows,
   ``libhdfs.dylib`` on macOS, ``libhdfs.so`` on other platforms).
   Alternatively, one can set :envvar:`HADOOP_HOME`.

.. envvar:: ARROW_S3_LOG_LEVEL

   Controls the verbosity of logging produced by S3 calls. Defaults to ``FATAL``
   which only produces output in the case of fatal errors. ``DEBUG`` is recommended
   when you're trying to troubleshoot issues.

   Possible values include:

   - ``FATAL`` (the default)
   - ``ERROR``
   - ``WARN``
   - ``INFO``
   - ``DEBUG``
   - ``TRACE``
   - ``OFF``

   .. seealso::

      `Logging - AWS SDK For C++
      <https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/logging.html>`__

.. envvar:: ARROW_S3_THREADS

   The number of threads to configure when creating AWS' I/O event loop.

   Defaults to 1 as recommended by AWS' doc when the # of connections is
   expected to be, at most, in the hundreds.

.. envvar:: ARROW_TRACING_BACKEND

   The backend where to export `OpenTelemetry <https://opentelemetry.io/>`_-based
   execution traces.  Possible values are:

   - ``ostream``: emit textual log messages to stdout;
   - ``otlp_http``: emit OTLP JSON encoded traces to a HTTP server (by default,
     the endpoint URL is "http://localhost:4318/v1/traces");
   - ``arrow_otlp_stdout``: emit JSON traces to stdout;
   - ``arrow_otlp_stderr``: emit JSON traces to stderr.

   If this variable is not set, no traces are exported.

   This environment variable has no effect if Arrow C++ was not built with
   tracing enabled.

   .. seealso::

      `OpenTelemetry configuration for remote endpoints
      <https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md>`__

.. envvar:: ARROW_USER_SIMD_LEVEL

   The maximum SIMD optimization level selectable at runtime.  Useful for
   comparing the performance impact of enabling or disabling respective code
   paths or working around situations where instructions are supported but are
   not performant or cause other issues.

   By default, Arrow C++ detects the capabilities of the current CPU at runtime
   and chooses the best execution paths based on that information.  This
   behavior can be overridden by setting this environment variable to a
   well-defined value.  Supported values are:

   - ``NONE`` disables any runtime-selected SIMD optimization;
   - ``SSE4_2`` enables any SSE2-based optimizations until SSE4.2 (included);
   - ``AVX`` enables any AVX-based optimizations and earlier;
   - ``AVX2`` enables any AVX2-based optimizations and earlier;
   - ``AVX512`` enables any AVX512-based optimizations and earlier.

   This environment variable only has an effect on x86 platforms.  Other
   platforms currently do not implement any form of runtime dispatch.

   .. note::
      In addition to runtime-selected SIMD optimizations dispatch, Arrow C++ can
      also be compiled with SIMD optimizations that cannot be disabled at
      runtime.  For example, by default, SSE4.2 optimizations are enabled on x86
      builds: therefore, with this default setting, Arrow C++ does not work at
      all on a CPU without support for SSE4.2.  This setting can be changed
      using the ``ARROW_SIMD_LEVEL`` CMake variable so as to either raise or
      lower the optimization level.

      Finally, the ``ARROW_RUNTIME_SIMD_LEVEL`` CMake variable sets a
      compile-time upper bound to runtime-selected SIMD optimizations.  This is
      useful in cases where a compiler reports support for an instruction set
      but does not actually support it in full.

.. envvar:: AWS_ENDPOINT_URL

   Endpoint URL used for S3-like storage, for example Minio or s3.scality.
   Alternatively, one can set :envvar:`AWS_ENDPOINT_URL_S3`.

.. envvar:: AWS_ENDPOINT_URL_S3

   Endpoint URL used for S3-like storage, for example Minio or s3.scality.
   This takes precedence over :envvar:`AWS_ENDPOINT_URL` if both variables
   are set.

.. envvar:: GANDIVA_CACHE_SIZE

   The number of entries to keep in the Gandiva JIT compilation cache.
   The cache is in-memory and does not persist across processes.

   The default cache size is 5000.  The value of this environment variable
   should be a positive integer and should not exceed the maximum value
   of int32.  Otherwise the default value is used.

.. envvar:: HADOOP_HOME

   The path to the Hadoop installation.

.. envvar:: JAVA_HOME

   Set the path to the Java Runtime Environment installation. This may be
   required for HDFS support if Java is installed in a non-standard location.

.. envvar:: OMP_NUM_THREADS

   The number of worker threads in the global (process-wide) CPU thread pool.
   If this environment variable is not defined, the available hardware
   concurrency is determined using a platform-specific routine.

.. envvar:: OMP_THREAD_LIMIT

   An upper bound for the number of worker threads in the global
   (process-wide) CPU thread pool.

   For example, if the current machine has 4 hardware threads and
   ``OMP_THREAD_LIMIT`` is 8, the global CPU thread pool will have 4 worker
   threads.  But if ``OMP_THREAD_LIMIT`` is 2, the global CPU thread pool
   will have 2 worker threads.