File: kernel.rst

CUDA Kernel API
===============

.. cuda-deprecated::

Kernel declaration
------------------

The ``@cuda.jit`` decorator is used to create a CUDA dispatcher object that can
be configured and launched:

.. autofunction:: numba.cuda.jit


Dispatcher objects
------------------

The usual syntax for configuring a Dispatcher with a launch configuration uses
subscripting, with the following arguments:

.. code-block:: python

   # func is some function decorated with @cuda.jit
   func[griddim, blockdim, stream, sharedmem]


The ``griddim`` and ``blockdim`` arguments specify the size of the grid and
thread blocks, and may be either integers or tuples of length up to 3. The
``stream`` parameter is an optional stream on which the kernel will be launched,
and the ``sharedmem`` parameter specifies the size of dynamic shared memory in
bytes.

Subscripting the Dispatcher returns a configuration object that can be called
with the kernel arguments:

.. code-block:: python

   configured = func[griddim, blockdim, stream, sharedmem]
   configured(x, y, z)


However, it is more idiomatic to configure and call the kernel within a single
statement:

.. code-block:: python

   func[griddim, blockdim, stream, sharedmem](x, y, z)

This is similar to launch configuration in CUDA C/C++:

.. code-block:: cuda

   func<<<griddim, blockdim, sharedmem, stream>>>(x, y, z)

.. note:: The order of ``stream`` and ``sharedmem`` is reversed in Numba
   compared to CUDA C/C++.
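
As a minimal sketch (assuming a 1D NumPy array on the host, which Numba
transfers to and from the device automatically), declaring and launching a
kernel looks like:

.. code-block:: python

   from numba import cuda
   import numpy as np

   @cuda.jit
   def increment(x):
       i = cuda.grid(1)
       if i < x.size:  # guard threads beyond the end of the array
           x[i] += 1

   arr = np.arange(1024, dtype=np.float64)
   threads_per_block = 128
   blocks = (arr.size + threads_per_block - 1) // threads_per_block
   # stream and sharedmem are optional and default to 0
   increment[blocks, threads_per_block](arr)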

Dispatcher objects also provide several utility methods for inspection and
creating a specialized instance:

.. autoclass:: numba.cuda.dispatcher.CUDADispatcher
   :members: inspect_asm, inspect_llvm, inspect_sass, inspect_types,
             get_regs_per_thread, specialize, specialized, extensions, forall,
             get_shared_mem_per_block, get_max_threads_per_block,
             get_const_mem_size, get_local_mem_per_thread



Intrinsic Attributes and Functions
----------------------------------

The remainder of the attributes and functions in this section may only be
called from within a CUDA kernel.

Thread Indexing
~~~~~~~~~~~~~~~

.. attribute:: numba.cuda.threadIdx

    The thread indices in the current thread block, accessed through the
    attributes ``x``, ``y``, and ``z``. Each index is an integer spanning the
    range from 0 inclusive to the corresponding value of the attribute in
    :attr:`numba.cuda.blockDim` exclusive.

.. attribute:: numba.cuda.blockIdx

    The block indices in the grid of thread blocks, accessed through the
    attributes ``x``, ``y``, and ``z``. Each index is an integer spanning the
    range from 0 inclusive to the corresponding value of the attribute in
    :attr:`numba.cuda.gridDim` exclusive.

.. attribute:: numba.cuda.blockDim

    The shape of a block of threads, as declared when instantiating the
    kernel.  This value is the same for all threads in a given kernel, even
    if they belong to different blocks (i.e. each block is "full").

.. attribute:: numba.cuda.gridDim

    The shape of the grid of blocks, accessed through the attributes ``x``,
    ``y``, and ``z``.

.. attribute:: numba.cuda.laneid

    The thread index in the current warp, as an integer spanning the range
    from 0 inclusive to :attr:`numba.cuda.warpsize` exclusive.

.. attribute:: numba.cuda.warpsize

    The size in threads of a warp on the GPU. Currently this is always 32.

.. function:: numba.cuda.grid(ndim)

   Return the absolute position of the current thread in the entire
   grid of blocks.  *ndim* should correspond to the number of dimensions
   declared when instantiating the kernel.  If *ndim* is 1, a single integer
   is returned.  If *ndim* is 2 or 3, a tuple of the given number of
   integers is returned.

   Computation of the first integer is as follows::

      cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x

   and is similar for the other two indices, but using the ``y`` and ``z``
   attributes.

.. function:: numba.cuda.gridsize(ndim)

   Return the absolute size (or shape) in threads of the entire grid of
   blocks. *ndim* should correspond to the number of dimensions declared when
   instantiating the kernel.

   Computation of the first integer is as follows::

       cuda.blockDim.x * cuda.gridDim.x

   and is similar for the other two indices, but using the ``y`` and ``z``
   attributes.
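
As a sketch, ``cuda.grid()`` and ``cuda.gridsize()`` are often combined in a
grid-stride loop, so that a fixed launch configuration can cover an array of
any size:

.. code-block:: python

   from numba import cuda

   @cuda.jit
   def scale(x, factor):
       start = cuda.grid(1)
       stride = cuda.gridsize(1)
       # Each thread handles elements start, start + stride, ...
       for i in range(start, x.size, stride):
           x[i] *= factor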

Memory Management
~~~~~~~~~~~~~~~~~

.. function:: numba.cuda.shared.array(shape, dtype)

   Creates an array in the shared memory space of the CUDA kernel with
   the given ``shape`` and ``dtype``.

   Returns an array with its content uninitialized.

   .. note:: All threads in the same thread block see the same array.

.. function:: numba.cuda.local.array(shape, dtype)

   Creates an array in the local memory space of the CUDA kernel with the
   given ``shape`` and ``dtype``.

   Returns an array with its content uninitialized.

   .. note:: Each thread sees a unique array.

.. function:: numba.cuda.const.array_like(ary)

   Copies ``ary`` into constant memory space on the CUDA device at compile
   time.

   Returns an array like the ``ary`` argument.

   .. note:: All threads and blocks see the same array.
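
As a minimal sketch, a shared memory array can stage a block's worth of data
that all threads in the block then read (the shape must be a compile-time
constant, and ``cuda.syncthreads()``, described below, separates the write and
read phases):

.. code-block:: python

   from numba import cuda, float32

   @cuda.jit
   def reverse_within_block(x):
       # Assumes a launch with 128 threads per block and x.size a
       # multiple of 128.
       tile = cuda.shared.array(shape=128, dtype=float32)
       t = cuda.threadIdx.x
       i = cuda.grid(1)
       tile[t] = x[i]
       cuda.syncthreads()  # wait until the whole tile is populated
       x[i] = tile[cuda.blockDim.x - 1 - t]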

Synchronization and Atomic Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. function:: numba.cuda.atomic.add(array, idx, value)

    Perform ``array[idx] += value``. Supports int32, int64, float32 and
    float64 only. The ``idx`` argument can be an integer or a tuple of integer
    indices for indexing into multi-dimensional arrays. The number of elements
    in ``idx`` must match the number of dimensions of ``array``.

    Returns the value of ``array[idx]`` before storing the new value.
    Behaves like an atomic load.

.. function:: numba.cuda.atomic.sub(array, idx, value)

    Perform ``array[idx] -= value``. Supports int32, int64, float32 and
    float64 only. The ``idx`` argument can be an integer or a tuple of integer
    indices for indexing into multi-dimensional arrays. The number of elements
    in ``idx`` must match the number of dimensions of ``array``.

    Returns the value of ``array[idx]`` before storing the new value.
    Behaves like an atomic load.

.. function:: numba.cuda.atomic.and_(array, idx, value)

    Perform ``array[idx] &= value``. Supports int32, uint32, int64,
    and uint64 only. The ``idx`` argument can be an integer or a tuple of
    integer indices for indexing into multi-dimensional arrays. The number
    of elements in ``idx`` must match the number of dimensions of ``array``.

    Returns the value of ``array[idx]`` before storing the new value.
    Behaves like an atomic load.

.. function:: numba.cuda.atomic.or_(array, idx, value)

    Perform ``array[idx] |= value``. Supports int32, uint32, int64,
    and uint64 only. The ``idx`` argument can be an integer or a tuple of
    integer indices for indexing into multi-dimensional arrays. The number
    of elements in ``idx`` must match the number of dimensions of ``array``.

    Returns the value of ``array[idx]`` before storing the new value.
    Behaves like an atomic load.

.. function:: numba.cuda.atomic.xor(array, idx, value)

    Perform ``array[idx] ^= value``. Supports int32, uint32, int64,
    and uint64 only. The ``idx`` argument can be an integer or a tuple of
    integer indices for indexing into multi-dimensional arrays. The number
    of elements in ``idx`` must match the number of dimensions of ``array``.

    Returns the value of ``array[idx]`` before storing the new value.
    Behaves like an atomic load.

.. function:: numba.cuda.atomic.exch(array, idx, value)

    Perform ``array[idx] = value``. Supports int32, uint32, int64,
    and uint64 only. The ``idx`` argument can be an integer or a tuple of
    integer indices for indexing into multi-dimensional arrays. The number
    of elements in ``idx`` must match the number of dimensions of ``array``.

    Returns the value of ``array[idx]`` before storing the new value.
    Behaves like an atomic load.

.. function:: numba.cuda.atomic.inc(array, idx, value)

    Perform ``array[idx] = (0 if array[idx] >= value else array[idx] + 1)``.
    Supports uint32, and uint64 only. The ``idx`` argument can be an integer
    or a tuple of integer indices for indexing into multi-dimensional arrays.
    The number of elements in ``idx`` must match the number of dimensions of
    ``array``.

    Returns the value of ``array[idx]`` before storing the new value.
    Behaves like an atomic load.

.. function:: numba.cuda.atomic.dec(array, idx, value)

    Perform ``array[idx] =
    (value if (array[idx] == 0) or (array[idx] > value) else array[idx] - 1)``.
    Supports uint32, and uint64 only. The ``idx`` argument can be an integer
    or a tuple of integer indices for indexing into multi-dimensional arrays.
    The number of elements in ``idx`` must match the number of dimensions of
    ``array``.

    Returns the value of ``array[idx]`` before storing the new value.
    Behaves like an atomic load.

.. function:: numba.cuda.atomic.max(array, idx, value)

    Perform ``array[idx] = max(array[idx], value)``. Supports int32, int64,
    float32 and float64 only. The ``idx`` argument can be an integer or a
    tuple of integer indices for indexing into multi-dimensional arrays.
    The number of elements in ``idx`` must match the number of dimensions of
    ``array``.

    Returns the value of ``array[idx]`` before storing the new value.
    Behaves like an atomic load.

.. function:: numba.cuda.atomic.cas(array, idx, old, value)

    Perform ``if array[idx] == old: array[idx] = value``. Supports int32,
    int64, uint32 and uint64 only. The ``idx`` argument can be an integer
    or a tuple of integer indices for indexing into multi-dimensional arrays.
    The number of elements in ``idx`` must match the number of dimensions of
    ``array``.

    Returns the value of ``array[idx]`` before storing the new value.
    Behaves like an atomic compare and swap.
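
As a sketch of typical usage, ``atomic.add`` makes concurrent updates to a
shared counter safe, for example when building a histogram (the mapping of
values to bin indices here is a placeholder assumption):

.. code-block:: python

   from numba import cuda

   @cuda.jit
   def histogram(data, bins):
       i = cuda.grid(1)
       if i < data.size:
           b = int(data[i])  # assume values already map to bin indices
           if 0 <= b < bins.size:
               cuda.atomic.add(bins, b, 1)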


.. function:: numba.cuda.syncthreads

    Synchronize all threads in the same thread block.  This function implements
    the same pattern as barriers in traditional multi-threaded programming: this
    function waits until all threads in the block call it, at which point it
    returns control to all its callers.

.. function:: numba.cuda.syncthreads_count(predicate)

    An extension to :attr:`numba.cuda.syncthreads` where the return value is a count
    of the threads where ``predicate`` is true.

.. function:: numba.cuda.syncthreads_and(predicate)

    An extension to :attr:`numba.cuda.syncthreads` where 1 is returned if ``predicate`` is
    true for all threads or 0 otherwise.

.. function:: numba.cuda.syncthreads_or(predicate)

    An extension to :attr:`numba.cuda.syncthreads` where 1 is returned if ``predicate`` is
    true for any thread or 0 otherwise.

    .. warning:: All syncthreads functions must be called by every thread in the
                 thread-block. Failing to do so may result in undefined behavior.
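
As a sketch, ``syncthreads_count`` can tally, per block, how many elements
satisfy a predicate:

.. code-block:: python

   from numba import cuda

   @cuda.jit
   def count_positive(x, counts):
       i = cuda.grid(1)
       pred = i < x.size and x[i] > 0.0
       n = cuda.syncthreads_count(pred)  # every thread receives the count
       if cuda.threadIdx.x == 0:
           counts[cuda.blockIdx.x] = n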


Cooperative Groups
~~~~~~~~~~~~~~~~~~

.. function:: numba.cuda.cg.this_grid()

   Get the current grid group.

   :return: The current grid group
   :rtype: numba.cuda.cg.GridGroup

.. class:: numba.cuda.cg.GridGroup

   A grid group. Users should not construct a GridGroup directly - instead, get
   the current grid group using :func:`cg.this_grid() <numba.cuda.cg.this_grid>`.

   .. method:: sync()

      Synchronize the current grid group.
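
As a sketch, a grid group lets a kernel place a whole-grid synchronization
point between two phases (cooperative launches require device and toolkit
support, and the grid must be sized to fit on the device):

.. code-block:: python

   from numba import cuda

   @cuda.jit
   def two_phase(x):
       g = cuda.cg.this_grid()
       i = cuda.grid(1)
       if i < x.size:
           x[i] += 1      # phase 1
       g.sync()           # wait for every thread in the entire grid
       if i < x.size:
           x[i] *= 2      # phase 2 observes all phase-1 writes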


Memory Fences
~~~~~~~~~~~~~

Memory fences are used to guarantee that the effects of memory operations are
visible to other threads within the same thread-block, the same GPU device,
or the same system (across GPUs on global memory). Memory loads and stores
are guaranteed not to be moved across a memory fence by optimization passes.

.. warning:: The memory fences are considered to be an advanced API and most
             use cases should use the thread barrier (e.g. ``syncthreads()``).



.. function:: numba.cuda.threadfence

   A memory fence at device level (within the GPU).

.. function:: numba.cuda.threadfence_block

   A memory fence at thread block level.

.. function:: numba.cuda.threadfence_system

   A memory fence at system level (across GPUs).

Warp Intrinsics
~~~~~~~~~~~~~~~

The argument ``membermask`` is a 32-bit integer mask with each bit
corresponding to a thread in the warp; a set bit means the thread is part of
the subset of threads participating in the function call. The ``membermask``
must have all bits set if the GPU compute capability is below 7.x.

.. function:: numba.cuda.syncwarp(membermask)

   Synchronize a masked subset of the threads in a warp.

.. function:: numba.cuda.all_sync(membermask, predicate)

    If the ``predicate`` is true for all threads in the masked warp, then
    a non-zero value is returned, otherwise 0 is returned.

.. function:: numba.cuda.any_sync(membermask, predicate)

    If the ``predicate`` is true for any thread in the masked warp, then
    a non-zero value is returned, otherwise 0 is returned.

.. function:: numba.cuda.eq_sync(membermask, predicate)

    If the boolean ``predicate`` is the same for all threads in the masked warp,
    then a non-zero value is returned, otherwise 0 is returned.

.. function:: numba.cuda.ballot_sync(membermask, predicate)

    Returns a mask of all threads in the warp whose ``predicate`` is true,
    and are within the given mask.
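
As a sketch of the vote intrinsics (assuming a full warp mask and a
single-warp launch, ``vote_demo[1, 32]`` with ``x.size == 32``):

.. code-block:: python

   from numba import cuda

   @cuda.jit
   def vote_demo(x, out):
       pred = x[cuda.grid(1)] > 0.0
       all_p = cuda.all_sync(0xffffffff, pred)
       any_p = cuda.any_sync(0xffffffff, pred)
       mask = cuda.ballot_sync(0xffffffff, pred)
       if cuda.laneid == 0:
           out[0] = all_p  # nonzero if every lane's predicate is true
           out[1] = any_p  # nonzero if any lane's predicate is true
           out[2] = mask   # one bit set per lane whose predicate is true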

.. function:: numba.cuda.shfl_sync(membermask, value, src_lane)

    Shuffles ``value`` across the masked warp and returns the ``value``
    from ``src_lane``. If ``src_lane`` is outside the warp, then the
    given ``value`` is returned.

.. function:: numba.cuda.shfl_up_sync(membermask, value, delta)

    Shuffles ``value`` across the masked warp and returns the ``value``
    from ``laneid - delta``. If ``laneid - delta`` is outside the warp, then
    the given ``value`` is returned.

.. function:: numba.cuda.shfl_down_sync(membermask, value, delta)

    Shuffles ``value`` across the masked warp and returns the ``value``
    from ``laneid + delta``. If ``laneid + delta`` is outside the warp, then
    the given ``value`` is returned.

.. function:: numba.cuda.shfl_xor_sync(membermask, value, lane_mask)

    Shuffles ``value`` across the masked warp and returns the ``value``
    from ``laneid ^ lane_mask``.
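
As an illustrative sketch, the shuffle intrinsics enable a classic warp-level
butterfly reduction with no shared memory (a full warp mask is assumed, and
``x`` is assumed to be a float64 array):

.. code-block:: python

   from numba import cuda

   @cuda.jit
   def warp_sum(x, out):
       i = cuda.grid(1)
       v = x[i] if i < x.size else 0.0
       # Halving butterfly across the 32 lanes of the warp.
       offset = 16
       while offset > 0:
           v += cuda.shfl_xor_sync(0xffffffff, v, offset)
           offset //= 2
       if cuda.laneid == 0:
           cuda.atomic.add(out, 0, v)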

.. function:: numba.cuda.match_any_sync(membermask, value)

    Returns a mask of the threads in the masked warp whose ``value`` is the
    same as the given ``value``.

.. function:: numba.cuda.match_all_sync(membermask, value)

    Returns a tuple of ``(mask, pred)``, where ``mask`` is a mask of the
    threads in the masked warp if they all have the same ``value``, and 0
    otherwise; ``pred`` is a boolean indicating whether all threads in the
    masked warp have the same ``value``.

.. function:: numba.cuda.activemask()

    Returns a 32-bit integer mask of all currently active threads in the
    calling warp. The Nth bit is set if the Nth lane in the warp is active when
    activemask() is called. Inactive threads are represented by 0 bits in the
    returned mask. Threads which have exited the kernel are always marked as
    inactive.

.. function:: numba.cuda.lanemask_lt()

    Returns a 32-bit integer mask of all lanes (including inactive ones) with
    ID less than the current lane.


Integer Intrinsics
~~~~~~~~~~~~~~~~~~

A subset of the CUDA Math API's integer intrinsics is available. For further
documentation, including semantics, please refer to the `CUDA Toolkit
documentation
<https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html>`_.


.. function:: numba.cuda.popc(x)

   Returns the number of bits set in ``x``.

.. function:: numba.cuda.brev(x)

   Returns the reverse of the bit pattern of ``x``. For example, ``0b10110110``
   becomes ``0b01101101``.

.. function:: numba.cuda.clz(x)

   Returns the number of leading zeros in ``x``.

.. function:: numba.cuda.ffs(x)

   Returns the position of the first (least significant) bit set to 1 in ``x``,
   where the least significant bit position is 1. ``ffs(0)`` returns 0.
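
A small sketch exercising these intrinsics (the input array is assumed to hold
unsigned integers, and the output to be a two-dimensional integer array with
three columns):

.. code-block:: python

   from numba import cuda

   @cuda.jit
   def bit_stats(x, out):
       i = cuda.grid(1)
       if i < x.size:
           out[i, 0] = cuda.popc(x[i])  # number of set bits
           out[i, 1] = cuda.clz(x[i])   # leading zero count
           out[i, 2] = cuda.ffs(x[i])   # 1-based position of lowest set bit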


Floating Point Intrinsics
~~~~~~~~~~~~~~~~~~~~~~~~~

A subset of the CUDA Math API's floating point intrinsics is available. For further
documentation, including semantics, please refer to the `single
<https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__SINGLE.html>`_ and
`double <https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__DOUBLE.html>`_
precision parts of the CUDA Toolkit documentation.


.. function:: numba.cuda.fma

   Perform the fused multiply-add operation. Named after the ``fma`` and
   ``fmaf`` functions in the C API; it maps to the ``fma.rn.f32`` and
   ``fma.rn.f64`` (round-to-nearest-even) PTX instructions.

.. function:: numba.cuda.cbrt (x)

   Perform the cube root operation, x ** (1/3). Named after the functions
   ``cbrt`` and ``cbrtf`` in the C API. Supports float32 and float64
   arguments only.
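
As a sketch combining the two intrinsics (``x`` and ``out`` are assumed to be
float64 arrays):

.. code-block:: python

   from numba import cuda

   @cuda.jit
   def poly_root(x, out):
       i = cuda.grid(1)
       if i < x.size:
           # x*x + x via a single fused multiply-add, then its cube root.
           out[i] = cuda.cbrt(cuda.fma(x[i], x[i], x[i]))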

16-bit Floating Point Intrinsics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The functions in the ``cuda.fp16`` module are used to operate on 16-bit
floating point operands. These functions return a 16-bit floating point result.

To determine whether Numba supports compiling code that uses the ``float16``
type in the current configuration, use:

.. function:: numba.cuda.is_float16_supported ()

   Return ``True`` if 16-bit floats are supported, ``False`` otherwise.

To check whether a device supports ``float16``, use its
:attr:`supports_float16 <numba.cuda.cudadrv.driver.Device.supports_float16>`
attribute.

.. function:: numba.cuda.fp16.hfma (a, b, c)

   Perform the fused multiply-add operation ``(a * b) + c`` on 16-bit
   floating point arguments in round to nearest mode. Maps to the ``fma.rn.f16``
   PTX instruction.

   Returns the 16-bit floating point result of the fused multiply-add.

.. function:: numba.cuda.fp16.hadd (a, b)

   Perform the add operation ``a + b`` on 16-bit floating point arguments in
   round to nearest mode. Maps to the ``add.f16`` PTX instruction.

   Returns the 16-bit floating point result of the addition.

.. function:: numba.cuda.fp16.hsub (a, b)

   Perform the subtract operation ``a - b`` on 16-bit floating point arguments in
   round to nearest mode. Maps to the ``sub.f16`` PTX instruction.

   Returns the 16-bit floating point result of the subtraction.

.. function:: numba.cuda.fp16.hmul (a, b)

   Perform the multiply operation ``a * b`` on 16-bit floating point arguments in
   round to nearest mode. Maps to the ``mul.f16`` PTX instruction.

   Returns the 16-bit floating point result of the multiplication.

.. function:: numba.cuda.fp16.hdiv (a, b)

   Perform the divide operation ``a / b`` on 16-bit floating point arguments in
   round to nearest mode.

   Returns the 16-bit floating point result of the division.

.. function:: numba.cuda.fp16.hneg (a)

   Perform the negation operation ``-a`` on the 16-bit floating point argument.
   Maps to the ``neg.f16`` PTX instruction.

   Returns the 16-bit floating point result of the negation.

.. function:: numba.cuda.fp16.habs (a)

   Perform the absolute value operation ``|a|`` on the 16-bit floating point argument.

   Returns the 16-bit floating point result of the absolute value operation.

.. function:: numba.cuda.fp16.hsin (a)

   Calculates the trigonometric sine of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the sine operation.

.. function:: numba.cuda.fp16.hcos (a)

   Calculates the trigonometric cosine of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the cosine operation.

.. function:: numba.cuda.fp16.hlog (a)

   Calculates the natural logarithm of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the natural log operation.

.. function:: numba.cuda.fp16.hlog10 (a)

   Calculates the base 10 logarithm of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the log base 10 operation.

.. function:: numba.cuda.fp16.hlog2 (a)

   Calculates the base 2 logarithm of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the log base 2 operation.

.. function:: numba.cuda.fp16.hexp (a)

   Calculates the natural exponential operation of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the exponential operation.

.. function:: numba.cuda.fp16.hexp10 (a)

   Calculates the base 10 exponential of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the exponential operation.

.. function:: numba.cuda.fp16.hexp2 (a)

   Calculates the base 2 exponential of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the exponential operation.

.. function:: numba.cuda.fp16.hfloor (a)

   Calculates the floor operation, the largest integer less than or equal to ``a``,
   on the 16-bit floating point argument.

   Returns the 16-bit floating point result of the floor operation.

.. function:: numba.cuda.fp16.hceil (a)

   Calculates the ceiling operation, the smallest integer greater than or equal to ``a``,
   on the 16-bit floating point argument.

   Returns the 16-bit floating point result of the ceil operation.

.. function:: numba.cuda.fp16.hsqrt (a)

   Calculates the square root operation of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the square root operation.

.. function:: numba.cuda.fp16.hrsqrt (a)

   Calculates the reciprocal of the square root of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the reciprocal square root operation.

.. function:: numba.cuda.fp16.hrcp (a)

   Calculates the reciprocal of the 16-bit floating point argument.

   Returns the 16-bit floating point result of the reciprocal.

.. function:: numba.cuda.fp16.hrint (a)

   Round the input 16-bit floating point argument to the nearest integer value.

   Returns the 16-bit floating point result of the rounding.

.. function:: numba.cuda.fp16.htrunc (a)

   Truncate the input 16-bit floating point argument to the nearest integer
   that does not exceed the input argument in magnitude.

   Returns the 16-bit floating point result of the truncation.

.. function:: numba.cuda.fp16.heq (a, b)

   Perform the comparison operation ``a == b`` on 16-bit floating point arguments.

   Returns a boolean.

.. function:: numba.cuda.fp16.hne (a, b)

   Perform the comparison operation ``a != b`` on 16-bit floating point arguments.

   Returns a boolean.

.. function:: numba.cuda.fp16.hgt (a, b)

   Perform the comparison operation ``a > b`` on 16-bit floating point arguments.

   Returns a boolean.

.. function:: numba.cuda.fp16.hge (a, b)

   Perform the comparison operation ``a >= b`` on 16-bit floating point arguments.

   Returns a boolean.

.. function:: numba.cuda.fp16.hlt (a, b)

   Perform the comparison operation ``a < b`` on 16-bit floating point arguments.

   Returns a boolean.

.. function:: numba.cuda.fp16.hle (a, b)

   Perform the comparison operation ``a <= b`` on 16-bit floating point arguments.

   Returns a boolean.

.. function:: numba.cuda.fp16.hmax (a, b)

   Perform the operation ``a if a > b else b``.

   Returns a 16-bit floating point value.

.. function:: numba.cuda.fp16.hmin (a, b)

   Perform the operation ``a if a < b else b``.

   Returns a 16-bit floating point value.
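
As a sketch, several ``fp16`` operations composed in one kernel (all arrays
are assumed to have dtype ``float16``, and ``a`` to be a float16 scalar):

.. code-block:: python

   from numba import cuda

   @cuda.jit
   def fma_abs(a, x, y, out):
       i = cuda.grid(1)
       if i < x.size:
           v = cuda.fp16.hfma(a, x[i], y[i])  # (a * x[i]) + y[i] in fp16
           out[i] = cuda.fp16.habs(v)         # absolute value, still fp16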

Control Flow Instructions
~~~~~~~~~~~~~~~~~~~~~~~~~

A subset of CUDA's control flow instructions is directly available as
intrinsics. Avoiding branches is a key way to improve CUDA performance, and
using these intrinsics means you don't have to rely on the ``nvcc`` optimizer
identifying and removing branches. For further documentation, including
semantics, please refer to the `relevant CUDA Toolkit documentation
<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#comparison-and-selection-instructions>`_.


.. function:: numba.cuda.selp

    Select between two expressions, depending on the value of the first
    argument. Similar to LLVM's ``select`` instruction.
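
As a sketch of branch-free selection (the first argument is the condition, and
one of the remaining two values is returned):

.. code-block:: python

   from numba import cuda

   @cuda.jit
   def clamp_negatives(x, out):
       i = cuda.grid(1)
       if i < x.size:
           # Selects 0.0 when x[i] < 0, else x[i], without a branch.
           out[i] = cuda.selp(x[i] < 0.0, 0.0, x[i])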


Timer Intrinsics
~~~~~~~~~~~~~~~~

.. function:: numba.cuda.nanosleep(ns)

    Suspends the thread for a sleep duration of approximately ``ns``
    nanoseconds.