1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437
|
.. _performance:
Writing Fast Futhark Programs
=============================
This document contains tips, tricks, and hints for writing efficient
Futhark code. Ideally you'd need to know nothing more than an
abstract cost model, but sometimes it is useful to have an idea of how
the compiler will transform your program, what values look like in
memory, and what kind of code the compiler will generate for you.
These details are documented below. Don't be discouraged by the
complexities mentioned here - most Futhark programs are written
without worrying about any of these details, and they still manage to
run with good performance. This document focuses on corner cases and
pitfalls, which easily makes for depressing reading.
Parallelism
-----------
The Futhark compiler only generates parallel code for explicitly
parallel constructs such as ``map`` and ``reduce``. A plain ``loop``
will *not* result in parallel code (unless the loop body itself
contains parallel operations). The most important parallel constructs
are the *second-order array combinators* (SOACs) such as ``map`` and
``reduce``, but functions such as ``copy`` are also parallel.
When describing the asymptotic cost of a Futhark function, it is not
enough to give a traditional big-O measure of the total amount of
work. Both ``foldl`` and ``reduce`` involve *O(n)* work, where *n* is
the size of the input array, but ``foldl`` is sequential while
``reduce`` is parallel, and this is an important distinction. To make
this distinction, each function is described by *two* costs: the
*work*, which is the total amount of operations, and the *span*
(sometimes called *depth*) which is intuitively the "longest chain of
sequential dependencies". We say that ``foldl`` has span *O(n)*,
while ``reduce`` has span *O(log(n))*. This explains that
``reduce`` is more parallel than ``foldl``. The documentation for a
Futhark function should mention both its work and span. `See this
<https://sigkill.dk/writings/par/cost.html>`_ for more details on
parallel cost models and pointers to literature.
Scans and reductions
~~~~~~~~~~~~~~~~~~~~
The ``scan`` and ``reduce`` SOACs are rather inefficient when their
operators are on arrays. If possible, use tuples instead (see
:ref:`performance-small-arrays`). The one exception is when the
operator is a ``map2`` or equivalent. Example:
.. code-block:: futhark
reduce (map2 (+)) (replicate n 0) xss
Such "vectorised" operators are typically handled quite efficiently.
Although to be on the safe side, you can rewrite the above by
interchanging the ``reduce`` and ``map``:
.. code-block:: futhark
map (reduce (+) 0) (transpose xss)
Avoid reductions over tiny arrays, e.g. ``reduce (+) 0 [x,y,z]``. In
such cases the compiler will generate complex code to exploit a
negligible amount of parallelism. Instead, just unroll the loop
manually (``x+y+z``) or perhaps use ``foldl (+) 0 [x,z,y]``, which
produces a sequential loop.
Histograms
~~~~~~~~~~
The ``reduce_by_index`` construct ("generalised histogram") has a
clever and adaptive implementation that handles multiple updates of
the same bin efficiently. Its main weakness is when computing a very
large histogram (many millions of bins) where only a tiny fraction of
the bins are updated. This is because the main mechanism for
optimising conflicts is by duplicating the histogram in memory, but
this is not efficient when it is very large. If you know your program
involves such a situation, it may be better to implement the histogram
operation by sorting and then performing an irregular segmented
reduction.
Particularly with the GPU backends, ``reduce_by_index`` is much faster
when the operator involves a single 32-bit or 64-bit value. Even if
you really want an 8-bit or 16-bit result, it may be faster to compute
it with a 32-bit or 64-bit type and manually mask off the excess bits.
Nested parallelism
~~~~~~~~~~~~~~~~~~
Futhark allows nested parallelism, understood as a parallel construct
used inside some other parallel construct. The simplest example is
nested SOACs. Example:
.. code-block:: futhark
map (\xs -> reduce (+) 0 xs) xss
Nested parallelism is allowed and encouraged, but its compilation to
efficient code is rather complicated, depending on the compiler
backend that is used. The problem is that sometimes exploiting all
levels of parallelism is not optimal, yet how much to exploit depends
on run-time information that is not available to the compiler.
Sequential backends
!!!!!!!!!!!!!!!!!!!
The sequential backends are straightforward: all parallel operations
are compiled into sequential loops. Due to Futhark's low-overhead
data representation (see below), this is often surprisingly efficient.
Multicore backend
!!!!!!!!!!!!!!!!!
Whenever the multicore backend encounters nested parallelism, it
generates two code versions: one where the nested parallel constructs
are also parallelised (possibly recursively involving further nested
parallelism), and one where they are turned into sequential loops. At
runtime, based on the amount of work available and self-tuning
heuristics, the scheduler picks the version that it believes best
balances overhead with exploitation of parallelism.
GPU backends
!!!!!!!!!!!!
The GPU backends handle parallelism through an elaborate program
transformation called *incremental flattening*. The full details are
beyond the scope of this document, but some properties are useful to
know of. `See this paper
<https://futhark-lang.org/publications/ppopp19.pdf>`_ for more
details.
The main restriction is that the GPU backends can only handle
*regular* nested parallelism, meaning that the sizes of inner parallel
dimensions are invariant to the outer parallel dimensions. For
example, this expression contains *irregular* nested parallelism:
.. code-block:: futhark
map (\i -> reduce (+) 0 (iota i)) is
This is because the size of the nested parallel construct is ``i``,
and ``i`` has a different value for every iteration of the outer
``map``. The Futhark compiler will currently turn the irregular
constructs (here, the ``reduce``) into a sequential loop. Depending
on how complicated the irregularity is, it may even refuse to generate
code entirely.
Incremental flattening is based on generating multiple code versions
to cater to different classes of datasets. At run-time, one of these
versions will be picked for execution by comparing properties of the
input (its size) with a *threshold parameter*. These threshold
parameters have sensible defaults, but for optimal performance, they
can be tuned with :ref:`futhark-autotune(1)`.
Value Representation
--------------------
The compiler discards all type abstraction when compiling. Using the
module system to make a type abstract causes no run-time overhead.
Scalars
~~~~~~~
Scalar values (``i32``, ``f64``, ``bool``, etc) are represented as
themselves. The internal representation does not distinguish signs,
so ``i32`` and ``u32`` have the same representation, and converting
between types that differ only in sign is free.
Tuples
~~~~~~
Tuples are flattened and then represented directly by their individual
components - there are no *tuple objects* at runtime. A function that
takes an argument of type ``(f64,f64)`` corresponds to a C function
that takes two arguments of type ``double``. This has one performance
implication: whenever you pass a tuple to a function, the *entire*
tuple is copied (except any embedded arrays, which are always passed
by reference, see below). Due to the compiler's heavy use of
inlining, this is rarely a problem in practice, but it can be a
concern when using the ``loop`` construct with a large tuple as the
loop variant parameter.
Records
~~~~~~~
Records are turned into tuples by simply sorting their fields and
discarding the labels. This means there is no overhead to using a
record compared to using a tuple.
Sum Types
~~~~~~~~~
A sum type value is represented as a tuple containing all the payload
components in order, prefixed with an `i8` tag to identify the
constructor. For example,
.. code-block:: futhark
#foo i32 bool | #bar i32
would be represented as a tuple of type
.. code-block:: futhark
(i8, i32, bool, i32)
where the value
.. code-block:: futhark
#foo 42 false
is represented as
.. code-block:: futhark
(1, 42, false, 0)
where ``#foo`` is assigned the tag ``1`` because it is alphabetically
after ``#bar``.
To shrink the tuples, if multiple constructors have payload elements
of the *same* type, the compiler assigns them to the same elements in
the result tuple. The representation of the above sum type is actually
the following:
.. code-block:: futhark
(i8, i32, bool)
The types must be the *same* for deduplication to take place - despite
`i32` and `f32` being of the same size, they cannot be assigned the
same tuple element. This means that the type
.. code-block:: futhark
#foo [n]i32 | #bar [n]i32
is efficiently represented as
.. code-block:: futhark
(u8, [n]i32)
However the type
.. code-block:: futhark
#foo [n]i32 | #bar [n]f32
is represented as
.. code-block:: futhark
(u8, [n]i32, [n]f32)
which is not great. Take caution when you use sum types with large
arrays in their payloads.
Functions
~~~~~~~~~
Higher-order functions are implemented via defunctionalisation. At
run-time, they are represented by a record containing their lexical
closure. Since the type system forbids putting functions in arrays,
this is essentially a constant cost, and not worth worrying about.
Arrays
~~~~~~
Arrays are the only Futhark values that are boxed - that is, are
stored on the heap.
The elements of an array are unboxed, stored adjacent to each other in
memory. There is zero memory overhead except for the minuscule amount
needed to track the shape of the array.
Multidimensional arrays
!!!!!!!!!!!!!!!!!!!!!!!
At the surface language level, Futhark may appear to support "arrays
of arrays", and this is indeed a convenient aspect of its programming
model, but at runtime multi-dimensional arrays are stored in flattened
form. A value of type ``[x][y]i32`` is laid out in memory simply as
one array containing *x\*y* integers. This means that constructing an
array ``[x,y,x]`` can be (relatively) expensive if ``x``, ``y``, ``z``
are themselves large arrays, as they must be copied in their entirety.
Since arrays cannot contain other arrays, memory management only has
to be concerned with one level of indirection. In practice, it means
that Futhark can use straightforward reference counting to keep track
of when to free the memory backing an array, as circular references
are not possible. Further, since arrays tend to be large and
relatively few in number, the usual performance impact of naive
reference counting is not present.
Arrays of tuples
!!!!!!!!!!!!!!!!
For arrays of tuples, Futhark uses the so-called `structure of arrays
<https://en.wikipedia.org/wiki/AoS_and_SoA>`_ representation. In
Futhark terms, an array ``[n](a,b,c)`` is at runtime represented as
the tuple ``([n]a,[n]b,[n]c)``. This means that the final memory
representation always consists of arrays of scalars.
This has some significant implications. For example, ``zip`` and
``unzip`` are very cheap, as the actual runtime representation is in
always "unzipped", so these functions don't actually have to do
anything.
Since records and sum types are represented as tuples, this also
explains how arrays of these are represented.
Element order
!!!!!!!!!!!!!
The exact in-memory element ordering is up to the compiler, and
depends on how the array is constructed and how it is used. Absent
any other information, Futhark represents multidimensional arrays in
row-major order. However, depending on how the array is traversed,
the compiler may insert code to represent it in some other order. For
particularly tricky programs, an array may even be duplicated in
memory, represented in different ways, to ensure efficient traversal.
This means you should normally *not* worry about how to represent your
arrays to ensure coalesced access on GPUs or similar. That is the
compiler's job.
Crucial Optimisations
---------------------
Some of the optimisations done by the Futhark compiler are important,
complex, or subtle enough that it may be useful to know how they work,
and how to write code that caters to their quirks.
Fusion
~~~~~~
Futhark performs fusion of SOACs. For an expression ``map f (map g
A)``, then the compiler will optimise this into a single ``map`` with
the composition of ``f`` and ``g``, which prevents us from storing an
intermediate array in memory. This is called *vertical fusion* or
*producer-consumer fusion*. In this case the *producer* is ``map g``
and the *consumer* is ``map f``.
Fusion does not depend on the expressions being adjacent as in this
example, as the optimisation is performed on a data dependency graph
representing the program. This means that you can decompose your
programs into many small parallel operations without worrying about
the overhead, as the compiler will fuse them together automatically.
Not all producer-consumer relationships between SOACs can be fused.
Generally, ``map`` can always be fused as a producer, but ``scan``,
``reduce``, and similar SOACs can only act as consumers.
*Horizontal fusion* occurs when two SOACs take as input the same
array, but are not themselves in a producer-consumer relationship.
Example:
.. code-block:: futhark
(map f xs, map g xs)
Such cases are fused into a single operation that traverses ``xs``
just once. More than two SOACs can be involved in horizontal fusion,
and they need not be of the same kind (e.g. one could be a ``map`` and
the other a ``reduce``).
Free Operations
---------------
Some operations such as array slicing, ``take``, ``drop``,
``transpose`` and ``reverse`` are "free" in the sense that they merely
return a different view of some underlying array. In most cases they
have constant cost, no matter the size of the array they operate on.
This is because they are index space transformations that simply
result in different code being generated when the arrays are
eventually used.
However, there are some cases where the compiler is forced to manifest
such a "view" as an actual array in memory, which involves a full
copy. An incomplete list follows:
* Any array returned by an entry point is converted to row-major
order.
* An array returned by an ``if`` branch may be copied if its
representation is substantially different from that of the other
branch.
* An array returned by a ``loop`` body may be copied if its
representation is substantially different from that of the initial
loop values.
* An array is copied whenever it becomes the element of another
multidimensional array. This is most obviously the case for array
literals (``[x,y,z]``), but also for ``map`` expressions where the
mapped function returns an array.
Note that this notion of "views" is not part of the Futhark type
system - it is merely an implementation detail. Strictly speaking,
all functions that return an array (e.g. ``reverse``) should be
considered to have a cost proportional to the size of the array, even
if that cost will almost never actually be paid at run-time. If you
want to be sure no copy takes place, it may be better to explicitly
maintain tuples of indexes into some other array.
.. _performance-small-arrays:
Small Arrays
------------
The compiler assumes arrays are "large", which for example means that
operations across them are worth parallelising. It also means they
are boxed and heap-allocated, even when the size is a small constant.
This can cause unexpectedly bad performance when using small
constant-size arrays (say, five elements or less). Consider using
tuples or records instead. `This post
<https://futhark-lang.org/blog/2019-01-13-giving-programmers-what-they-want.html>`_
contains more information on how and why. If in doubt, try both and
measure which is faster.
Inlining
--------
The compiler currently inlines all functions at their call site,
unless they have been marked with the ``noinline`` attribute (see
:ref:`attributes`). This can lead to code explosion, which mostly
results in slow compile times, but can also affect run-time
performance. In many cases this is currently unavoidable, but
sometimes the program can be rewritten such that instead of calling
the same function in multiple places, it is called in a single place,
in a loop. E.g. we might rewrite ``f x (f y (f z v))`` as:
.. code-block:: futhark
loop acc = v for a in [z,y,x] do
f a acc
|