.. meta::
  :description: rocFFT documentation and API reference library
  :keywords: rocFFT, ROCm, API, documentation

.. _runtime_compilation:

********************************************************************
Runtime Compilation Design Document for rocFFT
********************************************************************

Summary
=======

This document describes runtime compilation (RTC) as it is used in
rocFFT.  Runtime compilation helps reduce the library's binary size
and build times, and enables optimizations that are not practical
with ahead-of-time compiled kernels.

Problem
=======

Stockham FFT kernels make up the vast majority of the rocFFT library.
Kernels handling specific problem sizes are chosen as part of the
rocFFT build process, and are compiled for all the variants that
might be required at runtime.

The number of variants for each problem size is multiplied by
several stacking factors.  Any given problem size needs variants for:

* Each supported GPU architecture;

* Six interleaved/planar and in-place/out-of-place variants;

* Forward and inverse transforms;

* At least two precisions (single/double);

* Unit and non-unit strides;

* With and without callback support.

Runtime compilation has advantages over pre-compiling all the
above variants for all problem sizes.  These include:

* Handling of new problem sizes does not require rebuilding the
  library.

* Build times are faster.

* The library binary is smaller.  This in turn means:
  installation is faster; and difficulties arising from limited
  memory addressing in shared objects (the default memory model for
  shared objects built with ``-fPIC`` only allows for 2 GiB binaries,
  resulting in `build breaks`_) are reduced.

.. _build breaks: https://www.ibm.com/support/pages/intel-compiler-error-relocation-truncated-fit-rx8664pc32

Solution
========

HIP provides runtime compilation facilities in the ``hiprtc.h`` header.

The code generator is embedded into the library itself.  During FFT
planning, we can run the code generator to produce suitable FFT
source code for the specific problem being solved.  Then, we can
runtime-compile that source into GPU kernels that we launch at FFT
execution time.
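
As a rough sketch of that flow using the hipRTC API (the helper
function name and error handling here are illustrative, not rocFFT's
internal code):

.. code-block:: cpp

   // Sketch: compile generated FFT kernel source with hipRTC.
   #include <hip/hiprtc.h>
   #include <stdexcept>
   #include <string>
   #include <vector>

   std::vector<char> compile_kernel(const std::string& source,
                                    const std::string& gpu_arch)
   {
       hiprtcProgram prog;
       if(hiprtcCreateProgram(&prog, source.c_str(), "fft_kernel.cpp",
                              0, nullptr, nullptr) != HIPRTC_SUCCESS)
           throw std::runtime_error("hiprtcCreateProgram failed");

       // Target the architecture of the device the plan is built for.
       const std::string arch_flag = "--offload-arch=" + gpu_arch;
       const char*       options[] = {arch_flag.c_str()};
       if(hiprtcCompileProgram(prog, 1, options) != HIPRTC_SUCCESS)
       {
           // The program log explains the failure.
           size_t log_size = 0;
           hiprtcGetProgramLogSize(prog, &log_size);
           std::string log(log_size, '\0');
           hiprtcGetProgramLog(prog, &log[0]);
           hiprtcDestroyProgram(&prog);
           throw std::runtime_error("RTC compile failed: " + log);
       }

       // Retrieve the compiled code object for loading at execution time.
       size_t code_size = 0;
       hiprtcGetCodeSize(prog, &code_size);
       std::vector<char> code(code_size);
       hiprtcGetCode(prog, code.data());
       hiprtcDestroyProgram(&prog);
       return code;
   }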

Empirical testing of runtime compilation on our FFT kernels shows
that compilation times for single kernels range between 0.5 and 2
seconds on modern hardware, with more complicated kernels taking more
time.  Kernel execution time is identical to that of the
ahead-of-time compiled version.

Implementation
==============

Embedding and running the generator
-----------------------------------

A generator implemented in C++ can be built into the library like any
other C++ code that makes up the library.

During plan building, we can execute the generator code to
produce a string of source code for the required problem sizes.

The generator needs sufficient input to produce exactly the variant
that is required, e.g. length-336, inverse, out-of-place,
interleaved-to-planar, double precision.

Compilation
-----------

Compilation should be done during plan building, and the generated
kernels can be attached directly to the ``TreeNode`` for that step of
the FFT.

If the kernels are available on the ``TreeNode``, we will have less
overhead at execution time, since we don't need to do any work to
find the right kernel to run.
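
A minimal sketch of loading a compiled code object and holding on to
the resulting function handle; the ``RTCKernelHandle`` holder below is
a stand-in for whatever the ``TreeNode`` actually stores, not rocFFT's
internal layout:

.. code-block:: cpp

   // Sketch: load a hipRTC code object and look up its kernel.
   #include <hip/hip_runtime.h>
   #include <stdexcept>
   #include <vector>

   // Hypothetical holder for a compiled kernel, attached to a TreeNode.
   struct RTCKernelHandle
   {
       hipModule_t   module   = nullptr;
       hipFunction_t function = nullptr;
   };

   RTCKernelHandle load_kernel(const std::vector<char>& code,
                               const char* kernel_name)
   {
       RTCKernelHandle handle;
       // Load the code object produced by hipRTC.
       if(hipModuleLoadData(&handle.module, code.data()) != hipSuccess)
           throw std::runtime_error("hipModuleLoadData failed");
       // Look up the kernel entry point by its name.
       if(hipModuleGetFunction(&handle.function, handle.module, kernel_name)
          != hipSuccess)
           throw std::runtime_error("hipModuleGetFunction failed");
       return handle;
   }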

Caching kernels at runtime
--------------------------

If a process needs to create multiple plans that would compile the
same FFT kernel variant, it's worthwhile to reuse kernels we've
already compiled, saving compilation time on subsequent plans.

Compiled kernels may be persisted to disk for maximal reuse.
However, rocFFT may be used in distributed systems where the
file system is shared among multiple compute nodes, and having
multiple nodes all contend for the same shared file is problematic
for performance.

By default, kernels are only cached in memory, to prevent this
contention.  The cache location may still be overridden at runtime
using mechanisms described below.

Cache keys
^^^^^^^^^^

The cache keys need to be chosen carefully to ensure that an obsolete
kernel is not reused when a new version really **should** be
recompiled.

The kernel function name shall be the main key field in the cache.
The function name shall encode all the parameters by which kernel
functions could differ, including:

* scheme (e.g. whether this is a standard Stockham kernel, or a
  variant that does different read/write tiling)

* length (typically 1D length, may be 2D or more for single kernels
  that handle higher dimension transforms)

* placement (in-place or out-of-place)

* direction (forward or backward)

* input and output formats (planar, interleaved, real)

* precision (single or double)

* stride type (unit stride or non-unit stride)

* large 1D twiddle type and base (for kernels that need to integrate a
  twiddle multiplication step)

* callback type (run callbacks or don't run callbacks)

Encoding all of these parameters into the kernel name is necessary
anyway, so that logs and profilers will tell users and developers
exactly which kernel is running, even if it's been runtime-compiled.

Using just the kernel name as the main key is also helpful because
the caching code needn't be aware of all the possible parameters that
kernels could differ by.  New parameters can be added at any time,
and as long as the kernel names are updated accordingly, the cache
will just work.
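
For illustration, a name-building sketch; the exact encoding shown
here is hypothetical, not rocFFT's actual naming scheme:

.. code-block:: cpp

   // Sketch: encode variant parameters into the kernel/cache-key name.
   #include <sstream>
   #include <string>

   std::string kernel_cache_name(size_t length, bool forward, bool inplace,
                                 bool double_prec, bool unit_stride,
                                 bool callbacks)
   {
       std::ostringstream name;
       name << "fft_rtc_len" << length
            << (forward ? "_fwd" : "_back")
            << (inplace ? "_ip" : "_op")
            << (double_prec ? "_dp" : "_sp")
            << (unit_stride ? "_unit" : "_nonunit")
            << (callbacks ? "_cb" : "_nocb");
       return name.str();
   }

   // kernel_cache_name(336, false, false, true, true, false)
   // yields "fft_rtc_len336_back_op_dp_unit_nocb"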

The cache will also need to store other key fields to ensure that a
kernel is recompiled if any of these change:

* GPU architecture

* HIP runtime version

* Kernel generator version

In practice, these key fields ensure that users always run the
latest kernels that rocFFT provides, appropriate for the hardware
present.

User control of cache
^^^^^^^^^^^^^^^^^^^^^

Distributed workflows will want additional control over the cache.
For example, a workload that distributes FFT computation over many
MPI nodes will want to ensure that the kernels are built once
centrally rather than by each node.

MPI nodes might also have no access to disk (either shared with other
nodes or local to each node).

rocFFT needs to expose APIs to:

* Serialize the current cache to a library-allocated buffer

* Free the library-allocated serialization buffer

* Deserialize a buffer into a cache (which might need to be in-memory
  for diskless nodes)

The example MPI computation described above would be able to build
plans on the rank 0 node to populate the cache once.  Then, it can
use these new APIs along with MPI APIs to distribute the cache to
each work node.
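
A sketch of that pattern, assuming the cache-serialization entry
points exposed in ``rocfft.h`` (``rocfft_cache_serialize``,
``rocfft_cache_buffer_free``, ``rocfft_cache_deserialize``):

.. code-block:: cpp

   // Sketch: rank 0 serializes its populated cache; other ranks load it.
   #include <rocfft/rocfft.h>
   #include <mpi.h>

   void broadcast_rtc_cache(MPI_Comm comm)
   {
       int rank = 0;
       MPI_Comm_rank(comm, &rank);

       void*  buffer = nullptr;
       size_t length = 0;
       if(rank == 0)
           rocfft_cache_serialize(&buffer, &length);

       // Tell every rank how large the serialized cache is.
       unsigned long long len = length;
       MPI_Bcast(&len, 1, MPI_UNSIGNED_LONG_LONG, 0, comm);
       length = static_cast<size_t>(len);

       if(rank != 0)
           buffer = new char[length];
       MPI_Bcast(buffer, static_cast<int>(length), MPI_BYTE, 0, comm);

       if(rank != 0)
       {
           // Populate this node's (possibly in-memory) cache.
           rocfft_cache_deserialize(buffer, length);
           delete[] static_cast<char*>(buffer);
       }
       else
           rocfft_cache_buffer_free(buffer);
   }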

Backing store implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The cache may be written to disk, and if so it must be robust in the
face of concurrent access, crashes during library operation, and so
on.

We really would like the cache to have ACID properties of database
systems.

The easiest way to achieve this is to use SQLite to manage the
storage.  It's easily embeddable in our library (or is readily
available as its own library), and provides all the properties
we'd want for the storage backend.

It also provides APIs to serialize a database, as required for the
distributed workflows described above.
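
For example, SQLite's ``sqlite3_serialize`` (available when SQLite is
built with deserialization support, the default in recent versions)
produces a single buffer from an open database:

.. code-block:: cpp

   // Sketch: snapshot an open cache database into one contiguous buffer.
   #include <sqlite3.h>

   unsigned char* snapshot_cache_db(sqlite3* db, sqlite3_int64* size_out)
   {
       // "main" is the schema name of the primary database; the returned
       // buffer is malloc'd by SQLite and must be freed with sqlite3_free.
       return sqlite3_serialize(db, "main", size_out, 0);
   }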

Pre-built kernels
^^^^^^^^^^^^^^^^^

Even if rocFFT is prepared to runtime-compile any FFT kernel, we can
still pre-compile kernels by populating a cache at library build time
and shipping the cache with the library.

Cache location
^^^^^^^^^^^^^^

The main challenge here is installing this pre-built cache in a place
that the library will be able to find.

The easiest solution here, as employed by `other math libraries`_,
is to look for the cache file relative to the shared library itself.

.. _other math libraries: https://github.com/ROCmSoftwarePlatform/rocBLAS/blob/d8e00e169ccc7ca21211705643e85545e98e455a/library/src/tensile_host.cpp#L521

Environment variables can override the locations of caches used by
rocFFT.  During normal operation, we would expect one read-only cache
shipped with the library and one modifiable cache updated as the user
runs transforms that use new kernels.

We support two environment variables for these two locations:

* ``ROCFFT_RTC_SYS_CACHE_PATH`` - the pre-built, read-only system-level cache.
* ``ROCFFT_RTC_CACHE_PATH`` - the read-write user-level cache.

Note that if the library is linked statically, we will not be able
to find any files relative to the library.  The
``ROCFFT_RTC_SYS_CACHE_PATH`` environment variable will then be
required for rocFFT to find the system-level cache, but rocFFT will
still update the user-level cache and behave correctly without a
system-level cache.
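
A minimal sketch of this lookup order on POSIX systems; the
``dladdr``-based path discovery and the cache file name shown are
illustrative:

.. code-block:: cpp

   // Sketch: environment variable first, then a path next to the library.
   #include <dlfcn.h>
   #include <cstdlib>
   #include <filesystem>
   #include <string>

   std::string system_cache_path()
   {
       if(const char* env = std::getenv("ROCFFT_RTC_SYS_CACHE_PATH"))
           return env;

       // Find the shared library containing this function and look for
       // the cache file alongside it.
       Dl_info info;
       if(dladdr(reinterpret_cast<void*>(&system_cache_path), &info)
          && info.dli_fname)
       {
           namespace fs = std::filesystem;
           return (fs::path(info.dli_fname).parent_path()
                   / "rocfft_kernel_cache.db").string();
       }
       return {}; // static link or lookup failure: no system cache
   }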

Populating the cache
^^^^^^^^^^^^^^^^^^^^

Populating this shipped cache is done via a helper executable that is
built and run during the rocFFT build.  A separate helper executable
(which is not itself shipped with rocFFT) is necessary so that it can
share rocFFT's generator and RTC code, without requiring rocFFT to
expose extra symbols just for this task.

This helper should work at the kernel level, e.g. build Stockham
kernels for all desired combinations of:

* supported architectures (gfx908, gfx90a, gfx1030, etc.)
* precisions
* problem sizes
* array formats
* etc.

The criteria for which kernels to pre-build can be arbitrary.  Less
common choices will be runtime-compiled; runtime compilation remains
the fallback whenever a pre-built kernel is unavailable for whatever
reason.

An inferior option would be for the helper to work at the plan level
(i.e. use rocFFT to build a set of plans and save the resulting RTC
kernels).  However, creating plans involves doing a lot of other
unnecessary work, like generating twiddle tables and deciding on
buffer assignment.

Impact on tests
^^^^^^^^^^^^^^^

Accuracy tests see the largest runtime impact from this change,
since they run a huge number of problem sizes within a single
process.  The cost of generating and compiling a large variety of
kernel variants is therefore most painful here, and will grow as
more problem sizes are handled by the new generator.

An increase in test runtime is an unfortunate side effect of runtime
compilation.  The cost is more acceptable because the library's own
compile time has already been reduced before the tests run.

A possible solution here might be to do a parallel traversal of the
test cases, building rocFFT plans for each of them (but not actually
executing plans).  This would runtime-compile the whole suite's
kernels in parallel, which would save a lot of time.
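
A hedged sketch of that idea using the public plan API; the list of
lengths stands in for a real traversal of the test cases:

.. code-block:: cpp

   // Sketch: create (but never execute) plans in parallel to warm the cache.
   #include <rocfft/rocfft.h>
   #include <thread>
   #include <vector>

   void warm_rtc_cache(const std::vector<size_t>& lengths)
   {
       rocfft_setup();
       std::vector<std::thread> workers;
       for(size_t len : lengths)
           workers.emplace_back([len]() {
               rocfft_plan plan = nullptr;
               // Plan creation triggers generation and compilation,
               // populating the kernel cache as a side effect.
               rocfft_plan_create(&plan, rocfft_placement_inplace,
                                  rocfft_transform_type_complex_forward,
                                  rocfft_precision_single,
                                  1, &len, 1, nullptr);
               rocfft_plan_destroy(plan);
           });
       for(auto& t : workers)
           t.join();
       rocfft_cleanup();
   }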

Interaction with callbacks
^^^^^^^^^^^^^^^^^^^^^^^^^^

Callback-enabled FFTs require a different kernel variant to be
generated, but the decision of whether to actually run with a
callback is made by the user after the plan is constructed.

To solve this, we generate both a callback and non-callback variant
where necessary during plan creation.

Parallel compilation
^^^^^^^^^^^^^^^^^^^^

Because of the potential need for callback-enabled kernels, most
plans will be generated faster if kernels can be compiled in
parallel.  Unfortunately, hipRTC has process-wide locks that prevent
useful multithreading of compilation.

Instead, we can spawn a helper process for subsequent compilations if
a compilation is already in-progress in the original process.  This
helper would need to be shipped with the library, in a location
that's knowable by the library.  If we fail to find or spawn that
helper, compilation must fall back to compiling in-process.
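
Schematically, the decision might look like the following;
``try_compile_in_subprocess`` is a hypothetical wrapper around the
shipped helper, and ``compile_kernel`` is the in-process path
sketched earlier:

.. code-block:: cpp

   // Sketch: prefer in-process compilation when free, else the helper.
   #include <mutex>
   #include <string>
   #include <vector>

   std::vector<char> compile_kernel(const std::string&, const std::string&);
   std::vector<char> try_compile_in_subprocess(const std::string&,
                                               const std::string&); // hypothetical

   static std::mutex in_process_compile_lock;

   std::vector<char> compile_with_fallback(const std::string& source,
                                           const std::string& arch)
   {
       // Only one in-process hipRTC compile is useful at a time.
       std::unique_lock<std::mutex> lock(in_process_compile_lock,
                                         std::try_to_lock);
       if(lock.owns_lock())
           return compile_kernel(source, arch);

       // Another compile is in flight: try the helper process, and if it
       // cannot be found or spawned, wait and compile in-process anyway.
       auto code = try_compile_in_subprocess(source, arch);
       if(!code.empty())
           return code;
       lock.lock();
       return compile_kernel(source, arch);
   }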

Code organization
=================

The whole of rocFFT runtime compilation can be broken down into
separate subsystems:

1. Generating source to be compiled, further subdivided into
   generators for each type of kernel (Stockham, transpose,
   Bluestein, etc.).  Input specifications of the desired kernel
   include problem size, precision, result placement, and so on.

   Files to implement this are named:

   * rtc_stockham_gen.cpp
   * rtc_transpose_gen.cpp
   * etc.

2. Compiling source code into object code, which can be further subdivided:

   a. Compiling code in the current process
   b. Compiling code in a sub-process

   The files to implement these are named:

   * rtc_compile.cpp
   * rtc_subprocess.cpp

3. Reading/writing the cache of compiled object code.

   The file to implement this is named:

   * rtc_cache.cpp

4. Compiling and launching the correct kernel for a TreeNode in an
   FFT plan.  This subsystem would need to derive the correct input
   specifications for the generator, given the data in the TreeNode.
   It would also need to derive the correct launch arguments to pass
   to the kernel.

   Files to implement this are named:

   * rtc_stockham_kernel.cpp
   * rtc_transpose_kernel.cpp
   * etc.

   These files are named rtc_*_kernel.cpp because they implement
   subclasses of the generic RTCKernel type.

In this list, 1 and 2 are independent.  2b depends on 2a.  3 depends
on 1 and 2.  4 depends on 3.  2a requires the hipRTC library, 3
requires the SQLite library, and 4 requires the full HIP runtime
library (amdhip64).

Build-time processes that populate a cache to ship with the library
depend on 3.  The helper process to support parallel compilation
depends on 2a.

It's important to avoid using the full HIP runtime at build time:
Windows build environments in particular may not have sufficient
libraries or infrastructure to successfully load the full runtime,
but they are able to load hipRTC.

Future work
===========

Moving away from chosen problem sizes
-------------------------------------

Once the infrastructure is in place, we could consider enabling
runtime compilation for all FFT sizes, not just those that are chosen
ahead of time.  The generator is already able to auto-factorize
arbitrary sizes, though we haven't yet tested the limits of this
ability.

Copyright and disclaimer
========================

The information contained herein is for informational purposes only,
and is subject to change without notice. While every precaution has
been taken in the preparation of this document, it may contain
technical inaccuracies, omissions and typographical errors, and AMD is
under no obligation to update or otherwise correct this information.
Advanced Micro Devices, Inc. makes no representations or warranties
with respect to the accuracy or completeness of the contents of this
document, and assumes no liability of any kind, including the implied
warranties of non-infringement, merchantability or fitness for
particular purposes, with respect to the operation or use of AMD
hardware, software or other products described herein.  No license,
including implied or arising by estoppel, to any intellectual property
rights is granted by this document.  Terms and limitations applicable
to the purchase or use of AMD’s products are as set forth in a signed
agreement between the parties or in AMD's Standard Terms and
Conditions of Sale.

AMD is a trademark of Advanced Micro Devices, Inc. Other product names
used in this publication are for identification purposes only and may
be trademarks of their respective companies.

Copyright (c) 2022 - 2024 Advanced Micro Devices, Inc. All rights
reserved.