File: codegen.rst

package info (click to toggle)
rocfft 6.4.3-1
  • links: PTS, VCS
  • area: main
  • in suites: sid
  • size: 6,968 kB
  • sloc: cpp: 72,181; python: 6,506; sh: 387; xml: 204; makefile: 63
file content (339 lines) | stat: -rw-r--r-- 10,546 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
.. meta::
  :description: rocFFT documentation and API reference library
  :keywords: rocFFT, ROCm, API, documentation

.. _codegen:

********************************************************************
Code Generator Design Document for rocFFT
********************************************************************

Proposal
========

Create a new code generator for rocFFT.

Rationale
---------

The current code generator:

* dates from clFFT
* is based on string concatenation

Ideally, a new code generator:

* based on an abstract-syntax-tree (AST)
* generates faster, more robust kernels

ASTs allow generated code to be transformed and manipulated before
being emitted.  A concrete example of this for FFT kernels would be:
automatically translating a kernel from interleaved to planar format.

How the generator is designed and implemented is crucial for both
conciseness and ease-of-use.


Required kernels (scope)
========================

For rocFFT, we need/want to generate:

* Host functions to launch the FFT kernels
* Tiled (row/column) + strided + batched Stockham kernels for
  arbitrary factorization
* May want to extend to Cooley-Tukey kernels as well

Kernels need to handle all combinations of:

* single/double precision (and be extendable to half-float and bfloat)
* in-place/out-of-place
* planar/interleaved
* real/complex
* small/large twiddle tables
* unit/non-unit stride
* transposed output, including with twiddle multiplies for large 1D
* fusing with pre and post-processing kernels (e.g. real even-length)

Ideally any configuration/runtime parameters required by the kernels
would be defined in a single place to avoid repetition between rocFFT
and the generator.

We have flexibility in handling these combinations at compile-time or
run-time.  For example, multiple kernels could be generated for
single/double precision, but unit/non-unit stride could be handled at
runtime.

Fundamentally, all multidimensional and batched FFTs can be written in
terms of 1D transforms (with affine indexing).  As such, an FFT is
broken down into:

* A *host* function that is aware of dimensions, strides, batches, and
  tiling.  This function would be responsible for determining how the
  problem will be broken down into GPU thread blocks.
* A *global* function that is aware of GPU thread blocks, dimensions,
  strides, batches, and tiling.  This function would be responsible
  for determining offsets and strides for the device function, and
  declaring LDS memory buffers.
* A *device* function that is passed offsets and strides, and is aware
  of GPU threads.  The device function would perform a (short) 1D
  transform.

A device function may be called so that a thread block is actually
transforming multiple batches.  As such, indexes (the spatial index in
the FFT) should be computed as:

.. code-block:: c

   int fft_index = threadIdx.x % width;


Tiling
======

Launching device kernels in a way that traverses memory in tiles will
be handled at the host/global level.

Kernels need to support reading/writing in columns/rows.  These are
the block CC/RC/CR flavors (where C and R refer to column and row) of
the existing kernels.


Strides and batches
===================

Host
----

Host/global functions should support arbitrary dimensions, lengths,
strides, offsets, and batches.

Users should be allowed to store their arrays arbitrarily.  For an
:math:`N` dimensional dataset, the flat index :math:`a` corresponding
to indices :math:`(i_1,\ldots,i_N,i_b)`, where :math:`i_b` is the
batch index, is given by

.. math::

   a(i_1,\ldots,i_N,i_b) = s_b i_b + \sum_{d=1}^N s_d i_d

Where :math:`s_d` is the stride along dimension :math:`d`.  To support
these strides, the device function to compute the FFT along dimension
:math:`D` would be passed:

.. code-block:: c

   int offset = 0;
   offset += batch_index * batch_stride;
   for (int d=0; d < N; ++d)
     if (d != D)
       offset += spatial_index[d] * strides[d];

   int stride = strides[D];

For example, in three dimensions, to compute the FFT along the
y-dimension given x and z indices ``i`` and ``k`` for batch ``b``,
the device function would be passed:

.. code-block:: c

   int offset = 0;
   offset += b * batch_stride;
   offset += i * strides[0];
   offset += k * strides[2];

   int stride = strides[1];

Device
------

Device functions should support arbitrary offsets and strides.  Array
indexes in device functions should be computed as, eg:

.. code-block:: c

   int fft_index = threadIdx.x % width;
   int array_index = offset + fft_index * stride;


Large twiddle tables
====================

Large 1D transforms are decomposed into multiple transforms.  To
reduce the size of twiddle tables, rotations can be decomposed into
multiple stages as well.  For example, the rotation through
:math:`2\pi \cdot 280 / 256^2` can be decomposed into :math:`2\pi
\cdot 1 / 256 + 2\pi 24 / 256^2`.  The resulting twiddle table
contains 512 entries instead of 65536 entries.

Generated kernels should support these "large twiddle tables".


Launching
=========

For a specific transform length, the generator is free to choose among
several algorithms and related tuning parameters.  These choices may
influence how the kernel is launched.  The generator will create both
the kernel and the accompanying struct, which gives indications of how
the kernel may be used in both rocFFT and other applications.

The generator will populate a function pool with structs of the form

.. code-block:: c++

    struct ROCFFTKernel
    {
        void *device_function = nullptr;
        std::vector<int> factors;
        int              transforms_per_block = 0;
        int              workgroup_size = 0;
        // ...
    };

This moves the responsibility of figuring how a kernel should be
launched to the generator.

Currently, kernels are launched with:

* dimension
* number of blocks (batches)
* number of threads (threads per batch; kernel parameter)
* stream
* twiddle table
* length(s)
* strides
* batch count
* in/out buffers


Implementation
==============

The code generator will be implemented in Python using only standard
modules.

The AST will be represented as a tree structure, with nodes in the
tree representing operations, such as assignment, addition, or a block
containing multiple operations.  Nodes will be represented as objects
(e.g., ``Add``) extending the base class ``BaseNode``.  Operands will be
stored in a simple list called ``args``:

.. code-block:: python

    class BaseNode:
        args: List[Any]


To facilitate building ASTs, the base node will have a constructor
that simply stores its arguments as operands:

.. code-block:: python

    class BaseNode:
        args: List[Any]
        def __init__(self, *args, **kwargs):
            self.args = list(args)


To facilitate rewriting ASTs, node object's constructors should accept
a simple list of argument/operands.

This, for example, allows a depth-first tree re-write to be
implemented trivially as:

.. code-block:: python

    def depth_first(x, f):
        '''Depth first traversal of the AST in 'x'.  Each node is transformed by 'f(x)'.'''
        if isinstance(x, BaseNode):
            y = type(x)(*[ depth_first(a, f) for a in x.args ])
            return f(y)
        return f(x)

To emit code, each node must implement ``__str__``.  For example:

.. code-block:: python

    class Add(BaseNode):
        def __str__(self):
            return ' + '.join([ str(x) for x in self.args ])


Stockham tiling implementation
------------------------------

To support tiling, the *global* function is responsible for loading
data from global memory into LDS memory in a tiled manner.  Once in
LDS memory, a singly strided *device* function performs an
interleaved, in-place FFT entirely within LDS.

Polymorphism will be used to abstract tiling strategies.  Different
tiling strategies should extend the ``StockhamTiling`` object and
overload the ``load_from_global`` and ``store_to_global`` methods.

For example:

.. code-block:: python

    tiling = StockhamTilingRR()
    scheme = StockhamDeviceKernelUWide()

    body = StatementList()
    body += tiling.compute_offsets(...)
    body += tiling.load_from_global(out=lds, in=global_buffer)
    body += scheme.fft(lds)
    body += tiling.store_to_global(out=global_buffer, in=lds)

Different tiling strategies may require new template parameters and/or
function arguments.  Tiling strategies can manipulate the following methods:

* ``add_templates``
* ``add_global_arguments``
* ``add_device_arguments``
* ``add_device_call_arguments``

Each of these methods is passed a ``TemplateList`` or
``ArgumentList`` argument, and should return a new template/argument
list with any extra parameters added.


Large twiddle tables
--------------------

Device kernels may need to apply additional twiddles during their
execution.  These extra twiddle tables are implemented similarly to
tiling.  Different twiddle table strategies should extend the
``StockhamLargeTwiddles`` object and overload the ``load`` and
``multiply`` methods.

Twiddle tables may also require additional templates and arguments.
See :ref:`Stockham tiling implementation`.

Copyright and disclaimer
========================

The information contained herein is for informational purposes only,
and is subject to change without notice. While every precaution has
been taken in the preparation of this document, it may contain
technical inaccuracies, omissions and typographical errors, and AMD is
under no obligation to update or otherwise correct this information.
Advanced Micro Devices, Inc. makes no representations or warranties
with respect to the accuracy or completeness of the contents of this
document, and assumes no liability of any kind, including the implied
warranties of non-infringement, merchantability or fitness for
particular purposes, with respect to the operation or use of AMD
hardware, software or other products described herein.  No license,
including implied or arising by estoppel, to any intellectual property
rights is granted by this document.  Terms and limitations applicable
to the purchase or use of AMD’s products are as set forth in a signed
agreement between the parties or in AMD's Standard Terms and
Conditions of Sale.

AMD is a trademark of Advanced Micro Devices, Inc. Other product
names used in this publication are for identification purposes only
and may be trademarks of their respective companies.

Copyright (C) 2021 - 2024 Advanced Micro Devices, Inc. All rights reserved.