File: CHANGELOG.md

package info (click to toggle)
rocfft 6.1.2-1
links: PTS, VCS
area: main
in suites: forky, sid, trixie
size: 5,672 kB
sloc: cpp: 55,735; python: 5,774; sh: 428; xml: 204; makefile: 56
file content (479 lines) | stat: -rw-r--r-- 16,558 bytes
# Changelog for rocFFT

Documentation for rocFFT is available at
[https://rocm.docs.amd.com/projects/rocFFT/en/latest/](https://rocm.docs.amd.com/projects/rocFFT/en/latest/).

## rocFFT 1.0.27 for ROCm 6.1.1

### Fixes

* Fixed kernel launch failure on execute of very large odd-length real-complex transforms.

### Additions

* Enable multi-gpu testing on systems without direct GPU-interconnects

## rocFFT 1.0.26 for ROCm 6.1.0

### Changes

* Multi-device FFTs now allow batch greater than 1
* Multi-device, real-complex FFTs are now supported
* rocFFT now statically links libstdc++ when only `std::experimental::filesystem` is available (to guard
  against ABI incompatibilities with newer libstdc++ libraries that include `std::filesystem`)

## rocFFT 1.0.25 for ROCm 6.0.0

### Additions

* Implemented experimental APIs to allow computing FFTs on data distributed across multiple devices
  in a single process

  * `rocfft_field` is a new type that can be added to a plan description to describe the layout of FFT
    input or output
  * `rocfft_field_add_brick` can be called to describe the brick decomposition of an FFT field, where each
    brick can be assigned a different device

  These interfaces are still experimental and subject to change. We are interested in getting feedback.
  You can raise questions and concerns by opening issues in the
  [rocFFT issue tracker](https://github.com/ROCmSoftwarePlatform/rocFFT/issues).

  Note that multi-device FFTs currently have several limitations (we plan to address these in future
  releases):

  * Real-complex (forward or inverse) FFTs are not supported
  * Planar format fields are not supported
  * Batch (the `number_of_transforms` provided to `rocfft_plan_create`) must be 1
  * FFT input is gathered to the current device at run time, so all FFT data must fit on that device

### Optimizations

* Improved the performance of several 2D/3D real FFTs supported by `2D_SINGLE` kernel. Offline
  tuning provides more optimization for fx90a
* Removed an extra kernel launch from even-length, real-complex FFTs that use callbacks

### Changes

* Built kernels in a solution map to the library kernel cache
* Real forward transforms (real-to-complex) no longer overwrite input; rocFFT may still overwrite real
  inverse (complex-to-real) input, as this allows for faster performance

* `rocfft-rider` and `dyna-rocfft-rider` have been renamed to `rocfft-bench` and `dyna-rocfft-bench`;
  these are controlled by the `BUILD_CLIENTS_BENCH` CMake option
  * Links for the former file names are installed, and the former `BUILD_CLIENTS_RIDER` CMake option
    is accepted for compatibility, but both will be removed in a future release
* Binaries in debug builds no longer have a `-d` suffix

### Fixes

* rocFFT now correctly handles load callbacks that convert data from a smaller data type (e.g., 16-bit
  integers -> 32-bit float)

## rocFFT 1.0.24 for ROCm 5.7.0

### Optimizations

* Improved the performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use
  Bluestein's algorithm

### Additions

* Implemented a solution map version converter and finished the first conversion from ver.0 to ver.1
  * Version 1 removes some incorrect kernels (sbrc/sbcr using `half_lds`)

### Changes

* Moved `rocfft_rtc_helper` executable to the `lib/rocFFT` directory on Linux
* Moved library kernel cache to the `lib/rocFFT` directory

## rocFFT 1.0.23 for ROCm 5.6.0

### Additions

* Implemented half-precision transforms; these can be requested by passing `rocfft_precision_half` to
  `rocfft_plan_create`
* Implemented a hierarchical solution map that saves information on how to decompose a problem
  and the kernels that are used
* Implemented a first version of offline-tuner to support tuning kernels for C2C and Z2Z problems

### Changes

* Replaced `std::complex` with hipComplex data types for the data generator
* FFT plan dimensions are now sorted to be row-major internally where possible, which produces
  better plans if the dimensions were accidentally specified in a different order (column-major, for
  example)
* Added the `--precision` argument to benchmark and test clients (`--double` is still accepted but is
  deprecated as a method to request a double-precision transform)
* Improved performance test suite statistical framework

### Fixes

* Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch
  failure

## rocFFT 1.0.22 for ROCm 5.5.0

### Optimizations

* Improved the performance of 1D lengths < 2048 that use Bluestein's algorithm
* Reduced code generation time during plan creation
* Optimized 3D R2C and C2R lengths 32, 84, 128
* Optimized batched small 1D R2C and C2R cases

### Additions

* Added gfx1101 to default `AMDGPU_TARGETS`

### Changes

* Moved client programs to C++17
* Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled
* Moved transpose, real-complex, Bluestein, and Stockham kernels to the library kernel cache

### Fixes

* Removed zero-length twiddle table allocations, which fixes errors from `hipMallocManaged`
* Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are
  present

## rocFFT 1.0.21 for ROCm 5.4.3

### Fixes

* Removed the source directory from `rocm_install_targets` to prevent the installation of `rocfft.h` in an
  unintended location

## rocFFT 1.0.20 for ROCm 5.4.1

### Fixes

* Fixed incorrect results on strided large 1D FFTs where batch size does not equal the stride

## rocFFT 1.0.19 for ROCm 5.4.0

### Optimizations

* Optimized some strided large 1D plans

### Additions

* Added the `rocfft_plan_description_set_scale_factor` API to efficiently multiply each output element of
  an FFT by a given scaling factor
* Created a `rocfft_kernel_cache.db` file next to the installed library; SBCC, CR, and RC kernels are
  moved to this file when built with the library, and are runtime-compiled for new GPU architectures
* Added gfx1100 and gfx1102 to default `AMDGPU_TARGETS`

### Changes

* Moved the runtime compilation cache to in-memory by default
  * A default on-disk cache can encounter contention problems on multi-node clusters with a shared
    filesystem
  * rocFFT can still use an on-disk cache by setting the `ROCFFT_RTC_CACHE_PATH` environment
    variable

## rocFFT 1.0.18 for ROCm 5.3.0

### Changes

* The runtime compilation cache now looks for environment variables `XDG_CACHE_HOME` (on Linux)
  and `LOCALAPPDATA` (on Windows) before falling back to `HOME`
* Moved computation of the twiddle table from the host to the device

### Optimizations

* Optimized 2D R2C and C2R to use 2-kernel plans where possible
* Improved performance of the Bluestein algorithm
* Optimized sbcc-168 and 100 by using half-LDS
* Optimized length-280 2D and 3D transforms
* Added kernels for factorizable 1D lengths < 128

### Fixes

* Fixed occasional failures to parallelize runtime compilation of kernels (failures would be retried
  serially and ultimately succeed, but this would take extra time)
* Fixed failures of some R2C 3D transforms that use the unsupported `TILE_UNALGNED` SBRC kernels
  (an example is 98^3 R2C out-of-place)
* Fixed bugs in the `SBRC_ERC` type

## rocFFT 1.0.17 for ROCm 5.2.0

### Additions

* Packages for test and benchmark executables on all supported operating systems using CPack
* Added file and folder reorganization changes, with backward compatibility support, using
  `rocm-cmake` wrapper functions

### Changes

* Improved reuse of twiddle memory between plans
* Set a default load/store callback when only one callback type is set via the API (for improved
  performance)
* Updated the GoogleTest dependency to version 1.11

### Optimizations

* Introduced a new access pattern of LDS (non-linear) and applied it on sbcc kernels len 64 and 81 for a
  performance improvement
* Applied `lds-non-linear`, `direct-load-to-register`, and `direct-store-from-register` on sbcr kernels for
  a performance improvement

### Fixes

* Correctness of certain transforms with unusual strides
* Incorrect handling of user-specified stream for runtime-compiled kernels
* Incorrect buffer allocation in `rocfft-test` on in-place transforms with different input and output sizes

## rocFFT 1.0.16 for ROCm 5.1.0

### Changes

* Supported unaligned tile dimension for `SBRC_2D` kernels
* Improved test and benchmark infrastructure by adding RAII
* Enabled runtime compilation of length-2304 FFT kernel during plan creation
* Added tokenizer for test suite
* Reduce twiddle memory requirements for even-length, real-complex transforms
* Clients can now be built separately from the main library

### Optimizations

* Optimized more large 1D cases by using `L1D_CC` plan
* Optimized the 3D 200^3 C2R case
* Optimized the 1D 2^30 double precision on MI200
* Added padding to work buffer sizes to improve performance in many cases

### Fixes

* Fixed the correctness of some R2C transforms with unusual strides

### Removals

* The hipFFT API (header) has been removed; use the
  [hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT) package or repository to obtain the API

## rocFFT 1.0.15 for ROCm 5.0.0

### Changes

* Enabled runtime compilation of single FFT kernels > length 1024
* Re-aligned the split device library into four roughly equal libraries
* Implemented the FuseShim framework to replace the original OptimizePlan
* Implemented the generic buffer-assignment framework
  * The buffer assignment is no longer performed by each node--we designed a generic algorithm to
    test and pick the best assignment path
  * With the help of FuseShim, we can achieve the most kernel-fusions possible
* Don't read the imaginary part of the DC and Nyquist modes for even-length complex-to-real
  transforms

### Optimizations

* Optimized twiddle conjugation; complex-to-complex inverse transforms should now have similar
  performance to forward transforms
* Improved performance of single-kernel, small 2D transforms

## rocFFT 1.0.14 for ROCm 4.5.0

### Optimizations

* Optimized SBCC kernels of lengths 52, 60, 72, 80, 84, 96, 104, 108, 112, 160, 168, 208, 216, 224, and
  240 with a new kernel generator

### Additions

* Added support for Windows 10 as a build target

### Changes

* Packaging has been split into a runtime package (`rocfft`) and a development package
  (`rocfft-devel`):
  The development package depends on the runtime package. When installing the runtime package,
  the package manager will suggest the installation of the development package to aid users
  transitioning from the previous version's combined package. This suggestion by package manager is
  for all supported operating systems (except CentOS 7) to aid in the transition. The `suggestion`
  feature in the runtime package is introduced as a deprecated feature and will be removed in a future
  ROCm release.

### Fixes

* Fixed validation failures for even-length R2C inplace 2D and 3D cubics sizes, such as 100^2 (or ^3),
  200^2 (or ^3), and 256^2 (or ^3)
  * We combine two kernels (`r2c-transpose`) instead of combining the three kernels
    (`stockham-r2c-transpose`)

### Changes

* Split 2D device code into separate libraries

## rocFFT 1.0.13 for ROCm 4.4.0

### Optimizations

* Improved plans by removing unnecessary transpose steps
* Optimized scheme selection for 3D problems
  * Imposed fewer restrictions on `3D_BLOCK_RC` selection (more problems can use `3D_BLOCK_RC` and
    have performance gains)
  * Enabled `3D_RC`; some 3D problems with SBCC-supported z-dim can use fewer kernels to get
    benefits
  * Forced `--length` 336 336 56 (dp) to use faster `3D_RC` to prevent it from being skipped by a
    conservative threshold test
* Optimized some even-length R2C/C2R cases by doing more in-place operations and combining
  pre- and post-processing into Stockham kernels
* Added radix-17

### Additions

* Added a new kernel generator for select fused 2D transforms

### Fixes

* Improved large 1D transform decompositions

## rocFFT 1.0.12 for ROCm 4.3.0

### Changes

* Re-split device code into single-precision, double-precision, and miscellaneous kernels

### Fixes

* Fixed potential crashes in double-precision planar->planar transpose
* Fixed potential crashes in 3D transforms with unusual strides for SBCC-optimized sizes
* Improved buffer placement logic

### Additions

* Added a new kernel generator for select lengths; new kernels have improved performance
* Added public `rocfft_execution_info_set_load_callback` and`rocfft_execution_info_set_store_callback`
  API functions to allow running extra logic when loading data from and storing data to global
  memory during a transform

### Removals

* Removed R2C pair schemes and kernels

### Optimizations

* Optimized 2D and 3D R2C 100 and 1D Z2Z 2500
* Reduced number of kernels for 2D/3D sizes where higher dimension is 64, 128, 256

### Fixes

* Fixed potential crashes in 3D transforms with unusual strides, for SBCC-optimized sizes

## rocFFT 1.0.11 for ROCm 4.2.0

### Changes

* Move device code into the main library

### Optimizations

* Improved performance for single-precision kernels exercising all except radix-2/7 butterfly ops
* Minor optimization for C2R 3D 100 and 200 cube sizes
* Optimized some C2C and R2C 3D 64, 81, 100, 128, 200, and 256 rectangular sizes
* When factoring, test to see if the remaining length is explicitly supported
* Explicitly added radix-7 lengths 14, 21, and 224 to list of supported lengths
* Optimized R2C 2D and 3D 128, 200, and 256 cube sizes

### Known issues

* Fixed potential crashes in small 3D transforms with unusual strides
  ([issue 311](https://github.com/ROCmSoftwarePlatform/rocFFT/issues/311))
* Fixed potential crashes when running transforms on multiple devices
  ([issue 310](https://github.com/ROCmSoftwarePlatform/rocFFT/issues/310))

## rocFFT 1.0.10 for ROCm 4.1.0

### Additions

* Explicitly specify `MAX_THREADS_PER_BLOCK` through `__launch_bounds_` for all kernels
* Switched to a new syntax for specifying AMD GPU architecture names and features

### Optimizations

* Optimized C2C and R2C 3D 64, 81, 100, 128, 200, and 256 cube sizes
* Improved the performance of the standalone out-of-place transpose kernel
* Optimized the 1D length 40000 C2C case
* Enabled radix-7 for size 336
* New radix-11 and radix-13 kernels; used in length 11 and 13 (and some of their multiples)
  transforms

### Changes

* rocFFT now automatically allocates a work buffer if the plan requires one and none is provided
* An explicit `rocfft_status_invalid_work_buffer` error is now returned when a work buffer of insufficient
  size is provided
* Updated online documentation
* Updated Debian package name version with separated underscore ( _ )
* Adjusted accuracy test tolerances and how they are compared

### Fixes

* Fixed a 4x4x8192 accuracy failure

## rocFFT 1.0.8 for ROCm 3.10.0

### Optimizations

* Optimized the 1D length 10000 C2C case

### Changes

* Added the `BUILD_CLIENTS_ALL` CMake option

### Fixes

* Fixed the correctness of SBCC and SBRC kernels with non-unit strides
* Fixed fused C2R kernel when a Bluestein transform follows it

## rocFFT 1.0.7 for ROCm 3.9.0

### Optimizations

* New R2C and C2R fused kernels to combine pre- and post-processing steps with transpose
* Enabled diagonal transpose for 1D and 2D power-of-2 cases
* New single kernels for small power-of-2, 3, and 5 sizes
* Added more radix-7 kernels

### Changes

* Explicitly disabled XNACK and SRAM-ECC features on AMDGPU hardware

### Fixes

* Fixed 2D C2R transform with length 1 on one dimension
* Fixed a potential thread unsafety in logging

## rocFFT 1.0.6 for ROCm 3.8.0

### Optimizations

* Improved the performance of 1D batch-paired R2C transforms of odd length
* Added some radix-7 kernels
* Improved the performance for 1D length 6561 and 10000
* Improved the performance for certain 2D transform sizes

### Changes

* Allowed a static library build with `BUILD_SHARED_LIBS=OFF` CMake option
* Updated GoogleTest dependency to version 1.10

### Fixes

* Correctness of certain large 2D sizes

## rocFFT 1.0.5 for ROCM 3.7.0

### Optimizations

* Optimized C2C power-of-2 middle sizes

### Changes

* Parallelized work in unit tests and eliminated duplicate cases

### Fixes

* Correctness of certain large 1D, and 2D power-of-3 and 5 sizes
* Incorrect buffer assignment for some even-length R2C transforms
* `<cstddef>` inclusion on C compilers
* Incorrect results on non-unit strides with SBCC/SBRC kernels