# Changelog for rocFFT
Documentation for rocFFT is available at
[https://rocm.docs.amd.com/projects/rocFFT/en/latest/](https://rocm.docs.amd.com/projects/rocFFT/en/latest/).
## rocFFT 1.0.27 for ROCm 6.1.1
### Fixes
* Fixed kernel launch failure on execute of very large odd-length real-complex transforms.
### Additions
* Enabled multi-GPU testing on systems without direct GPU interconnects
## rocFFT 1.0.26 for ROCm 6.1.0
### Changes
* Multi-device FFTs now allow batch greater than 1
* Multi-device, real-complex FFTs are now supported
* rocFFT now statically links libstdc++ when only `std::experimental::filesystem` is available (to guard
against ABI incompatibilities with newer libstdc++ libraries that include `std::filesystem`)
## rocFFT 1.0.25 for ROCm 6.0.0
### Additions
* Implemented experimental APIs to allow computing FFTs on data distributed across multiple devices
in a single process
* `rocfft_field` is a new type that can be added to a plan description to describe the layout of FFT
input or output
* `rocfft_field_add_brick` can be called to describe the brick decomposition of an FFT field, where each
brick can be assigned a different device
These interfaces are still experimental and subject to change. We are interested in getting feedback.
You can raise questions and concerns by opening issues in the
[rocFFT issue tracker](https://github.com/ROCmSoftwarePlatform/rocFFT/issues).
Note that multi-device FFTs currently have several limitations (we plan to address these in future
releases):
* Real-complex (forward or inverse) FFTs are not supported
* Planar format fields are not supported
* Batch (the `number_of_transforms` provided to `rocfft_plan_create`) must be 1
* FFT input is gathered to the current device at run time, so all FFT data must fit on that device
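
To make the brick decomposition more concrete, the following Python sketch splits a field along its first (slowest) dimension into contiguous bricks, one per device. It only illustrates the idea and does not call rocFFT. The names `Brick` and `split_into_bricks`, and the convention of an inclusive lower corner and an exclusive upper corner, are assumptions made for this example rather than part of the rocFFT API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Brick:
    """Hypothetical description of one brick: index bounds plus owning device."""
    lower: Tuple[int, ...]  # inclusive lower corner (assumed convention)
    upper: Tuple[int, ...]  # exclusive upper corner (assumed convention)
    device: int


def split_into_bricks(lengths: Tuple[int, ...], num_devices: int) -> List[Brick]:
    """Split the first dimension as evenly as possible across the devices."""
    if num_devices < 1 or num_devices > lengths[0]:
        raise ValueError("need 1 <= num_devices <= lengths[0]")
    base, extra = divmod(lengths[0], num_devices)
    bricks = []
    start = 0
    for device in range(num_devices):
        # The first `extra` devices receive one additional slab each.
        size = base + (1 if device < extra else 0)
        lower = (start,) + (0,) * (len(lengths) - 1)
        upper = (start + size,) + tuple(lengths[1:])
        bricks.append(Brick(lower, upper, device))
        start += size
    return bricks


bricks = split_into_bricks((10, 8, 8), 4)
for brick in bricks:
    print(brick)
```

In a real application, each brick's bounds and device would then be passed to the experimental field interfaces described above; the exact calls and argument order should be taken from the rocFFT documentation.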
### Optimizations
* Improved the performance of several 2D/3D real FFTs supported by the `2D_SINGLE` kernel. Offline
tuning provides more optimization for gfx90a
* Removed an extra kernel launch from even-length, real-complex FFTs that use callbacks
### Changes
* Built kernels in a solution map to the library kernel cache
* Real forward transforms (real-to-complex) no longer overwrite input; rocFFT may still overwrite real
inverse (complex-to-real) input, as this allows for faster performance
* `rocfft-rider` and `dyna-rocfft-rider` have been renamed to `rocfft-bench` and `dyna-rocfft-bench`;
these are controlled by the `BUILD_CLIENTS_BENCH` CMake option
* Links for the former file names are installed, and the former `BUILD_CLIENTS_RIDER` CMake option
is accepted for compatibility, but both will be removed in a future release
* Binaries in debug builds no longer have a `-d` suffix
### Fixes
* rocFFT now correctly handles load callbacks that convert data from a smaller data type (e.g., 16-bit
integers -> 32-bit float)
## rocFFT 1.0.24 for ROCm 5.7.0
### Optimizations
* Improved the performance of complex forward/inverse 1D FFTs (2049 <= length <= 131071) that use
Bluestein's algorithm
### Additions
* Implemented a solution map version converter and finished the first conversion from ver.0 to ver.1
* Version 1 removes some incorrect kernels (sbrc/sbcr using `half_lds`)
### Changes
* Moved `rocfft_rtc_helper` executable to the `lib/rocFFT` directory on Linux
* Moved library kernel cache to the `lib/rocFFT` directory
## rocFFT 1.0.23 for ROCm 5.6.0
### Additions
* Implemented half-precision transforms; these can be requested by passing `rocfft_precision_half` to
`rocfft_plan_create`
* Implemented a hierarchical solution map that saves information on how to decompose a problem
and the kernels that are used
* Implemented a first version of offline-tuner to support tuning kernels for C2C and Z2Z problems
### Changes
* Replaced `std::complex` with hipComplex data types for the data generator
* FFT plan dimensions are now sorted to be row-major internally where possible, which produces
better plans if the dimensions were accidentally specified in a different order (column-major, for
example)
* Added the `--precision` argument to benchmark and test clients (`--double` is still accepted but is
deprecated as a method to request a double-precision transform)
* Improved performance test suite statistical framework
### Fixes
* Fixed over-allocation of LDS in some real-complex kernels, which was resulting in kernel launch
failure
## rocFFT 1.0.22 for ROCm 5.5.0
### Optimizations
* Improved the performance of 1D lengths < 2048 that use Bluestein's algorithm
* Reduced code generation time during plan creation
* Optimized 3D R2C and C2R lengths 32, 84, 128
* Optimized batched small 1D R2C and C2R cases
### Additions
* Added gfx1101 to default `AMDGPU_TARGETS`
### Changes
* Moved client programs to C++17
* Moved planar kernels and infrequently used Stockham kernels to be runtime-compiled
* Moved transpose, real-complex, Bluestein, and Stockham kernels to the library kernel cache
### Fixes
* Removed zero-length twiddle table allocations, which fixes errors from `hipMallocManaged`
* Fixed incorrect freeing of HIP stream handles during twiddle computation when multiple devices are
present
## rocFFT 1.0.21 for ROCm 5.4.3
### Fixes
* Removed the source directory from `rocm_install_targets` to prevent the installation of `rocfft.h` in an
unintended location
## rocFFT 1.0.20 for ROCm 5.4.1
### Fixes
* Fixed incorrect results on strided large 1D FFTs where batch size does not equal the stride
## rocFFT 1.0.19 for ROCm 5.4.0
### Optimizations
* Optimized some strided large 1D plans
### Additions
* Added the `rocfft_plan_description_set_scale_factor` API to efficiently multiply each output element of
an FFT by a given scaling factor
* Created a `rocfft_kernel_cache.db` file next to the installed library; SBCC, CR, and RC kernels are
moved to this file when built with the library, and are runtime-compiled for new GPU architectures
* Added gfx1100 and gfx1102 to default `AMDGPU_TARGETS`
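
As a rough illustration of what a per-element scale factor does, the Python sketch below uses a naive DFT (not rocFFT) in which every output element is multiplied by `scale`. For an unnormalized forward/inverse DFT pair, scaling the inverse by 1/N recovers the original input. The function name `dft` and its `sign` and `scale` parameters are assumptions of this example and do not mirror the rocFFT API.

```python
import cmath
from typing import List


def dft(x: List[complex], sign: int, scale: float = 1.0) -> List[complex]:
    """Naive O(N^2) DFT; sign=-1 is forward, sign=+1 is inverse.

    Each output element is multiplied by `scale`, so the scaling happens
    as part of the transform rather than in a separate pass over the data.
    """
    n = len(x)
    out = []
    for k in range(n):
        acc = 0j
        for m in range(n):
            acc += x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n)
        out.append(scale * acc)
    return out


signal = [1 + 0j, 2 - 1j, 0 + 3j, -1 + 0.5j]
n = len(signal)
spectrum = dft(signal, sign=-1)                    # unscaled forward transform
roundtrip = dft(spectrum, sign=+1, scale=1.0 / n)  # inverse scaled by 1/N
max_error = max(abs(a - b) for a, b in zip(signal, roundtrip))
print(max_error)
```

With rocFFT, the equivalent scaling would be requested through `rocfft_plan_description_set_scale_factor` on the plan description, which avoids a separate scaling kernel.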
### Changes
* Moved the runtime compilation cache to in-memory by default
* A default on-disk cache can encounter contention problems on multi-node clusters with a shared
filesystem
* rocFFT can still use an on-disk cache by setting the `ROCFFT_RTC_CACHE_PATH` environment
variable
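
One possible way to opt back in to an on-disk cache is to set `ROCFFT_RTC_CACHE_PATH` in the environment of the process that uses rocFFT. The Python sketch below builds such an environment. The helper name `env_with_rtc_cache`, the per-host file name, and the assumption that the variable holds a file path (rather than a directory) are all choices of this example; check the rocFFT documentation for the expected value.

```python
import os
import socket
from typing import Dict, Mapping, Optional


def env_with_rtc_cache(cache_dir: str,
                       base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with ROCFFT_RTC_CACHE_PATH set.

    A per-host file name is used (an assumption of this sketch) so that
    nodes sharing a filesystem do not all write to the same cache file.
    """
    env = dict(os.environ if base_env is None else base_env)
    host = socket.gethostname() or "unknown-host"
    env["ROCFFT_RTC_CACHE_PATH"] = os.path.join(cache_dir, f"rocfft_rtc_{host}.db")
    return env


env = env_with_rtc_cache("/tmp/rocfft-cache", base_env={"PATH": "/usr/bin"})
print(env["ROCFFT_RTC_CACHE_PATH"])
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching an application that links against rocFFT.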
## rocFFT 1.0.18 for ROCm 5.3.0
### Changes
* The runtime compilation cache now looks for environment variables `XDG_CACHE_HOME` (on Linux)
and `LOCALAPPDATA` (on Windows) before falling back to `HOME`
* Moved computation of the twiddle table from the host to the device
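
The lookup order for the runtime compilation cache location can be sketched as follows. This is a simplified Python model, not rocFFT's implementation: the function name `cache_base_dir` and the `platform` strings are hypothetical, and what rocFFT appends below the chosen directory, or does when no variable is set, is not covered here.

```python
from typing import Mapping, Optional


def cache_base_dir(env: Mapping[str, str], platform: str) -> Optional[str]:
    """Return the first non-empty candidate directory, or None if none is set.

    `platform` is "linux" or "windows" (labels assumed for this sketch).
    The platform-specific variable is checked first, then HOME.
    """
    preferred = {"linux": "XDG_CACHE_HOME", "windows": "LOCALAPPDATA"}.get(platform)
    for name in (preferred, "HOME"):
        if name and env.get(name):
            return env[name]
    return None


with_xdg = cache_base_dir({"XDG_CACHE_HOME": "/home/u/.cache", "HOME": "/home/u"}, "linux")
home_only = cache_base_dir({"HOME": "/home/u"}, "linux")
print(with_xdg)
print(home_only)
```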
### Optimizations
* Optimized 2D R2C and C2R to use 2-kernel plans where possible
* Improved performance of the Bluestein algorithm
* Optimized sbcc-168 and 100 by using half-LDS
* Optimized length-280 2D and 3D transforms
* Added kernels for factorizable 1D lengths < 128
### Fixes
* Fixed occasional failures to parallelize runtime compilation of kernels (failures would be retried
serially and ultimately succeed, but this would take extra time)
* Fixed failures of some R2C 3D transforms that use the unsupported `TILE_UNALIGNED` SBRC kernels
(an example is 98^3 R2C out-of-place)
* Fixed bugs in the `SBRC_ERC` type
## rocFFT 1.0.17 for ROCm 5.2.0
### Additions
* Added packages for test and benchmark executables on all supported operating systems using CPack
* Added file and folder reorganization changes, with backward compatibility support, using
`rocm-cmake` wrapper functions
### Changes
* Improved reuse of twiddle memory between plans
* Set a default load/store callback when only one callback type is set via the API (for improved
performance)
* Updated the GoogleTest dependency to version 1.11
### Optimizations
* Introduced a new (non-linear) LDS access pattern and applied it to SBCC kernels of length 64 and 81
for a performance improvement
* Applied `lds-non-linear`, `direct-load-to-register`, and `direct-store-from-register` on sbcr kernels for
a performance improvement
### Fixes
* Fixed the correctness of certain transforms with unusual strides
* Fixed incorrect handling of user-specified streams for runtime-compiled kernels
* Fixed incorrect buffer allocation in `rocfft-test` on in-place transforms with different input and output sizes
## rocFFT 1.0.16 for ROCm 5.1.0
### Changes
* Added support for unaligned tile dimensions for `SBRC_2D` kernels
* Improved test and benchmark infrastructure by adding RAII
* Enabled runtime compilation of length-2304 FFT kernel during plan creation
* Added tokenizer for test suite
* Reduced twiddle memory requirements for even-length, real-complex transforms
* Clients can now be built separately from the main library
### Optimizations
* Optimized more large 1D cases by using `L1D_CC` plan
* Optimized the 3D 200^3 C2R case
* Optimized the 1D 2^30 double-precision case on MI200
* Added padding to work buffer sizes to improve performance in many cases
### Fixes
* Fixed the correctness of some R2C transforms with unusual strides
### Removals
* The hipFFT API (header) has been removed; use the
[hipFFT](https://github.com/ROCmSoftwarePlatform/hipFFT) package or repository to obtain the API
## rocFFT 1.0.15 for ROCm 5.0.0
### Changes
* Enabled runtime compilation of single FFT kernels > length 1024
* Re-aligned the split device library into four roughly equal libraries
* Implemented the FuseShim framework to replace the original OptimizePlan
* Implemented the generic buffer-assignment framework
* The buffer assignment is no longer performed by each node--we designed a generic algorithm to
test and pick the best assignment path
* With the help of FuseShim, we can achieve the most kernel-fusions possible
* Even-length complex-to-real transforms no longer read the imaginary part of the DC and Nyquist
modes
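
One way to see why the imaginary parts of these two modes carry no information: for a real signal of even length N, the DC bin (index 0) and the Nyquist bin (index N/2) of its DFT are purely real. The Python sketch below checks this numerically with a naive DFT. It is an illustration of the underlying math only, not of rocFFT's kernels, and the function name `forward_dft` is hypothetical.

```python
import cmath
from typing import List


def forward_dft(x: List[float]) -> List[complex]:
    """Naive O(N^2) forward DFT of a real sequence."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


real_signal = [0.5, -1.25, 3.0, 2.0, -0.75, 1.5, 4.0, -2.0]  # even length N = 8
n = len(real_signal)
spectrum = forward_dft(real_signal)
dc, nyquist = spectrum[0], spectrum[n // 2]

# Both imaginary parts are zero up to floating-point rounding.
print(abs(dc.imag), abs(nyquist.imag))
```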
### Optimizations
* Optimized twiddle conjugation; complex-to-complex inverse transforms should now have similar
performance to forward transforms
* Improved performance of single-kernel, small 2D transforms
## rocFFT 1.0.14 for ROCm 4.5.0
### Optimizations
* Optimized SBCC kernels of lengths 52, 60, 72, 80, 84, 96, 104, 108, 112, 160, 168, 208, 216, 224, and
240 with a new kernel generator
### Additions
* Added support for Windows 10 as a build target
### Changes
* Packaging has been split into a runtime package (`rocfft`) and a development package
(`rocfft-devel`):
The development package depends on the runtime package. When you install the runtime package,
the package manager suggests installing the development package, to help users transition from
the previous version's combined package. This suggestion applies on all supported operating
systems except CentOS 7. The `suggestion` feature in the runtime package is introduced as a
deprecated feature and will be removed in a future ROCm release.
### Fixes
* Fixed validation failures for even-length R2C in-place 2D and 3D cubic sizes, such as 100^2 (or ^3),
200^2 (or ^3), and 256^2 (or ^3)
* rocFFT now combines two kernels (`r2c-transpose`) instead of three kernels
(`stockham-r2c-transpose`)
### Changes
* Split 2D device code into separate libraries
## rocFFT 1.0.13 for ROCm 4.4.0
### Optimizations
* Improved plans by removing unnecessary transpose steps
* Optimized scheme selection for 3D problems
* Imposed fewer restrictions on `3D_BLOCK_RC` selection (more problems can use `3D_BLOCK_RC` and
have performance gains)
* Enabled `3D_RC`; some 3D problems with SBCC-supported z-dim can use fewer kernels to get
benefits
* Forced `--length` 336 336 56 (dp) to use faster `3D_RC` to prevent it from being skipped by a
conservative threshold test
* Optimized some even-length R2C/C2R cases by doing more in-place operations and combining
pre- and post-processing into Stockham kernels
* Added radix-17
### Additions
* Added a new kernel generator for select fused 2D transforms
### Fixes
* Improved large 1D transform decompositions
## rocFFT 1.0.12 for ROCm 4.3.0
### Changes
* Re-split device code into single-precision, double-precision, and miscellaneous kernels
### Fixes
* Fixed potential crashes in double-precision planar->planar transpose
* Fixed potential crashes in 3D transforms with unusual strides for SBCC-optimized sizes
* Improved buffer placement logic
### Additions
* Added a new kernel generator for select lengths; new kernels have improved performance
* Added public `rocfft_execution_info_set_load_callback` and `rocfft_execution_info_set_store_callback`
API functions to allow running extra logic when loading data from and storing data to global
memory during a transform
### Removals
* Removed R2C pair schemes and kernels
### Optimizations
* Optimized 2D and 3D R2C 100 and 1D Z2Z 2500
* Reduced number of kernels for 2D/3D sizes where higher dimension is 64, 128, 256
## rocFFT 1.0.11 for ROCm 4.2.0
### Changes
* Moved device code into the main library
### Optimizations
* Improved performance for single-precision kernels exercising all except radix-2/7 butterfly ops
* Minor optimization for C2R 3D 100 and 200 cube sizes
* Optimized some C2C and R2C 3D 64, 81, 100, 128, 200, and 256 rectangular sizes
* When factoring, test to see if the remaining length is explicitly supported
* Explicitly added radix-7 lengths 14, 21, and 224 to list of supported lengths
* Optimized R2C 2D and 3D 128, 200, and 256 cube sizes
### Fixes
* Fixed potential crashes in small 3D transforms with unusual strides
([issue 311](https://github.com/ROCmSoftwarePlatform/rocFFT/issues/311))
* Fixed potential crashes when running transforms on multiple devices
([issue 310](https://github.com/ROCmSoftwarePlatform/rocFFT/issues/310))
## rocFFT 1.0.10 for ROCm 4.1.0
### Additions
* Explicitly specified `MAX_THREADS_PER_BLOCK` through `__launch_bounds__` for all kernels
* Switched to a new syntax for specifying AMD GPU architecture names and features
### Optimizations
* Optimized C2C and R2C 3D 64, 81, 100, 128, 200, and 256 cube sizes
* Improved the performance of the standalone out-of-place transpose kernel
* Optimized the 1D length 40000 C2C case
* Enabled radix-7 for size 336
* Added radix-11 and radix-13 kernels, which are used in length-11 and length-13 transforms (and
some of their multiples)
### Changes
* rocFFT now automatically allocates a work buffer if the plan requires one and none is provided
* An explicit `rocfft_status_invalid_work_buffer` error is now returned when a work buffer of insufficient
size is provided
* Updated online documentation
* Updated the Debian package name to separate the name and version with an underscore ( _ )
* Adjusted accuracy test tolerances and how they are compared
### Fixes
* Fixed a 4x4x8192 accuracy failure
## rocFFT 1.0.8 for ROCm 3.10.0
### Optimizations
* Optimized the 1D length 10000 C2C case
### Changes
* Added the `BUILD_CLIENTS_ALL` CMake option
### Fixes
* Fixed the correctness of SBCC and SBRC kernels with non-unit strides
* Fixed fused C2R kernel when a Bluestein transform follows it
## rocFFT 1.0.7 for ROCm 3.9.0
### Optimizations
* Added R2C and C2R fused kernels that combine pre- and post-processing steps with transpose
* Enabled diagonal transpose for 1D and 2D power-of-2 cases
* Added single kernels for small power-of-2, power-of-3, and power-of-5 sizes
* Added more radix-7 kernels
### Changes
* Explicitly disabled XNACK and SRAM-ECC features on AMDGPU hardware
### Fixes
* Fixed 2D C2R transform with length 1 on one dimension
* Fixed a potential thread unsafety in logging
## rocFFT 1.0.6 for ROCm 3.8.0
### Optimizations
* Improved the performance of 1D batch-paired R2C transforms of odd length
* Added some radix-7 kernels
* Improved the performance for 1D length 6561 and 10000
* Improved the performance for certain 2D transform sizes
### Changes
* Allowed a static library build with `BUILD_SHARED_LIBS=OFF` CMake option
* Updated GoogleTest dependency to version 1.10
### Fixes
* Fixed the correctness of certain large 2D sizes
## rocFFT 1.0.5 for ROCm 3.7.0
### Optimizations
* Optimized C2C power-of-2 middle sizes
### Changes
* Parallelized work in unit tests and eliminated duplicate cases
### Fixes
* Fixed the correctness of certain large 1D sizes, and of 2D power-of-3 and power-of-5 sizes
* Fixed incorrect buffer assignment for some even-length R2C transforms
* Fixed `<cstddef>` inclusion on C compilers
* Fixed incorrect results on non-unit strides with SBCC/SBRC kernels