WIP : Adding an implementation based on rocfft. #389
|
That's quite a saving @PaulMullowney, thanks for taking this forward. I will try to reproduce your findings on LUMI and AAC7 myself, then start having a look at your code. |
|
Here are some initial benchmarks on LUMI-G (MI250X):
So indeed, the first step (which does FFT planning) is significantly accelerated with rocFFT. Unfortunately, the bigger runs took so long on this step that the timing overflowed the print format statement... I will try to repeat the runs with a longer format. Not sure yet what happened with the 1279 crash on develop. |
|
@PaulMullowney - could you rebase on develop and then force push your branch back to GH? |
|
Getting a strange warning on LUMI, for both hipFFT and rocFFT: Doesn't seem to have any serious consequences. |
Yes. I will try to do this today. |
That's a new one. Hmmm |
ba9ff2d to e7fa8b6
|
I rebased against develop and pushed |
|
Updated results on LUMI-G:
|
e7fa8b6 to f747f1c
Co-authored-by: Sam Hatfield <samuel.hatfield@ecmwf.int>
This branch adds a rocFFT-based implementation. Unlike hipFFT, rocFFT distinguishes between in-place and out-of-place FFTs. Using rocFFT considerably reduces the number of FFT kernels that need to be JIT-compiled during plan creation. This is measured by setting the ROCFFT_RTC_CACHE_PATH environment variable so that the JIT-compiled kernels are saved in a database.
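For anyone wanting to repeat the measurement, here is a minimal sketch of one way to do it. It is not the script used for the numbers in this PR. The helper names `run_with_rtc_cache` and `count_cache_rows` are hypothetical, and the sketch assumes the file written to ROCFFT_RTC_CACHE_PATH is an SQLite database. It does not assume any particular table layout, so it simply reports the row count of every table it finds.

```python
import os
import sqlite3
import subprocess
from contextlib import closing


def run_with_rtc_cache(command, cache_path):
    """Run `command` with ROCFFT_RTC_CACHE_PATH set to `cache_path`.

    `command` is an argument list for the benchmark binary (hypothetical here).
    """
    env = dict(os.environ, ROCFFT_RTC_CACHE_PATH=str(cache_path))
    return subprocess.run(command, env=env, check=True)


def count_cache_rows(cache_path):
    """Return {table_name: row_count} for every table in the cache file.

    Assumption: the cache file is an SQLite database. The table names are
    discovered at run time rather than hard-coded.
    """
    with closing(sqlite3.connect(cache_path)) as conn:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'"
            )
        ]
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }


# Example usage (the binary name and its arguments are placeholders):
#   run_with_rtc_cache(["./my-benchmark", "--some-args"], "/tmp/rocfft_cache.db")
#   print(count_cache_rows("/tmp/rocfft_cache.db"))
```

Starting each run with a fresh cache file and comparing the row counts between the hipFFT and rocFFT builds should give a rough proxy for how many kernels each one compiled, provided the assumption about the file format holds.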