In PhastFT, for smaller sizes I'm calling dispatch! three times per FFT operation on 512 bytes of data (a 64-long batch of f64), and it is degrading performance: run time goes up by 25% (throughput drops by 20%), measured as of commit https://github.com/QuState/PhastFT/tree/e5fcd61f3d540fcef9f8d60173dbfbe777c02e40
Meanwhile RustFFT with its handwritten dispatch does not suffer any penalty at all, and in fact is slightly slower under -C target-cpu=native than it is under its regular dynamic dispatch.
This overhead needs to be removed for code based on fearless_simd to be competitive with handwritten dynamic dispatch.
perf diff and profiling with samply both point to these dispatch! calls as a major source of slowdown: https://github.com/QuState/PhastFT/blob/c7ea3d7aef474e53233834354364fa50bbb0ba6e/src/algorithms/dit.rs#L259-L260
Profile with -C target-cpu=x86-64-v3: https://share.firefox.dev/3LMqjuI
Profile with dynamic dispatch: https://share.firefox.dev/3NTwJZw
I'm not sure what the cause is. I wouldn't expect a handful of perfectly predictable branches to tank performance. Perhaps dispatch! results in suboptimal codegen, or perhaps I'm just pushing the boundaries of dynamic dispatch and need a facility to get a function pointer and store it in a struct for reuse, instead of just reusing a cached Level.
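To make the function-pointer idea concrete, here is a rough sketch of what I have in mind. It uses only std runtime feature detection, not fearless_simd, and every name in it (`Kernel`, `Plan`, `select_kernel`, the `kernel_*` functions) is hypothetical; the doubling loop is a placeholder for a real FFT stage. The point is only that the feature check runs once at construction, and the hot path is a single indirect call.

```rust
// Hypothetical sketch: none of these names exist in fearless_simd or PhastFT.
type Kernel = fn(&mut [f64]);

// Portable fallback; the loop body stands in for a real FFT stage.
fn kernel_scalar(data: &mut [f64]) {
    for x in data.iter_mut() {
        *x *= 2.0;
    }
}

// Same body, compiled with AVX2 enabled so the compiler may vectorize it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn kernel_avx2_impl(data: &mut [f64]) {
    for x in data.iter_mut() {
        *x *= 2.0;
    }
}

// Safe wrapper so the AVX2 version fits the plain `fn` pointer type.
#[cfg(target_arch = "x86_64")]
fn kernel_avx2(data: &mut [f64]) {
    // SAFETY: this function is only selected after AVX2 was detected at runtime.
    unsafe { kernel_avx2_impl(data) }
}

// Runtime detection happens here, once, rather than on every call.
#[cfg(target_arch = "x86_64")]
fn select_kernel() -> Kernel {
    if std::arch::is_x86_feature_detected!("avx2") {
        kernel_avx2
    } else {
        kernel_scalar
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn select_kernel() -> Kernel {
    kernel_scalar
}

// The struct stores the chosen function pointer for reuse.
struct Plan {
    kernel: Kernel,
}

impl Plan {
    fn new() -> Self {
        Plan { kernel: select_kernel() }
    }

    // Hot path: one indirect call, no feature checks.
    fn run(&self, data: &mut [f64]) {
        (self.kernel)(data)
    }
}

fn main() {
    let plan = Plan::new();
    let mut batch = vec![1.0_f64; 64]; // 64 f64 values = 512 bytes
    plan.run(&mut batch);
    println!("first element after run: {}", batch[0]);
}
```

Whether this would actually beat a cached Level plus dispatch! is exactly what I don't know; the sketch only shows the shape of the facility I'm asking about.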