Description
While watching the highly anticipated WatchMojo Top 10 Craziest Assembly Language Instructions (I swear while brushing my teeth), I learned of the PEXT and PDEP assembly instructions. These respectively perform, in a single instruction, precisely what the getValueOfBits() and insertBits() functions perform in a looped fashion. Together, these two functions appear in about 45 hot loops or kernels throughout the code base.
It may be worthwhile, for specific compilers and CPU architectures (e.g. Intel and AMD), to overload these bitwise functions to make use of the PEXT and PDEP compiler intrinsics (e.g. _pdep_u64). The likelihood of a speedup isn't great, because:
- Non-trivial simulation is memory-bandwidth-bound, potentially occluding the speedup.
- The existing looped routines are already unrolled when they involve 5 or fewer iterations (via tricks like this), which is the expected majority use-case, so they often pay no loop penalty.
Furthermore, a minor speedup might be outweighed by the additional portability nuisances of using non-standard intrinsics. It should in principle be easy to add macro guards within getValueOfBits and insertBitsWithMaskedValues, but those are famous last words! I note that the intrinsics operate on unsigned integers, whereas the user-facing qindex is signed, but I think that presents no issue.
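As a rough sketch of what such a macro guard could look like: the names `extractBitsSketch` and `depositBitsSketch`, the mask-based signatures, and the `qindex` alias below are all hypothetical stand-ins, not the code base's actual functions. The guard assumes a GCC/Clang-style `__BMI2__` macro (set by e.g. `-mbmi2`); other compilers would just take the looped fallback here.

```cpp
#include <cstdint>

// Only pull in the intrinsics header when the compiler advertises BMI2 on x86-64.
#if defined(__BMI2__) && defined(__x86_64__)
    #include <immintrin.h>
    #define SKETCH_HAS_PEXT_PDEP 1
#else
    #define SKETCH_HAS_PEXT_PDEP 0
#endif

// Assumption: a signed 64-bit stand-in for the user-facing qindex.
using qindex = long long;

// Gather the bits of num selected by mask into the low bits of the result (PEXT).
inline qindex extractBitsSketch(qindex num, std::uint64_t mask) {
    std::uint64_t src = static_cast<std::uint64_t>(num);
#if SKETCH_HAS_PEXT_PDEP
    return static_cast<qindex>(_pext_u64(src, mask));
#else
    // Portable fallback: visit each set bit of mask from lowest to highest.
    std::uint64_t out = 0;
    for (int k = 0; mask != 0; mask &= mask - 1, ++k) {
        std::uint64_t lowest = mask & (~mask + 1); // isolate lowest set bit
        if (src & lowest)
            out |= (std::uint64_t{1} << k);
    }
    return static_cast<qindex>(out);
#endif
}

// Scatter the low bits of value into the positions selected by mask (PDEP).
inline qindex depositBitsSketch(qindex value, std::uint64_t mask) {
    std::uint64_t src = static_cast<std::uint64_t>(value);
#if SKETCH_HAS_PEXT_PDEP
    return static_cast<qindex>(_pdep_u64(src, mask));
#else
    // Portable fallback: the k-th low bit of value lands on the k-th set bit of mask.
    std::uint64_t out = 0;
    for (int k = 0; mask != 0; mask &= mask - 1, ++k) {
        std::uint64_t lowest = mask & (~mask + 1); // isolate lowest set bit
        if ((src >> k) & 1)
            out |= lowest;
    }
    return static_cast<qindex>(out);
#endif
}
```

For example, `extractBitsSketch(0xB4, 0xF0)` gives 11 and `depositBitsSketch(11, 0xF0)` gives 176 on either path. The casts between the signed `qindex` and `std::uint64_t` are value-preserving for the non-negative indices of interest, which is why the signedness mismatch should be harmless.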
The intrinsics may present a detectable benefit for big operations (e.g. 6-qubit general matrices) upon small systems (e.g. <10 qubits), on candidate hardware. There may also be analogous instructions for other architectures, and possibly GPUs!