Optimise `getValueOfBits` and `insertBits` #717

While watching the highly anticipated WatchMojo [Top 10 Craziest Assembly Language Instructions](https://www.youtube.com/watch?v=Wz_xJPN7lAY) (I swear, while brushing my teeth), I learned of the [`PEXT`](https://www.felixcloutier.com/x86/pext) and [`PDEP`](https://www.felixcloutier.com/x86/pdep) assembly instructions. Each performs, in a single instruction, precisely what the [`getValueOfBits()`](https://github.com/QuEST-Kit/QuEST/blob/9d7618d7263e3bfba433b88cf1eac0647f08fa0a/quest/src/core/bitwise.hpp#L186-L195) and [`insertBits()`](https://github.com/QuEST-Kit/QuEST/blob/9d7618d7263e3bfba433b88cf1eac0647f08fa0a/quest/src/core/bitwise.hpp#L164-L171) functions respectively perform with a loop. Together, these two functions appear in about 45 hot loops or kernels throughout the code base.
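To make the correspondence concrete, here is a minimal software sketch of what the two instructions compute, following their documented semantics. The names `pextReference` and `pdepReference` are hypothetical (they are not QuEST functions), and the mapping onto `getValueOfBits()` and `insertBits()` assumes the bit indices are given in increasing order, since `PEXT` always packs the selected bits in ascending position.

```cpp
#include <cstdint>

// Software model of PEXT: gather the bits of `src` at the positions set in
// `mask`, and pack them contiguously into the low bits of the result.
// (Hypothetical helper for illustration; not part of QuEST.)
inline std::uint64_t pextReference(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t out = 0;
    int outPos = 0;
    for (int pos = 0; pos < 64; ++pos) {
        if ((mask >> pos) & 1ULL) {
            out |= ((src >> pos) & 1ULL) << outPos;
            ++outPos;
        }
    }
    return out;
}

// Software model of PDEP: scatter the low bits of `src` to the positions set
// in `mask`; every result bit outside `mask` is zero.
// (Hypothetical helper for illustration; not part of QuEST.)
inline std::uint64_t pdepReference(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t out = 0;
    int srcPos = 0;
    for (int pos = 0; pos < 64; ++pos) {
        if ((mask >> pos) & 1ULL) {
            out |= ((src >> srcPos) & 1ULL) << pos;
            ++srcPos;
        }
    }
    return out;
}
```

With a mask marking bits 1 to 3, `pextReference(0b101101, 0b001110)` gathers those bits into `0b110`. Depositing into the complement of a mask leaves zeros at the masked positions, so `pdepReference(0b111, ~0b010ULL)` spreads `0b111` around a zero at index 1 to give `0b1101`, which is the kind of result a zero-bit insertion is expected to produce.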

For specific compilers and CPU architectures (e.g. Intel and AMD x86-64 CPUs supporting the BMI2 extension), it _may_ be worthwhile to overload these bitwise functions to make use of the `PEXT` and `PDEP` compiler intrinsics (e.g. `_pdep_u64`). The likelihood of a speedup isn't _great_, because:
 - Non-trivial simulation is memory-bandwidth-bound, potentially occluding the speedup.
 - The existing looped routines are already unrolled when they involve 5 or fewer iterations (via tricks like [this](https://github.com/QuEST-Kit/QuEST/blob/9d7618d7263e3bfba433b88cf1eac0647f08fa0a/quest/src/cpu/cpu_subroutines.cpp#L706-L708)), which is the expected majority use-case, so they often pay no loop penalty.

Furthermore, a minor speedup might be outweighed by the additional portability nuisances of using non-standard intrinsics. It _should_ be easy in principle to add macro guards within `getValueOfBits` and `insertBitsWithMaskedValues`, but those are famous last words! I note that the intrinsics are for _unsigned_ integers, whereas the user-facing `qindex` is _signed_, but I _think_ that presents no issue.
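A rough sketch of what such macro guards might look like is below. Everything here is an assumption for illustration: `qindex` is taken to be a signed 64-bit integer, the helpers `extractBitsByMask` and `insertZeroBitsByMask` are hypothetical (they take a precomputed bit mask rather than QuEST's index lists), and the `__BMI2__` check only covers GCC and Clang, which define it when BMI2 is enabled (e.g. via `-mbmi2`). The signed/unsigned mismatch is handled with explicit casts, which is harmless provided the indices are non-negative.

```cpp
#include <cstdint>

#if defined(__BMI2__)
    #include <immintrin.h>  // _pext_u64, _pdep_u64
#endif

// Assumption: stand-in for QuEST's signed index type.
using qindex = long long;

// Hypothetical helper: pack the bits of `number` selected by `mask` into the
// low bits of the result (ascending order of bit position).
inline qindex extractBitsByMask(qindex number, qindex mask) {
    auto src = static_cast<std::uint64_t>(number);
    auto msk = static_cast<std::uint64_t>(mask);
#if defined(__BMI2__)
    return static_cast<qindex>(_pext_u64(src, msk));
#else
    // Portable fallback: visit each set bit of the mask, lowest first.
    std::uint64_t out = 0;
    for (std::uint64_t bit = 1; msk != 0; bit <<= 1) {
        std::uint64_t lowest = msk & (~msk + 1);  // isolate lowest set bit
        if (src & lowest) out |= bit;
        msk &= msk - 1;                           // clear lowest set bit
    }
    return static_cast<qindex>(out);
#endif
}

// Hypothetical helper: spread the bits of `number` so that a zero sits at
// every position set in `insertMask`, shifting the other bits upward.
inline qindex insertZeroBitsByMask(qindex number, qindex insertMask) {
    auto src = static_cast<std::uint64_t>(number);
    auto msk = ~static_cast<std::uint64_t>(insertMask);  // positions to fill
#if defined(__BMI2__)
    return static_cast<qindex>(_pdep_u64(src, msk));
#else
    // Portable fallback: deposit successive bits of src into the set bits of msk.
    std::uint64_t out = 0;
    for (std::uint64_t bit = 1; msk != 0; bit <<= 1) {
        std::uint64_t lowest = msk & (~msk + 1);  // isolate lowest set bit
        if (src & bit) out |= lowest;
        msk &= msk - 1;                           // clear lowest set bit
    }
    return static_cast<qindex>(out);
#endif
}
```

A variant which also sets the inserted bits to chosen values could presumably OR a value mask onto the result of `insertZeroBitsByMask`. Both branches return the same result, so callers would not need to know which path was compiled.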

The intrinsics _may_ present a detectable benefit for big operations (e.g. `6`-qubit general matrices) upon small systems (e.g. `<10` qubits), on candidate hardware. There may also be analogous instructions for other architectures, and possibly GPUs!
