Releases · ashvardanian/less_slow.cpp · GitHub

10 Sep 14:51

ashvardanian

Float FMA vs Integer DP4A & DPX Instructions ☣️ Latest


CUDA natively supports Fused-Multiply-Accumulate operations for every float type, including f16 and bf16. It also provides DP4A instructions for 8-bit integer dot-products with 32-bit accumulators and umul24 instructions for 24-bit integer multiplication. Starting with Hopper, Dynamic Programming eXtensions (DPX) were added for combinatorial problems; they can be used to implement Algebraic Graph Theory algorithms as matrix multiplications over alternative semi-rings.
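To make the integer semantics concrete, here is a minimal CPU-side sketch in plain C++ of what the unsigned DP4A and 24-bit multiply operations compute. The function names `dp4a_u8_reference` and `umul24_reference` are hypothetical helpers written for illustration, not part of CUDA or of this repository; on the device the corresponding intrinsics are `__dp4a` and `__umul24`. The sketch assumes C++14 or newer.

```cpp
#include <cstdint>

// Hypothetical reference for the unsigned DP4A operation: `a` and `b` each
// pack four u8 lanes; the four lane-wise products are added to the 32-bit
// accumulator `c`, wrapping modulo 2^32 like any unsigned arithmetic.
constexpr std::uint32_t dp4a_u8_reference(std::uint32_t a, std::uint32_t b, std::uint32_t c) noexcept {
    for (int lane = 0; lane < 4; ++lane) {
        std::uint32_t a_lane = (a >> (lane * 8)) & 0xFFu;
        std::uint32_t b_lane = (b >> (lane * 8)) & 0xFFu;
        c += a_lane * b_lane;
    }
    return c;
}

// Hypothetical reference for the unsigned 24-bit multiply: only the low
// 24 bits of each input participate, and the low 32 bits of the product
// are returned.
constexpr std::uint32_t umul24_reference(std::uint32_t x, std::uint32_t y) noexcept {
    std::uint64_t product = std::uint64_t(x & 0xFFFFFFu) * std::uint64_t(y & 0xFFFFFFu);
    return static_cast<std::uint32_t>(product);
}
```

For example, `dp4a_u8_reference(0x04030201, 0x01010101, 10)` adds 1 + 2 + 3 + 4 to the accumulator and yields 20, while `umul24_reference(0xFF000003, 5)` ignores the top byte of the first argument and yields 15.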

How do those instructions stack up, and how much performance can we expect from recent State-of-the-Art GPUs like the Nvidia H200?

f64 FMA: 4.5 T
i64 FMA: 3.1 T
f32 FMA: 22 T
i32 FMA: 15.5 T ...so we should always prefer 32-bit ops
u8u32 DP4A: 39.3 T
u24u32 UMUL: 13.4 T ...not really better than i32 FMA
f16 FMA on Volta: 12.2 T
bf16 FMA on Ampere: 12.2 T
DPX for Floyd-Warshall algorithm with u16 and u32 on Hopper: 11 T
DPX for Needleman-Wunsch algorithm with i16 and i32 on Hopper: 11 T
DPX for Smith-Waterman algorithm with i32 on Hopper: 27 T
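The Floyd-Warshall entry above is an instance of the "alternative semi-ring" idea: replace multiply with add and add with min, and the usual triple loop computes shortest paths. Below is a minimal CPU-side sketch in plain C++ of that fused add-then-min pattern, which is the kind of operation DPX is meant to accelerate. The names `add_then_min`, `distance_matrix`, `floyd_warshall`, and `no_edge` are hypothetical and chosen for illustration; this is not the benchmark code from the repository. The sketch assumes C++14 or newer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical "infinity" marker: large, yet two of them can be added
// without overflowing a 32-bit unsigned integer.
constexpr std::uint32_t no_edge = 0x3FFFFFFFu;

// One step of the (min, +) semi-ring: min(a + b, c).
constexpr std::uint32_t add_then_min(std::uint32_t a, std::uint32_t b, std::uint32_t c) noexcept {
    return (a + b < c) ? (a + b) : c;
}

// Square matrix of pairwise distances; `cells[i][j]` is the cost of i -> j.
template <std::size_t n>
struct distance_matrix {
    std::uint32_t cells[n][n];
};

// Classic Floyd-Warshall: relax every pair (i, j) through each
// intermediate vertex k, updating the matrix in place.
template <std::size_t n>
constexpr distance_matrix<n> floyd_warshall(distance_matrix<n> d) noexcept {
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                d.cells[i][j] = add_then_min(d.cells[i][k], d.cells[k][j], d.cells[i][j]);
    return d;
}
```

With edges 0 -> 1 of cost 4, 1 -> 2 of cost 2, and 0 -> 2 of cost 11, the shortest distance from 0 to 2 comes out as 6, and unreachable pairs keep the `no_edge` value.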

Check the code and inline comments for more details!
Those goodies are now part of the "StringZilla 4 CUDA" release 🥳

Minor

Add: dp4a & umul24 instructions (ce1e3b7)
Add: DPX instructions on Hopper (1ab4f41)
Add: In-register FMA benchmarks for GPUs (97991fd)

Patch

Docs: FMA CUDA throughput (c00e421)
Fix: Initialize FMA inputs (22f52c4)
Improve: Naming variables (80e1d83)
Fix: bf16 requires Ampere (306ee3f)


13 Aug 21:39

ashvardanian

Release v0.10.12

Release: v0.10.12 [skip ci]

Patch

Improve: Bitwise ops for branches (c459c42)


12 Aug 12:01

ashvardanian

Release v0.10.11

Release: v0.10.11 [skip ci]

Patch

Improve: Parsing via simdjson (787f985)
Make: Bump dependencies (89e72b3)


12 Aug 11:21

ashvardanian

Release v0.10.10

Release: v0.10.10 [skip ci]

Patch

Improve: jmp vs cmov (c0a3b12)
Improve: Division via floats (8aa9921)


19 May 06:37

ashvardanian

Release v0.10.9

Release: v0.10.9 [skip ci]

Patch

Make: USE_BLAS option (a27448f)


22 Apr 11:25

ashvardanian

v0.10.8: MacOS compilation fixes 🤗 🍏

Docs: OpenBLAS installation on MacOS (be4a0be)
Fix: Missing const qualifiers in strided_ptr (9120723)
Fix: Can't std::format(time) on macOS (4d00aba)

Thanks to @ab-10 for spotting 🤗

Contributors

ab-10


20 Apr 19:30

ashvardanian

Release v0.10.7

Release: v0.10.7 [skip ci]

Patch

Improve: Include Asm tests into macOS Arm builds (#45) (ecff6e3)


20 Apr 11:04

ashvardanian

v0.10.6: Fixing aligned allocations

Docs: Notes on #pragma regions (ab7bf3f)
Fix: Aligned allocation (#42) (a66cfe2)

Thanks to @bmanga 🤗

Contributors

bmanga


19 Apr 08:36

ashvardanian

Release v0.10.5

Release: v0.10.5 [skip ci]

Patch

Docs: Intro (60e18d8)


18 Apr 22:18

ashvardanian

Release v0.10.4

Release: v0.10.4 [skip ci]

Patch

Improve: Detecting CUDA availability (21dfdf3)
