Releases: ashvardanian/less_slow.cpp
v0.3: Gather 🔄 Scatter
This release introduces benchmarks for the rarely used gather & scatter SIMD instructions, which can accelerate lookups by ~30% on current x86 and Arm machines. The benchmarks cover three variants:
- Serial
- AVX-512 for x86
- SVE for Arm
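For readers unfamiliar with the terms, here is a minimal sketch of what gather and scatter do, written as a serial baseline plus an optional AVX-512 gather. It is not the repository's benchmark code: the function names `gather_serial`, `scatter_serial`, and `gather_avx512` are hypothetical, and the sketch assumes 32-bit `float` elements with 32-bit indices that stay in bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Gather: read from scattered positions into a dense output.
// result[i] = data[indices[i]]
// Hypothetical helper, assumes every index is within the bounds of `data`.
inline void gather_serial(float const *data, std::uint32_t const *indices,
                          float *result, std::size_t count) {
    assert(count == 0 || (data && indices && result));
    for (std::size_t i = 0; i != count; ++i) result[i] = data[indices[i]];
}

// Scatter: write a dense input to scattered positions.
// data[indices[i]] = values[i]
// With duplicate indices, the last write wins in this serial version.
inline void scatter_serial(float *data, std::uint32_t const *indices,
                           float const *values, std::size_t count) {
    assert(count == 0 || (data && indices && values));
    for (std::size_t i = 0; i != count; ++i) data[indices[i]] = values[i];
}

#if defined(__AVX512F__)
// Same gather, 16 floats per step, compiled only when AVX-512F is enabled.
// The instruction treats indices as signed, so this assumes they fit in 31 bits.
inline void gather_avx512(float const *data, std::uint32_t const *indices,
                          float *result, std::size_t count) {
    std::size_t i = 0;
    for (; i + 16 <= count; i += 16) {
        __m512i idx = _mm512_loadu_si512(indices + i);
        // Scale of 4 bytes converts element indices into byte offsets.
        __m512 values = _mm512_i32gather_ps(idx, data, 4);
        _mm512_storeu_ps(result + i, values);
    }
    // Serial tail for the remaining elements.
    for (; i != count; ++i) result[i] = data[indices[i]];
}
#endif
```

The serial loops are the baseline any SIMD variant has to beat. Whether the vectorized path wins, and by how much, depends on the CPU and on how the indices are distributed, so it is worth measuring on the target machine.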
Minor
Patch
v0.2: Pushing FLOPS in Assembly 🏋️‍♂️
Release: v0.2.0 [skip ci]
Minor
- Add: Latency Hiding & Port Interleaving (086f8d7)
- Add: AMX kernels (0cb024d)
- Add: Inline Assembly kernels (89095a6)
- Add: BLAS & Eigen TOPs benchmarks (28ca39b)
- Add: AVX2 & low-precision AVX-512 TOPS (0a48108)
- Add: `i8`, `f16`, and `bf16` kernels (3f54200)
- Add: Arm NEON FMAs (d0e521e)
- Add: `vfmadd231ps` kernels (7ca3161)
- Add: Assembly micro-kernels (2e71e76)
Patch
- Docs: Zen4 matmul-benchmarks (2476310)
- Docs: H100 Tensor Cores vs Intel (fa86663)
- Fix: `Illegal instruction` for AMX (a7243dd)
- Fix: Duplicate `.global` symbols (c732234)
- Docs: Recommended Eigen macros (7be2d58)
- Fix: Missing `tops_u8_neon` (d97bbfc)
- Fix: Missing `tops_f64_neon` (4afa7e3)
- Improve: Shorter TOPS names (be0c94b)