Releases: ashvardanian/less_slow.cpp

v0.3: Gather 🔄 Scatter

20 Jan 09:58

This release introduces benchmarks for the rarely used SIMD gather & scatter instructions, which can accelerate lookups by ~30% on current x86 and Arm machines. It covers three implementations:

  • Serial
  • AVX-512 for x86
  • SVE for Arm
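
For readers unfamiliar with the two operations, here is a minimal sketch of what a serial baseline might look like. The function names `gather_serial` and `scatter_serial` are hypothetical and are not taken from the repository. A gather reads `table[indices[i]]` into consecutive outputs, and a scatter writes consecutive inputs to `table[indices[i]]`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper names for illustration; not the repository's API.

// Gather: out[i] = table[indices[i]] for every i.
std::vector<float> gather_serial(std::vector<float> const &table,
                                 std::vector<std::uint32_t> const &indices) {
    std::vector<float> out(indices.size());
    for (std::size_t i = 0; i != indices.size(); ++i) {
        assert(indices[i] < table.size()); // every index must be in bounds
        out[i] = table[indices[i]];
    }
    return out;
}

// Scatter: table[indices[i]] = values[i] for every i.
// If an index repeats, the last write wins in this serial version.
void scatter_serial(std::vector<float> &table,
                    std::vector<std::uint32_t> const &indices,
                    std::vector<float> const &values) {
    assert(indices.size() == values.size());
    for (std::size_t i = 0; i != indices.size(); ++i) {
        assert(indices[i] < table.size()); // every index must be in bounds
        table[indices[i]] = values[i];
    }
}
```

The AVX-512 and SVE variants replace the inner loop with hardware gather/scatter instructions (on x86, for example, through intrinsics such as `_mm512_i32gather_ps`), processing several indices per instruction instead of one.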

Minor

  • Add: SVE gather/scatter (107b359)
  • Add: Serial & AVX-512 scatter/gather (089cfa0)

Patch

  • Improve: Timing SVE (daa55f5)
  • Improve: Stabilize gather timings (3fca991)

v0.2: Pushing FLOPS in Assembly 🏋️‍♂️

20 Jan 09:57

Release: v0.2.0 [skip ci]

Minor

  • Add: Latency Hiding & Port Interleaving (086f8d7)
  • Add: AMX kernels (0cb024d)
  • Add: Inline Assembly kernels (89095a6)
  • Add: BLAS & Eigen TOPs benchmarks (28ca39b)
  • Add: AVX2 & low-precision AVX-512 TOPS (0a48108)
  • Add: i8, f16, and bf16 kernels (3f54200)
  • Add: Arm NEON FMAs (d0e521e)
  • Add: vfmadd231ps kernels (7ca3161)
  • Add: Assembly micro-kernels (2e71e76)

Patch

  • Docs: Zen4 matmul-benchmarks (2476310)
  • Docs: H100 Tensor Cores vs Intel (fa86663)
  • Fix: Illegal instruction for AMX (a7243dd)
  • Fix: Duplicate .global symbols (c732234)
  • Docs: Recommended Eigen macros (7be2d58)
  • Fix: Missing tops_u8_neon (d97bbfc)
  • Fix: Missing tops_f64_neon (4afa7e3)
  • Improve: Shorter TOPS names (be0c94b)

Release v0.1.1

17 Jan 20:08

Release: v0.1.1 [skip ci]

Patch

  • Docs: Renaming small_string benchmark (1ad50d1)
  • Improve: Validate sorting result (77808a6)
  • Improve: Log bytes/sec for trigonometry (ebc67d6)
  • Docs: Placeholders (84e7710)
  • Improve: BENCHMARK_CAPTURE w/out template (d8fb261)