How to (nano-)benchmark correctly

Hi,

I have written a tutorial on how to port a static implicit B+-Tree described in this [algorithmica.org article](https://en.algorithmica.org/hpc/data-structures/s-tree/#construction-1) to Google Highway. You can find the writeup in this repository:

https://github.com/henrixapp/static-btree-highway

I tried to benchmark the btree against `std::lower_bound` using `hwy::MeasureClosure` for input sizes `N` between 32 and 2^26.
The interface for this is `size_t->size_t`. To make it work with more data types (the btree also works with floats), I generated 10k queries (in my data type), answered them in a for-loop, and measured the total time (a single query was too short to measure reliably).
However, that gave me implausibly high performance for N>2^21 on 64-bit values, with the relative speedup jumping to >80 (note the log scale).

<img width="3600" height="1800" alt="Image" src="https://github.com/user-attachments/assets/c3f2bd16-d0c4-432c-9364-f118cff48c35" />
For 32-bit values the bump occurs for N>2^22.

<img width="3600" height="1800" alt="Image" src="https://github.com/user-attachments/assets/818fed2d-0b4b-4c26-a27e-687f0772b93d" />

When running under perf, I could see that the `std::lower_bound` code achieved 0.08 instructions per cycle, while the btree achieved 0.6.

I tried using 5 or 10 different queues of queries, but that also resulted in these large speedups.

In the algorithmica.org article they recommend chaining the queries, so that no query can start before the previous one has finished.
That brought me to the idea of preventing reordering of runs by calling `hwy::Unpredictable1()`:
```cpp
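    size_t elems = queries[i].size();
    for (size_t j = 0; j < elems; j++) {
      // Unpredictable1() returns 1 at runtime, but the compiler cannot know that.
      last = instance.lower_bound(queries[i][j * hwy::Unpredictable1()]);
      // Accumulate the results so the queries are not optimized away.
      Mask ^= last;
    }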
```
The numbers look more realistic:

<img width="3600" height="1800" alt="Image" src="https://github.com/user-attachments/assets/c619115a-b181-4812-b19b-1defb74fd83c" />
Nevertheless, there is still a bump around the L3 cache size of the processor, and I fear that I have only added some overhead in the form of a call to the clock ([in the implementation of Unpredictable1](https://github.com/google/highway/blob/master/hwy/nanobenchmark.cc#L240)).
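For comparison, here is a minimal sketch of chaining queries through a data dependency instead of a per-iteration call, using only the standard library. The names `RunChained` and `NanosPerQuery` are hypothetical helpers, not part of Highway or the repository, and the index formula `(j + last) % queries.size()` is just one possible way to make each query depend on the previous result; it is not necessarily what the algorithmica.org article does. Note that a loop like this measures the latency of dependent queries rather than the throughput of independent ones.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: answers queries.size() lookups with std::lower_bound.
// Which query is issued next depends on the previous result, so a lookup
// cannot start before the previous one has finished.
// Returns the XOR of all result positions so the work is not optimized away.
inline uint64_t RunChained(const std::vector<uint64_t>& sorted,
                           const std::vector<uint64_t>& queries) {
  assert(!queries.empty());
  uint64_t checksum = 0;
  size_t last = 0;  // position returned by the previous lookup
  for (size_t j = 0; j < queries.size(); ++j) {
    // The index is only known once `last` has been computed.
    const size_t idx = (j + last) % queries.size();
    const auto it =
        std::lower_bound(sorted.begin(), sorted.end(), queries[idx]);
    last = static_cast<size_t>(it - sorted.begin());
    checksum ^= last;
  }
  return checksum;
}

// Hypothetical helper: times one call of `run` and returns the average
// number of nanoseconds per query. The checksum returned by `run` is
// stored in *checksum so the caller can verify or print it.
template <typename Func>
double NanosPerQuery(const Func& run, size_t num_queries, uint64_t* checksum) {
  const auto t0 = std::chrono::steady_clock::now();
  *checksum = run();
  const auto t1 = std::chrono::steady_clock::now();
  const double nanos = std::chrono::duration<double, std::nano>(t1 - t0).count();
  return nanos / static_cast<double>(num_queries);
}
```

If both implementations return the same positions, they visit the same sequence of queries, so the comparison stays fair. The same loop body could be used with `instance.lower_bound` in place of `std::lower_bound`.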

So what would be the recommended way of measuring speedups with nanobenchmark.h, or should I be using Google Benchmark on larger inputs?

Best
Henrik
