Hi,
I have written a tutorial on how to port a static implicit B+-Tree described in this algorithmica.org article to Google Highway. You can find the writeup in this repository:
https://github.com/henrixapp/static-btree-highway
I tried to benchmark the btree against std::lower_bound by using hwy::MeasureClosure for varying size of N inputs, N between 32 and 2^26.
The interface for this is size_t->size_t. To make it work with more datatypes (the btree also works with floats), I just generated 10k queries (in my datatype) and answered them in a for-loop and meassured the performance (a single element had not enough overhead to measure).
However, that gave me insane performance for N>2^21 on 64 bit values, jumping to a relative speed up of >80. (note the log-scale)

For 32-bit values the bump occurs for N>2^22.
When running under perf I could see that the std::lower_bound code would use 0.08 instructions per cycle, while the btree would use 0.6.
I tried using 5 or 10 different queues of queries, but that also resulted in these large speedups.
In the algorithmica.org article they mention to chain the queries, so that no query can be run without finishing the other.
So that brought me to this idea to prevent reordering of runs by calling hwy::Unpredictable1():
size_t elems = queries[i].size();
for (size_t j = 0; j < elems; j++) {
last = instance.lower_bound(queries[i][j * hwy::Unpredictable1()]);
Mask ^= last;
}
The numbers look more realistic:

Nevertheless, there is still a bump around the L3 cache size of the processor and I am fearing that I only added some overhead in the form of a call to the clock ([in the implementation of Unpredictable1](https://github.com/google/highway/blob/master/hwy/nanobenchmark.cc#L240)).
So what would be the recommend way of measuring speed ups/benchmarking with nanobenchmark.h or should I be using googlebenchmark on bigger sized inputs?
Best
Henrik
Hi,
I have written a tutorial on how to port a static implicit B+-Tree described in this algorithmica.org article to Google Highway. You can find the writeup in this repository:
https://github.com/henrixapp/static-btree-highway
I tried to benchmark the btree against
std::lower_boundby usinghwy::MeasureClosurefor varying size ofNinputs,Nbetween 32 and 2^26.The interface for this is
size_t->size_t. To make it work with more datatypes (the btree also works with floats), I just generated 10k queries (in my datatype) and answered them in a for-loop and meassured the performance (a single element had not enough overhead to measure).However, that gave me insane performance for N>2^21 on 64 bit values, jumping to a relative speed up of >80. (note the log-scale)
When running under perf I could see that the std::lower_bound code would use 0.08 instructions per cycle, while the btree would use 0.6.
I tried using 5 or 10 different queues of queries, but that also resulted in these large speedups.
In the algorithmica.org article they mention to chain the queries, so that no query can be run without finishing the other.
So that brought me to this idea to prevent reordering of runs by calling
hwy::Unpredictable1():The numbers look more realistic:
So what would be the recommend way of measuring speed ups/benchmarking with nanobenchmark.h or should I be using googlebenchmark on bigger sized inputs?
Best
Henrik