Optimize lookup for any descendant leaf by declanvk · Pull Request #74 · declanvk/blart

declanvk · 2026-05-05T04:39:56Z

Instead of using minimum_unchecked to find an arbitrary descendant,
use a custom function that tries to quickly find any leaf of
a given subtree.

The default implementation for inner nodes is equivalent to the
minimum_unchecked function, but it is overridden for InnerNodeSorted
and InnerNodeIndirect. Both of these inner node types maintain
a compact array of child pointers, which makes it easy to select
a child node at random.

Rather than picking randomly, we're trying to find a leaf node as
quickly as possible. I'd previously read
https://brooker.co.za/blog/2012/01/17/two-random.html, which gave
me the idea to try a best-of-two strategy for looking for leaf node
child pointers. I couldn't say that this is the very best option,
only that it is easily proven correct (there are always at least two
child nodes in an inner node) and that later testing showed it was
better than a single choice.
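
As a rough illustration of the best-of-two idea, here is a minimal
sketch. The Node enum and pick_best_of_two are hypothetical and are not
blart's actual types or functions. Which two children the real code
inspects is not stated here, so the sketch assumes the first and last
entries of the compact child array.

```rust
// Hypothetical, simplified node type for illustration only;
// blart's real inner nodes and child pointers look different.
#[derive(Debug, PartialEq)]
enum Node {
    Leaf(u32),
    // Compact array of children (assumed to hold at least two).
    Inner(Vec<Node>),
}

/// Inspect two candidate children and prefer one that is a leaf.
/// Assumption: the candidates are the first and last array entries.
fn pick_best_of_two(children: &[Node]) -> &Node {
    assert!(children.len() >= 2, "inner nodes hold at least two children");
    let first = &children[0];
    let last = &children[children.len() - 1];
    match (first, last) {
        // Either candidate being a leaf ends the search immediately.
        (Node::Leaf(_), _) => first,
        (_, Node::Leaf(_)) => last,
        // Neither is a leaf: fall back to the minimum-style choice.
        _ => first,
    }
}
```

Checking two candidates costs one extra pointer inspection per inner
node, and the fallback keeps the behaviour of the minimum-based
descent when neither candidate is a leaf.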

I wanted to do this optimization when I started reworking the range
iterator a little while ago. Specifically, the range operation needs
to do a "full prefix" search (as opposed to a pessimistic/optimistic
prefix-based search), which requires searching for a leaf node if a
given inner node has implicit bytes in the stored prefix. "Full
prefix" searches also happen in the insert code path.
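
To make the need for a leaf concrete, here is a minimal sketch of a
"full prefix" comparison. It assumes that every leaf below an inner
node shares that node's prefix, so any descendant leaf key can stand in
for the implicit bytes. The function name and parameters are
hypothetical, not blart's API.

```rust
/// Compare `search_key` against the full prefix of an inner node,
/// reading the prefix bytes out of any descendant leaf's key.
///
/// `depth` is the number of key bytes consumed before this node and
/// `prefix_len` is the length of the node's prefix. Returns the offset
/// (within the prefix) of the first mismatching byte, or `None` if the
/// whole prefix matches.
fn full_prefix_mismatch(
    search_key: &[u8],
    leaf_key: &[u8],
    depth: usize,
    prefix_len: usize,
) -> Option<usize> {
    // `get` returns None past the end of a key, so a search key that
    // is too short is reported as a mismatch at that offset.
    (0..prefix_len).find(|&i| search_key.get(depth + i) != leaf_key.get(depth + i))
}
```

The comparison itself is cheap; the cost that matters is the walk down
to a leaf to obtain leaf_key in the first place.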

Looking at the "full prefix" search code, it occurred to me that using
the minimum as a way to find an arbitrary leaf node was a pretty good
option, but not the best. So I tried out the best-of-two approach and
realized that the overall improvement for most cases was marginal at
best. However, for one specific dataset and operation this was a huge
improvement:

iai_callgrind::bench_insert_group::bench_from_iter skewed:...
  Baselines:                               |9b221e2
  Instructions:                   211822957|512411353            (-58.6615%) [-2.41905x]
  L1 Hits:                        283167619|616345853            (-54.0570%) [-2.17661x]
  LL Hits:                             5886|34218438             (-99.9828%) [-5813.53x]
  RAM Hits:                          551645|551715               (-0.01269%) [-1.00013x]
  Total read+write:               283725150|651116006            (-56.4248%) [-2.29488x]
  Estimated Cycles:               302504624|806748068            (-62.5032%) [-2.66689x]

Roughly -60%! It was consistent over multiple runs too. The reason for
this massive improvement is the structure of the skewed dataset.
The skewed keys are an artificial dataset I use for testing; here is
some example data:

[255]
[0, 255]
[0, 0, 255]
[0, 0, 0, 255]
[0, 0, 0, 0, 255]
[0, 0, 0, 0, 0, 255]
[0, 0, 0, 0, 0, 0, 255]
[0, 0, 0, 0, 0, 0, 0, 255]
[0, 0, 0, 0, 0, 0, 0, 0, 255]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 255]
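
The pattern above can be reproduced with a short generator. This is
only a sketch: skewed_keys is a hypothetical helper, and the generator
actually used in the benchmarks may differ.

```rust
/// Generate `count` keys following the pattern above: the n-th key
/// (counting from zero) is n zero bytes followed by a single 255.
fn skewed_keys(count: usize) -> Vec<Vec<u8>> {
    (0..count)
        .map(|zeros| {
            let mut key = vec![0u8; zeros];
            key.push(255);
            key
        })
        .collect()
}
```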

When inserted into the tree structure, this sequence of keys is a
worst case for lookup/insertion/deletion/etc because the number of
inner nodes is maximized. The insert operation is especially bad
because on each step of the lookup portion of the operation (finding
where to insert) we need to see if there is a "full prefix" mismatch.
The "full prefix" mismatch requires going to find a descendant leaf
node, which requires recursing down the whole long tree. However,
this optimization avoids that descent because it checks two
children at each inner node and prefers one that points to a
leaf node. That effectively makes the "full prefix" lookup constant
time (for this specific key dataset).
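
The difference can be sketched by counting how many child pointers
each strategy follows on a tree shaped like the skewed dataset. The
Node type and all three functions are hypothetical simplifications,
not blart code. Children are assumed to be sorted by key byte, so
index 0 is the minimum (the 0 byte) and the last entry is the leaf
reached through byte 255.

```rust
// Hypothetical, simplified tree; leaves carry no payload here because
// only the number of hops matters.
enum Node {
    Leaf,
    Inner(Vec<Node>),
}

/// Build the shape produced by `depth + 1` skewed keys: each inner
/// node has a deeper subtree under byte 0 and a leaf under byte 255.
fn skewed_tree(depth: usize) -> Node {
    let mut node = Node::Leaf;
    for _ in 0..depth {
        node = Node::Inner(vec![node, Node::Leaf]);
    }
    node
}

/// Always follow the minimum child; returns the number of hops.
fn minimum_leaf_hops(mut node: &Node) -> usize {
    let mut hops = 0;
    while let Node::Inner(children) = node {
        node = &children[0];
        hops += 1;
    }
    hops
}

/// Check the first and last child and prefer a leaf; returns hops.
fn best_of_two_leaf_hops(mut node: &Node) -> usize {
    let mut hops = 0;
    while let Node::Inner(children) = node {
        let first = &children[0];
        let last = &children[children.len() - 1];
        // Take `last` only when it is a leaf and `first` is not.
        node = if matches!(last, Node::Leaf) && !matches!(first, Node::Leaf) {
            last
        } else {
            first
        };
        hops += 1;
    }
    hops
}
```

On this shape the minimum-based descent follows one pointer per inner
node, while the best-of-two descent stops after a single hop
regardless of depth.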

Overall, I don't think this optimization is hugely important, though
it was fun to investigate. I'll probably keep it because it helps the
skewed edge case performance a ton and doesn't hurt other datasets
much at all.

Passed all existing tests, no new tests.


declanvk merged commit 4e415cc into main May 5, 2026
12 of 16 checks passed

declanvk deleted the optimize-read-full-prefix-intermediate branch May 5, 2026 05:34

declanvk merged 1 commit into main from optimize-read-full-prefix-intermediate