Optimize lookup for any descendant leaf #74

Merged: declanvk merged 1 commit into main from optimize-read-full-prefix-intermediate on May 5, 2026

@declanvk declanvk commented May 5, 2026

Instead of using `minimum_unchecked` to find an arbitrary descendant,
use a custom function that tries to quickly find any leaf of a given
subtree.

The default implementation for inner nodes is equivalent to the
`minimum_unchecked` function, but is overridden for `InnerNodeSorted`
and `InnerNodeIndirect`. Both of these inner node types maintain
a compact array of child pointers, which makes it easy to select
a child node at random.

Rather than picking randomly, we're trying to find a leaf node as
quickly as possible. I'd previously read
<https://brooker.co.za/blog/2012/01/17/two-random.html>, which gave
me the idea to try a best-of-two strategy for looking for leaf node
child pointers. I couldn't say that this is the very best option,
only that it is easily proven correct (there are always at least two
child nodes in an inner node) and that later testing showed it was
better than a single choice.
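
As a rough illustration of the best-of-two idea, here is a minimal sketch on a simplified tree. The `Node` enum, `minimum_leaf`, and `any_leaf` are hypothetical stand-ins, not the crate's real node types or functions, and choosing the first and last child as the two candidates is an assumption made for the example.

```rust
/// Simplified stand-in for a tree node; not the crate's real node types.
#[derive(Debug)]
enum Node {
    /// A leaf holding its full key.
    Leaf(Vec<u8>),
    /// Compact child array, ordered by key byte (always at least two entries).
    Inner(Vec<Node>),
}

/// Baseline: always follow the first (minimum) child until a leaf is reached.
fn minimum_leaf(mut node: &Node) -> &[u8] {
    loop {
        match node {
            Node::Leaf(key) => return key,
            Node::Inner(children) => node = &children[0],
        }
    }
}

/// Best-of-two: inspect two children and prefer one that is already a leaf.
/// The two candidates here are the first and last child (an assumption).
fn any_leaf(mut node: &Node) -> &[u8] {
    loop {
        match node {
            Node::Leaf(key) => return key,
            Node::Inner(children) => {
                let first = &children[0];
                let last = &children[children.len() - 1];
                node = match (first, last) {
                    (Node::Leaf(_), _) => first,
                    (_, Node::Leaf(_)) => last,
                    // Neither candidate is a leaf: fall back to the minimum.
                    _ => first,
                };
            }
        }
    }
}
```

Because an inner node always has at least two children, indexing the first and last entries cannot go out of bounds, which is the correctness argument mentioned above.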

I wanted to do this optimization when I started reworking the range
iterator a little while ago. Specifically, the range operation needs
to do a "full prefix" search (as opposed to a pessimistic/optimistic
prefix-based search), which requires searching for a leaf node if a
given inner node has implicit bytes in the stored prefix. "Full
prefix" searches also happen in the insert code path.
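
To make the "implicit bytes" case concrete, here is a small sketch of a full prefix comparison under assumed names. `Header`, `STORED_PREFIX_LEN`, `match_full_prefix`, and the `descendant_leaf_key` callback are all hypothetical and only illustrate why a leaf lookup is needed when the prefix is longer than what the node stores inline.

```rust
/// Hypothetical number of prefix bytes stored inline in an inner node header.
const STORED_PREFIX_LEN: usize = 4;

struct Header {
    /// Logical length of the compressed prefix; may exceed STORED_PREFIX_LEN.
    prefix_len: usize,
    /// Only the first `min(prefix_len, STORED_PREFIX_LEN)` bytes are meaningful.
    stored: [u8; STORED_PREFIX_LEN],
}

/// Length of the common prefix of two byte slices.
fn common_len(a: &[u8], b: &[u8]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Counts how many bytes of the node's full prefix match `key[depth..]`.
/// `descendant_leaf_key` is only called when some prefix bytes are implicit,
/// that is, when the prefix is longer than the inline storage.
fn match_full_prefix(
    header: &Header,
    key: &[u8],
    depth: usize,
    descendant_leaf_key: impl FnOnce() -> Vec<u8>,
) -> usize {
    let rest = &key[depth.min(key.len())..];
    if header.prefix_len <= STORED_PREFIX_LEN {
        // The whole prefix is stored inline; no leaf lookup is needed.
        common_len(&header.stored[..header.prefix_len], rest)
    } else {
        // Some bytes are implicit: recover the full prefix from any leaf
        // below this node. Every such leaf key shares the prefix, so it is
        // assumed to be at least `depth + prefix_len` bytes long.
        let leaf_key = descendant_leaf_key();
        common_len(&leaf_key[depth..depth + header.prefix_len], rest)
    }
}
```

In this sketch the cost of the long-prefix branch is dominated by whatever `descendant_leaf_key` does, which is why a faster way of reaching any leaf matters for this code path.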

Looking at the "full prefix" search code, it occurred to me that using
the minimum as a way to find an arbitrary leaf node was a pretty good
option, but not the best. So I tried out the best-of-two approach and
realized that the overall improvement for most cases was marginal at
best. However, for one specific dataset and operation this was a huge
improvement:

```text
iai_callgrind::bench_insert_group::bench_from_iter skewed:...
  Baselines:                               |9b221e2
  Instructions:                   211822957|512411353            (-58.6615%) [-2.41905x]
  L1 Hits:                        283167619|616345853            (-54.0570%) [-2.17661x]
  LL Hits:                             5886|34218438             (-99.9828%) [-5813.53x]
  RAM Hits:                          551645|551715               (-0.01269%) [-1.00013x]
  Total read+write:               283725150|651116006            (-56.4248%) [-2.29488x]
  Estimated Cycles:               302504624|806748068            (-62.5032%) [-2.66689x]
```

`-60%`!!! It was consistent over multiple runs too. The reason for
this massive improvement is the structure of the `skewed` dataset.
`skewed` keys are an artificial dataset I use for testing; here is
some example data:

```text
[255]
[0, 255]
[0, 0, 255]
[0, 0, 0, 255]
[0, 0, 0, 0, 255]
[0, 0, 0, 0, 0, 255]
[0, 0, 0, 0, 0, 0, 255]
[0, 0, 0, 0, 0, 0, 0, 255]
[0, 0, 0, 0, 0, 0, 0, 0, 255]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 255]
```
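
The benchmark's actual generator is not shown here, but keys of this shape could be produced with something like the following sketch. The function name `skewed_keys` and its signature are assumptions for illustration.

```rust
/// Generates `count` keys of the form [0, 0, ..., 0, 255], where the i-th key
/// (counting from zero) has `i` leading zero bytes followed by a single 255.
fn skewed_keys(count: usize) -> Vec<Vec<u8>> {
    (0..count)
        .map(|zeros| {
            let mut key = vec![0u8; zeros];
            key.push(255);
            key
        })
        .collect()
}
```

Calling `skewed_keys(10)` yields the ten keys listed above, in the same order.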

When inserted into the tree structure, this sequence of keys is a
worst case for lookup/insertion/deletion/etc because the number of
inner nodes is maximized. The insert operation is especially bad
because on each step of the lookup portion of the operation (finding
where to insert) we need to see if there is a "full prefix" mismatch.
The "full prefix" check requires finding a descendant leaf node,
which, when following the minimum, means descending the whole long
tree. However, this optimization specifically avoids that descent
because it checks two children at each inner node and prefers the one
that points to a leaf node. That effectively makes the "full prefix"
lookup constant time (for this specific key dataset).
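
The effect can be seen by counting hops on a tree with the same shape as the one the `skewed` keys produce. This is only a sketch: `Node`, `skewed_tree`, `hops_minimum`, and `hops_best_of_two` are hypothetical, the tree ignores keys and prefixes entirely, and the two candidates checked are assumed to be the first and last child.

```rust
/// Shape-only stand-in for a tree node; keys and prefixes are omitted.
enum Node {
    Leaf,
    Inner(Vec<Node>),
}

/// Builds the shape of a tree holding `count` skewed keys (`count >= 2`):
/// each inner node has a `0` child that continues the chain and a `255`
/// child that is a leaf. The deepest inner node holds the two longest keys.
fn skewed_tree(count: usize) -> Node {
    let mut node = Node::Inner(vec![Node::Leaf, Node::Leaf]);
    for _ in 2..count {
        node = Node::Inner(vec![node, Node::Leaf]);
    }
    node
}

/// Number of child pointers followed when always taking the minimum child.
fn hops_minimum(mut node: &Node) -> usize {
    let mut hops = 0;
    while let Node::Inner(children) = node {
        node = &children[0];
        hops += 1;
    }
    hops
}

/// Number of child pointers followed when checking the first and last child
/// and preferring whichever one is a leaf.
fn hops_best_of_two(mut node: &Node) -> usize {
    let mut hops = 0;
    while let Node::Inner(children) = node {
        let first = &children[0];
        let last = &children[children.len() - 1];
        node = if matches!(first, Node::Leaf) {
            first
        } else if matches!(last, Node::Leaf) {
            last
        } else {
            first
        };
        hops += 1;
    }
    hops
}
```

For a tree built from `n` skewed keys, the minimum descent follows `n - 1` pointers from the root, while the best-of-two descent follows exactly one, regardless of `n`.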

Overall, I don't think this optimization is hugely important, though
it was fun to investigate. I'll probably keep it because it helps the
skewed edge case performance a ton and doesn't hurt other datasets
much at all.

Passed all existing tests, no new tests.
@declanvk declanvk merged commit 4e415cc into main May 5, 2026
12 of 16 checks passed
@declanvk declanvk deleted the optimize-read-full-prefix-intermediate branch May 5, 2026 05:34