Conversation

@vyavdoshenko (Contributor) commented Nov 10, 2025

Relevant to: #6016

Improves SPOP performance on sparse hash tables (created by Reserve()) through three key optimizations:

  1. thread_local RNG: Use a thread_local absl::InsecureBitGen instead of
     creating a new BitGen on each call, eliminating construction overhead

  2. Iterator increment: Replace the modulo operation with an iterator
     increment and a wrap-around check, avoiding expensive division in the
     hot loop

  3. IsEmpty() early check: Test IsEmpty() before calling ExpireIfNeeded()
     to skip expensive expiration checks on empty buckets
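Taken together, the three changes can be sketched as a small standalone loop. This is only an illustrative sketch, not the actual DenseSet code: Bucket and PickRandomNonEmptyBucket are made-up names, and std::mt19937 stands in for absl::InsecureBitGen so the snippet compiles with the standard library alone:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Placeholder for a DenseSet bucket; only the emptiness check matters here.
struct Bucket {
  bool occupied = false;
  bool IsEmpty() const { return !occupied; }
};

// Returns the index of a random non-empty bucket, or entries.size() if the
// table is entirely empty.
std::size_t PickRandomNonEmptyBucket(const std::vector<Bucket>& entries) {
  if (entries.empty()) return 0;

  // (1) thread_local generator: constructed once per thread, not per call.
  thread_local std::mt19937 gen{std::random_device{}()};

  std::size_t start =
      std::uniform_int_distribution<std::size_t>(0, entries.size() - 1)(gen);
  std::size_t i = start;
  do {
    // (3) cheap emptiness test first, so per-bucket work such as the
    // expiration check is skipped entirely for empty buckets.
    if (!entries[i].IsEmpty()) return i;
    // (2) increment with a wrap-around check instead of (start + k) % size,
    // avoiding an integer division on every probe.
    if (++i == entries.size()) i = 0;
  } while (i != start);
  return entries.size();
}
```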

Comparison:
Main branch:

BM_Spop1000/sparseness:1                    5964193 ns      5963687 ns          117
BM_Spop1000/sparseness:4                    6365861 ns      6365321 ns          110
BM_Spop1000/sparseness:10                   7633700 ns      7633658 ns           91
BM_Spop1000/sparseness:40                  13036086 ns     13035934 ns           54
BM_Spop1000/sparseness:100                 20192890 ns     20192937 ns           35

Fixed version (current):

BM_Spop1000/sparseness:1                    1232039 ns      1232019 ns          568
BM_Spop1000/sparseness:4                    1590988 ns      1590901 ns          440
BM_Spop1000/sparseness:10                   2773856 ns      2773784 ns          252
BM_Spop1000/sparseness:40                   7754844 ns      7754870 ns           90
BM_Spop1000/sparseness:100                 14499281 ns     14499131 ns           48

Previous approach (rejection sampling, wrong direction):

BM_Spop1000/sparseness:1                    1351481 ns      1351497 ns          518
BM_Spop1000/sparseness:4                    2342621 ns      2342612 ns          299
BM_Spop1000/sparseness:10                   5463768 ns      5463703 ns          129
BM_Spop1000/sparseness:40                  13164771 ns     13164710 ns           53
BM_Spop1000/sparseness:100                 21711212 ns     21711052 ns           32

The comparison above is based on the fixed benchmark: #6038

@dranikpg (Contributor) previously approved these changes Nov 10, 2025, leaving a comment:

Clever. Please change the bit gen in GetRandomIterator as well

Comment on lines 680 to 681:

// Use thread-local generator to avoid repeated construction overhead
thread_local absl::BitGen gen;

Contributor:

Maybe just using absl::InsecureBitGen is as cheap?

@vyavdoshenko (Contributor Author):

Fixed

@BorysTheDev (Contributor):

"For sparse tables (6% util): ~87% success rate with 32 probes -> avoids expensive linear scan"

This sounds really strange: why would a linear scan be more expensive than probing?

@romange (Collaborator) commented Nov 11, 2025

Please open a separate issue for implementing DenseSet::Shrink, as it is needed anyway for sorted sets, sets, and hash maps

@romange (Collaborator) commented Nov 11, 2025

Clever indeed.
$\text{success} = 1 - \text{fail} = 1 - \text{sparsity}^{32}$
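With the ~6% utilization mentioned earlier, sparsity ≈ 0.94, and this formula indeed reproduces the quoted ~87% success rate. A tiny check (ProbeSuccess is an illustrative helper, not a DenseSet function):

```cpp
#include <cmath>

// Probability that at least one of `probes` independent uniform probes
// lands on a non-empty bucket, where `sparsity` is the fraction of empty
// buckets. For sparsity 0.94 and 32 probes this comes out to roughly 0.86.
double ProbeSuccess(double sparsity, int probes) {
  return 1.0 - std::pow(sparsity, probes);
}
```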

@romange previously approved these changes Nov 11, 2025

@BorysTheDev (Contributor):

If we have a uniform distribution and we start from a random point, we should get the same complexity for a linear scan as for probing, but the linear scan should be faster because of spatial memory locality

@BorysTheDev (Contributor) left a comment:

random probing can't solve this issue

@dranikpg (Contributor):

> random probing can't solve this issue

With string hashes the distribution should be close to uniform. Thinking about it purely theoretically, for a single operation, probing one more cell with a linear scan shouldn't be different from doing random probes. Except that if we always probe linearly, we create "islands" of values, since there is more blank space to land in and we only bite off the leftmost value of the next "island". So maybe with linear probing a large hole is created that only grows on average. At least this is my theory.

Any full solution is just shrinking the backing array, as there is simply not enough information for efficiently searching a few leftover entries in a large backing array.

@BorysTheDev (Contributor):

> random probing can't solve this issue
>
> With string hashes the distribution should be close to uniform. Thinking about it purely theoretically, for a single operation, probing one more cell with a linear scan shouldn't be different from doing random probes. Except that if we always probe linearly, we create "islands" of values, since there is more blank space to land in and we only bite off the leftmost value of the next "island". So maybe with linear probing a large hole is created that only grows on average. At least this is my theory.
>
> Any full solution is just shrinking the backing array, as there is simply not enough information for efficiently searching a few leftover entries in a large backing array.

If we start the linear scan from a random point every time, there shouldn't be any islands, and the result should be the same as with random probing. I think the problem is somewhere else: for example, our random generator could provide a non-uniform distribution, or somehow we don't remove items at the beginning, or something else. Anyway, without understanding the real problem, mathematically this solution is incorrect

@romange (Collaborator) commented Nov 11, 2025

I think having a 5000x improvement justifies the fix. Having said that, I agree that it would be nice to understand WHY the main branch is "5000x" slower.

@BorysTheDev (Contributor):

> I think having a 5000x improvement justifies the fix. Having said that, I agree that it would be nice to understand WHY the main branch is "5000x" slower.

I don't say that there is no improvement. I say that the current fix isn't optimal and uses incorrect assumptions, and the results that we see have another explanation. I think the problem was fixed by some incidental change, not by the probing.

@mkaruza (Contributor) commented Nov 11, 2025

Setting aside the discussion about randomness: @vyavdoshenko, did you also consider rewriting the function to use bit operations instead of modulo? I.e.:

  size_t modulo = (1 << capacity_log_) - 1;
  for (size_t i = offset; i < entries_.size() + offset; i++) {
    auto it = entries_.begin() + (i & modulo);
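The trick works because the table capacity is a power of two, in which case index % size equals index & (size - 1). A minimal standalone version (WrapIndex and capacity_log are illustrative names, not the actual DenseSet members):

```cpp
#include <cstddef>

// For a power-of-two table size (size == 1 << capacity_log), taking the
// index modulo the size reduces to a single AND with (size - 1), avoiding
// the integer division that `%` would perform.
std::size_t WrapIndex(std::size_t index, unsigned capacity_log) {
  std::size_t mask = (std::size_t{1} << capacity_log) - 1;
  return index & mask;
}
```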

@vyavdoshenko vyavdoshenko dismissed stale reviews from romange and dranikpg via bd49627 November 11, 2025 13:33
@vyavdoshenko (Contributor Author):

The simplest solution is the best. There were three problems:

  • Creating a random generator on each call
  • Using a modulo operation, which can be avoided
  • Executing ExpireIfNeeded for empty objects, which can be avoided

@mkaruza previously approved these changes Nov 11, 2025

Comment on the benchmark code:

state.ResumeTiming();
for (int i = 0; i < 1000; ++i) {
  src.Pop();
  tmp.Pop();

@mkaruza (Contributor), Nov 11, 2025:

Please check that we always pop 1000 elements, or that we always pop an element, whichever is easier. Otherwise LGTM.

@vyavdoshenko vyavdoshenko marked this pull request as draft November 11, 2025 13:59
@vyavdoshenko vyavdoshenko changed the title fix: Optimize GetRandomChain() in DenseSet [Do Not Review]fix: Optimize GetRandomChain() in DenseSet Nov 11, 2025
@vyavdoshenko vyavdoshenko requested a review from mkaruza November 11, 2025 15:43
@vyavdoshenko vyavdoshenko changed the title [Do Not Review]fix: Optimize GetRandomChain() in DenseSet fix: Optimize GetRandomChain() in DenseSet Nov 11, 2025
@vyavdoshenko vyavdoshenko marked this pull request as ready for review November 11, 2025 15:43
}

// Use thread-local generator to avoid construction overhead
thread_local absl::InsecureBitGen gen;
Collaborator:

let's use the same thread_local object here and at line 710

@vyavdoshenko (Contributor Author):

Fixed

@vyavdoshenko vyavdoshenko requested a review from romange November 11, 2025 16:44
@romange (Collaborator) commented Nov 11, 2025

I remember seeing different numbers for the benchmark. While this improvement is good, it does not solve the problem for oversized sets.

constexpr size_t kMinSize = 1 << kMinSizeShift;
constexpr bool kAllowDisplacements = true;

thread_local absl::InsecureBitGen tl_bit_gen;
Contributor:

Maybe creating the bitgen is not expensive

@vyavdoshenko vyavdoshenko merged commit fa0f64f into main Nov 11, 2025
10 checks passed
@vyavdoshenko vyavdoshenko deleted the bobik/dense_set_opt branch November 11, 2025 19:34