feat(l1): kademlia k-bucket routing table v2#6511
Draft
azteca1998 wants to merge 12 commits into
Conversation
…talls

Fixes snapsync failures where the peer count stays constant and sync eventually fails with "Failed to receive block headers" after hours of operation.

Root cause: After PR #6458 introduced Kademlia k-buckets, peers that became unresponsive during sync weren't marked as disposable, so they remained in the routing table indefinitely. New peers went into replacement lists but were never promoted because dead peers weren't pruned.

Changes:
- Enhanced prune() to remove disposable contacts from both the main and replacement lists, with automatic promotion of replacements
- Mark peers as disposable when they time out during RLPx operations (block headers, block bodies, sync head requests)
- Added periodic pruning in the snap_sync main loop so dead peers are regularly removed and replaced

Evidence from CI artifacts showed the peer count stuck at 6 throughout a 3h35m sync before failure. This fix enables peer rotation so healthy peers from the replacement lists can take over when active peers become unresponsive.
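The enhanced prune() described above can be sketched as follows. This is a simplified, hypothetical model (Contact, KBucket, and the u64 id are stand-ins for the real ethrex types, which use H256 node ids), not the actual implementation:

```rust
// Hedged sketch of the pruning fix: drop disposable contacts from both
// lists, then promote replacements into the freed main slots so dead
// peers are actually rotated out instead of lingering indefinitely.
#[derive(Debug)]
struct Contact {
    id: u64,          // stand-in for the real H256 node id
    disposable: bool, // set when the peer timed out during RLPx operations
}

struct KBucket {
    main: Vec<Contact>,         // active contacts (capacity k)
    replacements: Vec<Contact>, // backup contacts waiting for promotion
    k: usize,
}

impl KBucket {
    /// Remove disposable contacts from both lists, then fill freed main
    /// slots by promoting contacts from the replacement list.
    fn prune(&mut self) {
        self.main.retain(|c| !c.disposable);
        self.replacements.retain(|c| !c.disposable);
        while self.main.len() < self.k {
            match self.replacements.pop() {
                Some(c) => self.main.push(c),
                None => break,
            }
        }
    }
}
```

The key design point is that pruning and promotion happen in the same pass, so a freed slot never sits empty while healthy replacements wait.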
The Kademlia k-bucket implementation only iterated over main bucket contacts, ignoring replacement entries. This caused peer starvation because dead contacts in the main list were never replaced by fresher peers from the replacement list. Fix iter_contacts() and do_get_contact_to_initiate() to also check replacement contacts, allowing the node to discover and connect to peers that were previously invisible to the peer selection logic.
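The iteration fix can be illustrated with a minimal sketch (simplified types, u64 ids in place of the real node records; not the ethrex API):

```rust
// Sketch of the fix: iter_contacts must chain the replacement list onto
// the main list, otherwise replacement peers stay invisible to the peer
// selection logic even when every main contact is dead.
struct KBucket {
    main: Vec<u64>,         // active contacts
    replacements: Vec<u64>, // backup contacts
}

impl KBucket {
    /// Yield main contacts first, then replacements, so selection can
    /// fall back to fresher peers when the main list has gone stale.
    fn iter_contacts(&self) -> impl Iterator<Item = &u64> + '_ {
        self.main.iter().chain(self.replacements.iter())
    }
}
```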
KBucket::get_mut and get_contact only searched the main contact list, so any state mutation (set_disposable, ping tracking, find_node count, mark_knows_us) silently failed for contacts in the replacement list. Since iter_contacts and do_get_contact_to_initiate now return replacement contacts, this caused phantom contacts that were visible to selection but invisible to updates. Update get_contact to use get_any (main + replacements) and get_mut to search both lists, ensuring all contact state mutations work regardless of which list holds the contact.
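The lookup fix amounts to searching both lists on every read and write. A hedged sketch (Contact and the u64 id are illustrative stand-ins, not the real types):

```rust
// Sketch of the get_any behaviour: lookups and mutations must search the
// replacement list too, otherwise state changes (set_disposable, ping
// tracking, etc.) silently miss contacts that selection can still return.
struct Contact {
    id: u64,
    disposable: bool,
}

struct KBucket {
    main: Vec<Contact>,
    replacements: Vec<Contact>,
}

impl KBucket {
    /// Immutable lookup across both lists.
    fn get_contact(&self, id: u64) -> Option<&Contact> {
        self.main
            .iter()
            .chain(self.replacements.iter())
            .find(|c| c.id == id)
    }

    /// Mutable lookup across both lists, so mutations work no matter
    /// which list currently holds the contact.
    fn get_mut(&mut self, id: u64) -> Option<&mut Contact> {
        self.main
            .iter_mut()
            .chain(self.replacements.iter_mut())
            .find(|c| c.id == id)
    }
}
```

With this, marking a replacement-list contact disposable actually sticks, closing the "phantom contact" gap described above.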
…able

Add a separate IndexMap<H256, Node> connection pool (capacity 50K) for RLPx connection initiation, decoupled from the k-bucket routing table (which is limited to 256 × 16 = 4,096 contacts by Kademlia design). All discovered contacts are inserted into both the k-buckets (for Kademlia protocol operations like FindNode/GetClosestNodes) and the connection pool (for peer connection initiation). This restores the large candidate pool that existed before the k-bucket migration while preserving correct Kademlia routing semantics.

The connection pool is:
- Populated on every contact discovery (discv4, discv5, insert_if_new)
- Cleaned during prune() when contacts are marked disposable
- Capped at 50K entries with oldest-first eviction
- Used with random selection and k-bucket state filtering
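A dependency-free sketch of the capped, insertion-ordered pool described above. The real implementation uses an IndexMap<H256, Node>; here a HashMap plus an insertion-order queue stands in, with u64 ids and String records as placeholders:

```rust
use std::collections::{HashMap, VecDeque};

// Illustrative model of the capped connection pool: insertion order is
// tracked separately so the oldest entry can be evicted when full.
struct ConnectionPool {
    nodes: HashMap<u64, String>, // node id -> node record (stand-ins)
    order: VecDeque<u64>,        // insertion order, for oldest-first eviction
    cap: usize,                  // 50_000 in the PR
}

impl ConnectionPool {
    fn new(cap: usize) -> Self {
        Self { nodes: HashMap::new(), order: VecDeque::new(), cap }
    }

    /// Insert a newly discovered node; evict the oldest entry when full.
    fn insert(&mut self, id: u64, node: String) {
        if self.nodes.contains_key(&id) {
            return; // already pooled
        }
        if self.nodes.len() == self.cap {
            if let Some(oldest) = self.order.pop_front() {
                self.nodes.remove(&oldest);
            }
        }
        self.nodes.insert(id, node);
        self.order.push_back(id);
    }

    /// Called during prune(): drop a contact marked disposable.
    fn remove(&mut self, id: u64) {
        if self.nodes.remove(&id).is_some() {
            self.order.retain(|k| *k != id);
        }
    }
}
```

An IndexMap gives the same behavior more directly (shift_remove_index(0) evicts the oldest entry), which is presumably why it was chosen.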
Matches the candidate pool size used by Reth and Nethermind.
- Replace the O(n) collect-then-choose in do_get_contact_to_initiate with O(k) random index probing on the IndexMap (rand % len, then scan forward). The old approach scanned all 10K pool entries, cloned the eligible ones into a Vec, then picked randomly, blocking the peer_table actor and starving snap sync's get_best_peer calls.
- Replace collect-then-choose in do_get_contact_for_lookup with IteratorRandom::choose (single-pass reservoir sampling, zero allocation).
- Remove the discarded_contacts permanent blacklist entirely. Contacts pruned from k-buckets now remain in the connection pool so they can be retried; the RLPx handshake rejects truly incompatible peers. Previously, a single timeout permanently blacklisted a contact from both the pool and re-discovery.
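The O(k) random-probe selection in the first bullet can be sketched like this (probe_select and its parameters are illustrative names; `eligible` stands in for the k-bucket state filtering, and the random start index stands in for rand % len):

```rust
// Sketch of O(k) random index probing replacing collect-then-choose:
// instead of cloning every eligible entry into a Vec, pick a random
// start index and scan forward (wrapping) until an eligible entry is
// found or the probe budget is spent.
fn probe_select<T>(
    pool: &[T],
    start: usize,               // random start, e.g. rand % pool.len()
    max_probes: usize,          // k: bounds the work per call
    eligible: impl Fn(&T) -> bool,
) -> Option<&T> {
    if pool.is_empty() {
        return None;
    }
    let start = start % pool.len();
    (0..max_probes.min(pool.len()))
        .map(|i| &pool[(start + i) % pool.len()])
        .find(|t| eligible(t))
}
```

Bounding the scan at k probes is what keeps the peer_table actor responsive: each call does constant expected work regardless of pool size, instead of touching all 10K entries.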
Resolve conflicts in p2p module by adapting HEAD's new discovery/ handler files to main's PeerTable API (k-bucket table stores local_node_id internally, removing extra node_id arguments from callers).
Replace the direct random-contact lookup with a geth-style iterative convergence approach: generate a random target, seed with closest known nodes from the connection pool, query alpha=3 closest not-yet-asked nodes, feed responses back, and iterate until convergence or timeout.
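A toy model of that iterative loop, under stated assumptions: u64 ids with XOR distance stand in for H256, and the `neighbors` closure stands in for the FindNode network round-trip (no timeout handling shown):

```rust
use std::collections::HashSet;

// Sketch of the geth-style iterative convergence lookup: query the alpha
// closest not-yet-asked nodes, feed their responses back into the
// candidate set, and repeat until no unqueried candidates remain.
const ALPHA: usize = 3;

fn iterative_lookup(
    target: u64,                          // random lookup target
    seeds: &[u64],                        // closest known nodes from the pool
    neighbors: impl Fn(u64) -> Vec<u64>,  // stand-in for FindNode
) -> Vec<u64> {
    let mut asked: HashSet<u64> = HashSet::new();
    let mut known: Vec<u64> = seeds.to_vec();
    loop {
        // Order candidates by XOR distance to the target.
        known.sort_by_key(|id| id ^ target);
        known.dedup();
        // Take the alpha closest nodes we have not queried yet.
        let batch: Vec<u64> = known
            .iter()
            .copied()
            .filter(|id| !asked.contains(id))
            .take(ALPHA)
            .collect();
        if batch.is_empty() {
            return known; // converged: every candidate has been asked
        }
        for id in batch {
            asked.insert(id);
            // Responses feed back into the candidate set for the next round.
            known.extend(neighbors(id));
        }
    }
}
```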
Change is_finished() to consider a lookup converged as soon as all entries have been queried, without waiting for in-flight responses. Late responses still get processed via handle_neighbors and feed into the connection pool for future lookups. This prevents a single unresponsive node from blocking the entire lookup for up to 20 seconds (the timeout), which was causing very slow peer acquisition compared to the old random-contact approach.
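The relaxed convergence check is small; a sketch with illustrative field names (not the ethrex ones):

```rust
// Sketch of the relaxed is_finished(): a lookup is converged once every
// entry has been queried, even if some responses are still in flight.
struct LookupEntry {
    queried: bool,   // we sent this node a query
    responded: bool, // we got its response back
}

fn is_finished(entries: &[LookupEntry]) -> bool {
    // The old check also required `responded`, letting one dead node
    // stall the whole lookup until the 20s timeout; now only `queried`
    // matters, and late responses are handled out-of-band.
    entries.iter().all(|e| e.queried)
}
```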
Instead of returning after clearing a finished lookup (which waits for the easing interval before starting the next one), immediately start a new lookup in the same tick. This matches geth's behavior of continuous lookup chaining during bootstrapping.
Summary
Re-introduces the Kademlia k-bucket routing table (reverted in #6505 for v10 release) with all fixes and performance improvements applied:
- rand() % len + forward scan, avoiding actor contention during snap sync

Performance issues addressed
- Every get_contact_to_initiate() call (every 100ms) scanned the whole pool, blocking get_best_peer() calls from snap sync workers. Now O(k) with a random start index.
- discarded_contacts permanently banned peers after a single timeout, shrinking the effective pool over time. Removed entirely; the RLPx handshake handles rejection.

Pending
Test plan