feat(l1): replace trie layer cache with LRU cache for full sync #6492
azteca1998 wants to merge 9 commits into `main`.
Conversation
… mode (#6480)

During full sync (`add_blocks_in_batch`), the `TrieLayerCache`'s diff-layer chain, bloom filter, and RCU cloning are pure overhead, since full sync never reorgs. This commit adds a `FlatTrieCache` backed by an LRU that bypasses all of that machinery:

- New `FlatTrieCache` in `layering.rs`: a simple LRU keyed by trie node path; no layers, no bloom filter, no parent pointers.
- New `TrieCacheRef` enum: lets `TrieWrapper` use either the layered cache (normal CL processing) or the flat LRU cache (batch mode) transparently.
- In batch mode, `apply_trie_updates` writes trie nodes directly to disk and populates the LRU, instead of accumulating diff layers and doing periodic commits with bloom rebuilds.
- `add_blocks_in_batch` enables the flat cache before execution and disables it after, so normal block processing is completely unchanged.
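The core idea can be sketched as a minimal LRU keyed by node path. This is an illustrative stand-in, not the actual ethrex `FlatTrieCache`: the struct shape, the `get`/`insert` signatures, and the O(n) scan eviction are assumptions made for brevity (a production cache would use an O(1) recency list).

```rust
use std::collections::HashMap;

// Hypothetical sketch of a flat LRU trie-node cache. Keys are trie node
// paths, values are encoded nodes. Recency is tracked with a monotonic
// counter; eviction scans for the least-recently-used entry.
struct FlatTrieCache {
    capacity: usize,
    tick: u64,
    entries: HashMap<Vec<u8>, (Vec<u8>, u64)>, // path -> (node, last_used)
}

impl FlatTrieCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, tick: 0, entries: HashMap::new() }
    }

    fn get(&mut self, path: &[u8]) -> Option<Vec<u8>> {
        self.tick += 1;
        let tick = self.tick;
        self.entries.get_mut(path).map(|(node, last)| {
            *last = tick; // refresh recency on read
            node.clone()
        })
    }

    fn insert(&mut self, path: Vec<u8>, node: Vec<u8>) {
        self.tick += 1;
        if self.entries.len() >= self.capacity && !self.entries.contains_key(&path) {
            // Evict the least-recently-used entry (O(n) scan, sketch only).
            if let Some(lru) = self
                .entries
                .iter()
                .min_by_key(|(_, (_, last))| *last)
                .map(|(k, _)| k.clone())
            {
                self.entries.remove(&lru);
            }
        }
        self.entries.insert(path, (node, self.tick));
    }
}
```

No layers, no bloom filter, no parent pointers: a lookup is a single hash-map probe, which is the entire point of the batch-mode fast path.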
🤖 Kimi Code Review

Overall Assessment: The PR introduces a well-structured optimization for full sync batch processing. The abstraction using …

Issues and Suggestions:
1. Magic numbers for trie key lengths (…)
2. Early success signal before disk persistence (…)
3. Unused method (…)
4. Poisoned lock handling (…)
5. Unnecessary fallback in …
Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt
🤖 Codex Code Review

No other EVM opcode or gas-accounting issues stood out in this diff. I couldn't run …

Automated review by OpenAI Codex · gpt-5.4 · custom prompt
🤖 Claude Code Review

Here is the full review of PR #6492:

PR Review: …
The batch path in `apply_trie_updates` was sending `Ok(())` on the result channel before the RocksDB write finished. The caller would then proceed to `disable_batch_trie_cache` (dropping the LRU) and `enable_batch_trie_cache` (creating a fresh, empty LRU). If the disk write hadn't completed, the next batch's `has_state_root` check would find the root node in neither the new LRU nor on disk, causing "state root missing" errors around block 443k. Fix: move the `result_sender` signal to after `write_tx.commit()` so the caller blocks until all trie nodes are persisted.
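The ordering fix can be sketched with an `mpsc` channel and a mutex-guarded `Vec` standing in for the RocksDB write transaction; all names here are hypothetical simplifications, not the ethrex API.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Stand-in for write_tx.commit(): append the batch to the "disk".
fn commit_to_disk(disk: &Arc<Mutex<Vec<(Vec<u8>, Vec<u8>)>>>, nodes: Vec<(Vec<u8>, Vec<u8>)>) {
    disk.lock().unwrap().extend(nodes);
}

fn apply_trie_updates_batch(
    disk: Arc<Mutex<Vec<(Vec<u8>, Vec<u8>)>>>,
    nodes: Vec<(Vec<u8>, Vec<u8>)>,
    result_sender: mpsc::Sender<Result<(), String>>,
) {
    thread::spawn(move || {
        // BUG (before): sending Ok(()) here let the caller race ahead and
        // recycle the LRU while trie nodes were still in flight to disk.
        commit_to_disk(&disk, nodes); // write_tx.commit() equivalent
        // FIX: signal success only after the commit completes, so the
        // caller's recv() blocks until everything is persisted.
        result_sender.send(Ok(())).unwrap();
    });
}
```

The caller waits on the receiving end; once `recv()` returns `Ok`, every node of the batch is guaranteed to be on disk.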
When a batch fails with StateRootMismatch (cross-block EVM cache pollution), the fullsync falls back to single-block pipeline execution. The pipeline uses the normal trie path, which stores nodes in the layered diff-chain cache. With commit_threshold=128, the top ~127 layers remain uncommitted (not yet flushed to disk). When the next batch starts in batch mode, trie_cache_ref() previously returned only the flat LRU cache, completely bypassing the layered cache. has_state_root() would miss the flat cache (empty) and fall through to disk, which only had the root from ~128 blocks ago. The hash check failed, producing "state root missing". Fix: introduce TrieCacheRef::FlatWithFallback that checks the flat LRU first, then the layered cache, before falling through to disk.
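The dispatch described above can be sketched with a small enum. The types are simplified (plain `HashMap`s instead of the real caches) and the method shape is an assumption based on the text; only the lookup order — flat LRU first, then the layered cache, then disk — is taken from the commit message.

```rust
use std::collections::HashMap;

// Simplified stand-in for ethrex's TrieCacheRef dispatch.
enum TrieCacheRef<'a> {
    Layered(&'a HashMap<Vec<u8>, Vec<u8>>),
    Flat(&'a HashMap<Vec<u8>, Vec<u8>>),
    FlatWithFallback {
        flat: &'a HashMap<Vec<u8>, Vec<u8>>,
        layered: &'a HashMap<Vec<u8>, Vec<u8>>,
    },
}

impl<'a> TrieCacheRef<'a> {
    fn get(&self, path: &[u8]) -> Option<&'a Vec<u8>> {
        match self {
            TrieCacheRef::Layered(c) | TrieCacheRef::Flat(c) => c.get(path),
            TrieCacheRef::FlatWithFallback { flat, layered } => {
                // Check the flat LRU first, then the layered diff-chain
                // cache; a miss in both falls through to disk (not shown).
                flat.get(path).or_else(|| layered.get(path))
            }
        }
    }
}
```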
…mode

When a batch fails with StateRootMismatch, the pipeline fallback writes trie nodes to the layered cache (`TrieLayerCache`). The next batch writes updates only to disk + flat LRU, but after batch mode ends, the stale layered cache still shadows the fresher disk values. Subsequent reads return stale data, causing a permanent StateRootMismatch loop. Fix: drain all uncommitted layers to disk in `enable_batch_trie_cache()` before creating the flat LRU. This guarantees batch-mode reads (flat LRU → disk) never hit stale in-memory data. Removes the FlatWithFallback variant since the layered cache is now always empty when batch mode is active.
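The drain-then-enable ordering can be sketched with plain maps standing in for the diff layers and the database; the function signature and variable names are assumptions, not the real ethrex code.

```rust
use std::collections::HashMap;

// Sketch: flush every uncommitted diff layer to "disk" before handing out a
// fresh, empty flat cache, so nothing stale can shadow disk in batch mode.
fn enable_batch_trie_cache(
    layers: &mut Vec<HashMap<Vec<u8>, Vec<u8>>>, // oldest -> newest
    disk: &mut HashMap<Vec<u8>, Vec<u8>>,
) -> HashMap<Vec<u8>, Vec<u8>> {
    // Drain oldest-first so newer layers overwrite older values for the
    // same path, matching the layered cache's shadowing semantics.
    for layer in layers.drain(..) {
        disk.extend(layer);
    }
    // Return the fresh, empty flat LRU (a plain map in this sketch).
    HashMap::new()
}
```

With the layers guaranteed empty, the batch-mode read path never needs an in-memory fallback, which is why the FlatWithFallback variant could be removed.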
…trics

Add per-batch and cumulative cache statistics (hits, misses, hit rate, inserts, evictions, fill level) logged at INFO level when each batch completes. This helps evaluate whether the LRU capacity is sufficient or if evictions are causing excess disk I/O. Also doubles the default capacity from 2M to 4M entries to test whether reduced eviction pressure improves fullsync throughput.
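The statistics could look roughly like this; the field names and log format are assumptions, not the actual ethrex metrics code.

```rust
// Hypothetical per-batch cache statistics, reset after each batch.
#[derive(Default)]
struct CacheStats {
    hits: u64,
    misses: u64,
    inserts: u64,
    evictions: u64,
}

impl CacheStats {
    // Hit rate as a fraction in [0, 1]; 0 when there were no lookups.
    fn hit_rate(&self) -> f64 {
        let total = self.hits + self.misses;
        if total == 0 { 0.0 } else { self.hits as f64 / total as f64 }
    }

    // In the real code this would be a tracing/log INFO line.
    fn log_batch(&self, len: usize, capacity: usize) -> String {
        format!(
            "[METRICS] hits={} misses={} hit_rate={:.1}% inserts={} evictions={} fill={}/{}",
            self.hits, self.misses, self.hit_rate() * 100.0,
            self.inserts, self.evictions, len, capacity,
        )
    }
}
```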
The flat LRU cache had 0% hit rate because all writes happened in phase 2 (apply_trie_updates) while all reads happened in phase 1 (block execution). Promote disk-fetched trie nodes into the flat cache on read so subsequent traversals within the same batch hit memory instead of disk.
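Read-through promotion can be sketched as: on a flat-cache miss, fetch from disk and insert into the cache before returning. Types are simplified (plain maps, eviction omitted) and the function name is an assumption.

```rust
use std::collections::HashMap;

// Sketch: promote disk-fetched trie nodes into the flat cache so later
// traversals within the same batch hit memory instead of disk.
fn get_node(
    flat: &mut HashMap<Vec<u8>, Vec<u8>>,
    disk: &HashMap<Vec<u8>, Vec<u8>>,
    path: &[u8],
) -> Option<Vec<u8>> {
    if let Some(node) = flat.get(path) {
        return Some(node.clone()); // cache hit
    }
    let node = disk.get(path)?.clone(); // miss: fall through to disk
    // Promote the disk-fetched node into the flat cache.
    flat.insert(path.to_vec(), node.clone());
    Some(node)
}
```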
…overhead

Benchmarked read-through cache: 1.07M hits vs 69.7M misses (1.5% hit rate). The mutex lock + allocation on every disk read made overall sync 10 min slower (3h23m vs 3h13m write-only). Different blocks touch different trie paths, so node reuse within a batch is minimal.
Summary
- `FlatTrieCache`: an LRU-based trie node cache (2M entries) for full sync batch mode, bypassing the diff-layer chain, bloom filter rebuild, and RCU overhead
- `TrieCacheRef` enum so `TrieWrapper` can dispatch to either cache transparently
- `apply_trie_updates` takes a fast path: writes nodes directly to disk and populates the LRU, skipping layer accumulation entirely
- `add_blocks_in_batch` enables the flat cache before execution and disables it after
- `trie_cache_ref()` returns `Layered` when no batch cache is active

Closes #6480
Test plan
- Run a full sync with `FULL_SYNC_BLOCK_LIMIT=50000` and verify completion
- Check the per-batch cache statistics (`[METRICS]` logs)