
fix(l1,l2): serialize forkchoice_update #6508

Merged
iovoid merged 1 commit into main from fix-fcu-race on May 7, 2026

Conversation

@iovoid
Contributor

@iovoid iovoid commented Apr 20, 2026

Motivation

Currently if forkchoice_update is called by two sites simultaneously, we could end up with a corrupted canonical chain.

Description

There are two underlying problems:

  • forkchoice_update reads the latest block number outside the atomic write section, so two concurrent callers can observe the same 'current latest'
  • latest_block_header could end up in an inconsistent state

This PR fixes them by serializing writes with a mutex.

It also adds a regression test.
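For illustration, a minimal sketch of the locking shape (field and method names follow the review discussion below; the surrounding types are simplified placeholders, not the actual Store definition):

use std::sync::Arc;
use tokio::sync::Mutex;

struct Store {
    // Other fields elided.
    /// Serializes concurrent forkchoice_update callers.
    fcu_lock: Arc<Mutex<()>>,
}

impl Store {
    async fn forkchoice_update(&self, _new_head: u64) -> Result<(), String> {
        // tokio::sync::Mutex, because in the real code the guard is held across .await points.
        let _guard = self.fcu_lock.lock().await;
        // 1. Update the in-memory latest_block_header cache.
        // 2. Run the DB write transaction (forkchoice_update_inner).
        // Both now happen under the lock, so two callers can no longer interleave.
        Ok(())
    }
}

#[tokio::main]
async fn main() {
    let store = Store { fcu_lock: Arc::new(Mutex::new(())) };
    store.forkchoice_update(1).await.unwrap();
}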

@iovoid iovoid requested a review from a team as a code owner April 20, 2026 19:15
@github-actions github-actions Bot added L1 Ethereum client L2 Rollup client labels Apr 20, 2026
@github-actions

🤖 Codex Code Review

  1. High: the new delete_range is not correct for CANONICAL_BLOCK_HASHES, because that column family stores block numbers as u64::to_le_bytes(), while both RocksDB and the in-memory map apply range deletion in lexicographic byte order, not numeric order. [11, 0, ..] is lexicographically greater than [0, 1, ..], so reorging to head 10 will not delete block 256, and many higher canonical entries can survive incorrectly. This is a consensus-facing correctness bug in canonical chain maintenance. References: crates/storage/store.rs:1034, crates/storage/store.rs:1036, crates/storage/api/tables.rs:4, crates/storage/backend/rocksdb.rs:368, crates/storage/backend/in_memory.rs:184. Suggested fix: do not use range delete on little-endian numeric keys; either keep explicit numeric deletes, or migrate this CF to a sortable encoding such as big-endian before introducing range operations.

  2. Medium: the added regression test would not catch Item 1. It only runs against EngineType::InMemory and only uses block heights 10..15, where little-endian byte order still happens to match numeric order, so the test can pass while the RocksDB implementation is wrong for real chain heights. References: test/tests/storage/fcu_race_tests.rs:30, test/tests/storage/fcu_race_tests.rs:31, test/tests/storage/fcu_race_tests.rs:96, test/tests/storage/fcu_race_tests.rs:98. This should add at least one case that crosses 255 -> 256, and ideally exercise the RocksDB backend too.

I did not run cargo test: the environment’s rustup install tried to write under /home/runner/.rustup/tmp and failed with a read-only filesystem error.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions

github-actions Bot commented Apr 20, 2026

Lines of code report

Total lines added: 9
Total lines removed: 0
Total lines changed: 9

Detailed view
+--------------------------------+-------+------+
| File                           | Lines | Diff |
+--------------------------------+-------+------+
| ethrex/crates/storage/store.rs | 2601  | +9   |
+--------------------------------+-------+------+

@greptile-apps

greptile-apps Bot commented Apr 20, 2026

Greptile Summary

This PR addresses a TOCTOU race in forkchoice_update by adding a tokio::sync::Mutex to serialize concurrent callers and replacing the loop-based canonical-entry deletion with a delete_range call to eliminate the pre-read of LatestBlockNumber. The mutex serialization is correct, but the delete_range approach introduces a new correctness bug: CANONICAL_BLOCK_HASHES keys are stored as little-endian u64, which does not sort lexicographically by numeric value — causing RocksDB's (and the in-memory backend's) byte-wise comparator to silently skip entries whose LE first byte is numerically smaller than (head_number+1)'s first byte (e.g., block 65536 survives a reorg to head 50000). The regression test only uses block numbers ≤ 15, so this breakage is not caught by the new test.

Confidence Score: 3/5

Not safe to merge: the delete_range fix reintroduces canonical-chain corruption for reorgs spanning a byte-boundary in LE-encoded block numbers.

The mutex approach correctly serializes forkchoice_update callers and the TOCTOU race is fixed at the concurrency level. However, the new delete_range on LE-encoded keys is semantically incorrect with RocksDB's default lexicographic comparator, leaving orphan canonical entries for block numbers >= 256 in certain reorg scenarios. The regression test's small block numbers (<=15) cannot catch this. This is a P1 data-integrity issue that needs to be resolved before merging.

crates/storage/backend/rocksdb.rs and crates/storage/backend/in_memory.rs both need the delete_range key-ordering fixed; test/tests/storage/fcu_race_tests.rs should be extended to cover block numbers above 256.

Important Files Changed

crates/storage/store.rs: Adds fcu_lock mutex to serialize concurrent forkchoice_update calls and removes the TOCTOU-prone load_latest_block_number read. The mutex approach is correct, but the delete_range bounds rely on LE key ordering which is incorrect for RocksDB's lexicographic comparator.
crates/storage/backend/rocksdb.rs: Implements delete_range via WriteBatch::delete_range_cf, but CANONICAL_BLOCK_HASHES uses little-endian u64 keys while RocksDB's default comparator is lexicographic — the range will silently miss entries whose LE first byte falls below start_key's first byte.
crates/storage/backend/in_memory.rs: Implements delete_range using Rust slice lexicographic comparison via retain; suffers the same LE byte-ordering bug as the RocksDB path for block numbers ≥ 256.
crates/storage/api/mod.rs: Adds delete_range to the StorageWriteBatch trait with a clear half-open [start, end) contract; clean API surface.
test/tests/storage/fcu_race_tests.rs: Good regression test for the TOCTOU race; however all block numbers are ≤ 15 (BASE+EXT), so the LE byte-ordering bug in delete_range is never exercised.
crates/storage/Cargo.toml: Adds sync feature to the tokio dependency to enable tokio::sync::Mutex; correct and minimal change.
test/tests/storage/mod.rs: Registers the new fcu_race_tests module; trivial plumbing change.

Sequence Diagram

sequenceDiagram
    participant CallerA
    participant CallerB
    participant fcu_lock as fcu_lock (tokio::Mutex)
    participant Store
    participant DB as RocksDB / InMemory

    CallerA->>fcu_lock: lock().await → _guard
    Note over fcu_lock: CallerB blocks here
    CallerA->>Store: latest_block_header.update(new_head_A)
    CallerA->>Store: forkchoice_update_inner(...)
    Store->>DB: txn.put(new_canonical_blocks)
    Store->>DB: txn.delete_range([head+1, u64::MAX)) ⚠️ LE key ordering
    Store->>DB: txn.put(LatestBlockNumber = head_A)
    Store->>DB: txn.commit()
    CallerA->>fcu_lock: drop(_guard)

    CallerB->>fcu_lock: lock().await → _guard
    CallerB->>Store: latest_block_header.update(new_head_B)
    CallerB->>Store: forkchoice_update_inner(...)
    Store->>DB: txn.put / delete_range / commit
    CallerB->>fcu_lock: drop(_guard)


Comment thread crates/storage/backend/rocksdb.rs Outdated
Comment on lines +376 to +382
.cf_handle(table)
.ok_or_else(|| StoreError::Custom(format!("Table {} not found", table)))?;

self.batch.delete_range_cf(&cf, start_key, end_key);
Ok(())
}


P1 Little-endian keys are not monotone under lexicographic order — delete_range will silently skip entries

RocksDB's default comparator is lexicographic (byte-by-byte, left to right). Little-endian u64 encoding does not produce a lexicographically sorted sequence: for example, 256 encodes as [0x00, 0x01, 0, …], which sorts before 11's [0x0B, 0, …] in lex order. As a result, delete_range_cf with start = (head+1).to_le_bytes() will miss any canonical entry whose LE first byte is smaller than the first byte of head+1, leaving orphan canonical entries—exactly the corruption this PR is trying to eliminate.

Concrete failure scenario: chain extends to block 65536 ([0x00, 0x00, 0x01, 0, …]), then reorgs to head = 50000 (delete_start [0x51, 0xC3, 0, …]). Lex: [0x00, 0x00, 0x01, …] < [0x51, 0xC3, …], so block 65536 survives the delete_range and the canonical table remains inconsistent.

The pre-existing loop-based deletion was correct because it generated each exact key. The fix is to either use big-endian encoding (which is lexicographically monotone) or restore the enumeration approach.
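A quick standalone check of the ordering property behind the big-endian suggestion (illustrative only, not part of the diff):

fn main() {
    let (small, large): (u64, u64) = (11, 256);
    // Little-endian: 256 encodes as [0x00, 0x01, ..], which sorts *before* 11's [0x0B, ..].
    assert!(large.to_le_bytes() < small.to_le_bytes());
    // Big-endian byte strings compare in the same order as the numbers they encode.
    assert!(small.to_be_bytes() < large.to_be_bytes());
}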


Comment thread crates/storage/backend/in_memory.rs Outdated

let db_mut = Arc::make_mut(&mut *db);
if let Some(table_ref) = db_mut.get_mut(table) {
table_ref.retain(|key, _| key.as_slice() < start_key || key.as_slice() >= end_key);

P1 Same LE byte-ordering bug as the RocksDB path

Rust's PartialOrd on &[u8] is also lexicographic. The retain predicate key.as_slice() < start_key || key.as_slice() >= end_key applies the same byte-by-byte ordering and will preserve entries whose LE first byte is smaller than start_key's first byte, even if their numeric value is above the new head. The regression test uses block numbers ≤ 15, which all fit in the first byte, so the test never exercises numbers ≥ 256 where the bug manifests.
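One way to make the retain predicate numeric rather than lexicographic, sketched under the assumption that keys in this table are exactly eight little-endian bytes (not the crate's actual code):

fn keep_canonical(key: &[u8], head_number: u64) -> bool {
    match <[u8; 8]>::try_from(key) {
        // Decode the LE block number and compare numerically, not byte-wise.
        Ok(bytes) => u64::from_le_bytes(bytes) <= head_number,
        // Keys of unexpected length are left untouched in this sketch.
        Err(_) => true,
    }
}

fn main() {
    assert!(keep_canonical(&255u64.to_le_bytes(), 255));    // at or below head: kept
    assert!(!keep_canonical(&65536u64.to_le_bytes(), 255)); // above head: dropped
}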


Comment thread crates/storage/store.rs Outdated
Comment on lines +1034 to +1036
let delete_start = (head_number + 1).to_le_bytes();
let delete_end = u64::MAX.to_le_bytes();
txn.delete_range(CANONICAL_BLOCK_HASHES, &delete_start, &delete_end)?;

P1 u64::MAX.to_le_bytes() is not a safe upper-bound sentinel for LE-encoded RocksDB keys

Even if the byte-order issue were resolved, using u64::MAX.to_le_bytes() ([0xFF; 8]) as the exclusive end of a delete_range assumes that [0xFF; 8] is the lexicographic maximum 8-byte key. With big-endian encoding that assumption holds; with little-endian it does not, so this sentinel is conceptually tied to byte ordering. Additionally, delete_range semantics are [start, end), meaning a canonical entry at block u64::MAX (key [0xFF; 8]) would not be removed — but that's an academic concern for this context.


@github-actions

🤖 Claude Code Review



PR Review: fix(l1,l2): serialize forkchoice_update (#6508)

Overview

This PR addresses a TOCTOU race in forkchoice_update where two concurrent callers could observe the same LatestBlockNumber and leave orphaned canonical entries. The fix has two parts:

  • Replace per-key deletion with a delete_range on CANONICAL_BLOCK_HASHES to make the inner function self-contained (no read outside the transaction).
  • Add a tokio::sync::Mutex (fcu_lock) to serialize concurrent callers and protect the latest_block_header cache update.

The concurrency diagnosis is correct and the mutex is the right tool. However, there is a critical correctness bug in the delete_range implementation.


Critical Bug: delete_range is wrong for little-endian keys

This is the most important finding.

CANONICAL_BLOCK_HASHES keys are block numbers encoded as to_le_bytes(). Both the RocksDB backend (default bytewise/memcmp comparator — no custom comparator is set in rocksdb.rs) and the in-memory backend (retain with byte-slice comparison) use lexicographic ordering, which does not match numeric order for little-endian values.

For any n, le8(n) has:

  • byte 0 = n & 0xFF
  • byte 1 = (n >> 8) & 0xFF

Lexicographic comparison inspects byte 0 first. Because LE flips significance, le8(255) = [0xFF, 0x00, …] is greater than le8(256) = [0x00, 0x01, …] under memcmp, even though 255 < 256.

The impact for head_number = 255:

delete_start = le8(256) = [0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]

A key K = le8(n) satisfies K >= delete_start iff K[0] > 0x00 OR (K[0] == 0x00 AND K[1] >= 0x01). This means:

  • Blocks 1–255 (le8 first byte 0x01–0xFF > 0x00) → incorrectly deleted (they should stay canonical)
  • Block 0 → correctly preserved (it's head-or-before)
  • Block 65536 → le8(65536) = [0x00, 0x00, 0x01, …] → K[0]=0x00, K[1]=0x00 < 0x01 → incorrectly preserved (it is above the new head, so it should have been deleted)

In other words: delete_range(le8(head+1), le8(u64::MAX)) deletes every block whose number is not a multiple of 65536 rather than every block greater than head. For a reorg to block 255, this silently destroys the entire canonical chain from block 1 to 255.
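The same argument as a standalone check (plain array comparison stands in for RocksDB's bytewise comparator; this is not code from the PR):

fn main() {
    let head: u64 = 255;
    let start = (head + 1).to_le_bytes(); // le8(256) = [0x00, 0x01, 0, ..]
    let end = u64::MAX.to_le_bytes();     // [0xFF; 8]
    // What a lexicographic delete_range over [start, end) would remove.
    let deleted = |n: u64| {
        let k = n.to_le_bytes();
        k >= start && k < end
    };
    assert!(deleted(1) && deleted(255)); // should stay canonical, but are removed
    assert!(!deleted(65536));            // should be removed, but survives
    assert!(!deleted(0));                // correctly preserved
}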

store.rs:1034–1036 — the affected lines:

let delete_start = (head_number + 1).to_le_bytes();
let delete_end = u64::MAX.to_le_bytes();
txn.delete_range(CANONICAL_BLOCK_HASHES, &delete_start, &delete_end)?;

The same logic flaw applies to in_memory.rs:197's retain call.

Why the regression test doesn't catch this: BASE = 10, EXT = 5 — all block numbers are < 256. For single-byte numbers, the LE first byte is the number itself, so lexicographic order coincidentally matches numeric order in this range. The test gives false confidence.

Fix options (in order of preference):

  1. Switch CANONICAL_BLOCK_HASHES keys to big-endian (to_be_bytes()): BE encoding preserves numeric order lexicographically, making range deletes correct. This is a one-time DB migration but the cleanest long-term solution.

  2. Keep per-key deletion but move the latest read inside the mutex (before calling forkchoice_update_inner), then pass it in:

    let _guard = self.fcu_lock.lock().await;
    let latest = self.load_latest_block_number().await?.unwrap_or(0);
    self.forkchoice_update_inner(..., latest).await?;

    Since the mutex is held, no other FCU can race between the read and the write. This avoids any range-ordering dependency.


Secondary Issues

forkchoice_update_inner is pub (store.rs:1012)

The mutex is only acquired in forkchoice_update. If any caller invokes forkchoice_update_inner directly, they bypass serialization entirely. Since the function is pub, external crates can call it. It should be pub(crate) at most, or better yet unexposed (fn).

u64 overflow in head_number + 1 (store.rs:1034)

If head_number == u64::MAX, this panics in debug mode or wraps in release. Use head_number.saturating_add(1) or an explicit check. Practically unlikely, but worth fixing for robustness.

latest_block_header update happens before the DB write (store.rs:2258–2261)

This ordering is intentional (referenced comment from issue #3927), but within the mutex the in-memory cache is updated optimistically before forkchoice_update_inner completes. If forkchoice_update_inner returns an error, the cache is now inconsistent with the DB. Consider reverting the cache update on error, or updating it after a successful commit.
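A hedged sketch of the revert-on-error option (every type here is a simplified stand-in, not the crate's actual cache or error types):

use std::sync::Mutex;

#[derive(Clone)]
struct Header { number: u64 }

struct Store {
    latest_block_header: Mutex<Header>, // stand-in for the real header cache
}

impl Store {
    fn forkchoice_update(
        &self,
        new_head: Header,
        db_write: impl FnOnce() -> Result<(), String>,
    ) -> Result<(), String> {
        // Remember the previous cached head before the optimistic update.
        let previous = self.latest_block_header.lock().unwrap().clone();
        *self.latest_block_header.lock().unwrap() = new_head;
        if let Err(e) = db_write() {
            // Roll the cache back so memory and DB stay consistent on failure.
            *self.latest_block_header.lock().unwrap() = previous;
            return Err(e);
        }
        Ok(())
    }
}

fn main() {
    let store = Store { latest_block_header: Mutex::new(Header { number: 10 }) };
    let result = store.forkchoice_update(Header { number: 11 }, || Err("simulated DB failure".into()));
    assert!(result.is_err());
    assert_eq!(store.latest_block_header.lock().unwrap().number, 10); // cache restored
}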

In-memory delete_range is not deferred to commit (in_memory.rs:184–200)

The existing delete method also acquires the lock immediately rather than deferring. The pre-existing FIXME on commit() ("in-memory writes aren't atomic") covers this, and the new method is consistent with existing behavior — just worth noting the atomicity gap still exists.


Positive Notes

  • Adding tokio::sync::Mutex (not std::sync::Mutex) is correct since the lock spans .await points.
  • The Arc clone pattern for the fcu_lock field is consistent with the rest of the Store struct.
  • The test structure (seed → race two FCUs → cleanup probe) is a sound regression design; it just needs larger block numbers to cover the LE ordering edge cases.
  • Adding "sync" to the tokio feature flags is the right minimal change.

Summary

The mutex serialization is sound and addresses the stated race. The delete_range approach is broken for any chain that has crossed block 255 and is the primary blocker for merging. The test needs block numbers > 255 to be meaningful. Recommend resolving the delete_range strategy (option 2 above is the lowest-risk change) and making forkchoice_update_inner non-public before merge.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

🤖 Kimi Code Review

This is a high-quality fix for a subtle TOCTOU race condition in the forkchoice update path. The implementation correctly addresses the safety issue while maintaining good documentation.

Critical Correctness: Race Condition Fix

  • Line 2256 (store.rs): The tokio::sync::Mutex correctly serializes concurrent FCU callers. This prevents the interleaving where two threads read the same latest value outside the write transaction, then both commit with stale delete ranges, leaving orphaned canonical entries.
  • Line 1020 (store.rs): The comment accurately describes the safety invariant—the lock ensures the read of LatestBlockNumber and the subsequent commit are atomic with respect to other FCU operations.

Storage Layer Safety: Delete Range Logic

  • Lines 1031-1036 (store.rs): Excellent documentation of the little-endian encoding trap. The explicit iteration over (head_number + 1)..=latest is correct—u64::to_le_bytes() creates non-lexicographic keys where block 256 ([0x00, 0x01]) sorts before block 11 ([0x0B]), making RocksDB's delete_range unsafe here.
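For reference, the shape of that enumeration, sketched against a BTreeMap standing in for the storage backend (the key encoding matches the table; everything else is a placeholder, not the crate's API):

use std::collections::BTreeMap;

// key = u64::to_le_bytes(block_number), value = canonical block hash
fn delete_above_head(canonical: &mut BTreeMap<Vec<u8>, [u8; 32]>, head_number: u64, latest: u64) {
    for number in (head_number + 1)..=latest {
        // Each key is generated exactly, so the backend's comparator never matters.
        canonical.remove(number.to_le_bytes().as_slice());
    }
}

fn main() {
    let mut canonical: BTreeMap<Vec<u8>, [u8; 32]> = [0u64, 250, 255, 256, 65536]
        .into_iter()
        .map(|n| (n.to_le_bytes().to_vec(), [0u8; 32]))
        .collect();
    delete_above_head(&mut canonical, 255, 65536);
    // Only blocks <= 255 remain, regardless of how their LE keys sort byte-wise.
    assert!(canonical.keys().all(|k| u64::from_le_bytes(k.as_slice().try_into().unwrap()) <= 255));
    assert_eq!(canonical.len(), 3);
}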

Concurrency Design

  • Using tokio::sync::Mutex (async-aware) is appropriate since the operation spans async I/O (DB writes via spawn_blocking). The lock is held for the entire duration of forkchoice_update_inner, which is necessary for correctness but acceptable given FCU is not on the hot path for transaction processing.
  • Line 1520 (store.rs): Correct initialization of the new lock field.

Test Coverage

  • fcu_race_tests.rs: Well-designed stress test that actually triggers the race condition using a multi-threaded runtime. The test logic is sound:
    1. Seeds canonical entries at block BASE (250)
    2. Spawns concurrent extend (to 260) and no-op FCUs
    3. Verifies cleanup removes all entries above BASE
  • The choice of BASE=250 with EXT=10 deliberately crosses the byte boundary at 256 to guard against future refactors attempting lexicographic range deletions.

Minor Suggestions

  1. Test assertion style: In race_iteration, returning true for "race detected" is slightly counterintuitive (vs Result or assert!), but the usage is unambiguous.
  2. Lock ordering: Ensure no other code paths acquire locks inside forkchoice_update_inner while holding fcu_lock to prevent potential deadlocks. The current implementation appears safe as the DB operations use spawn_blocking without holding other store locks.

Summary: The fix is correct, well-documented, and properly tested. The endianness comment is particularly valuable for preventing future regressions.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions

🤖 Claude Code Review



PR #6508 Review: fix(l1,l2): serialize forkchoice_update

Overview

This PR fixes a TOCTOU (time-of-check/time-of-use) race in Store::forkchoice_update where concurrent callers could interleave: (a) the LatestBlockNumber read (used to bound the canonical-entry delete loop) and (b) the DB write transaction and in-memory cache update. The fix adds a tokio::sync::Mutex (fcu_lock) that serializes all callers end-to-end. A focused regression test is included.

The approach is sound. Below are specific findings.


Correctness

forkchoice_update_inner is pub — lock contract unenforced (store.rs:1012)

pub async fn forkchoice_update_inner(...)

The invariant "the caller must hold fcu_lock" exists only in a comment. Any code that calls forkchoice_update_inner directly skips the mutex entirely. Currently there is only one caller (line 2262), but pub visibility invites future misuse.

Recommend: change to pub(crate) or private, or add a #[doc(hidden)] attribute plus a strong # Safety doc comment that makes the invariant explicit. If nothing external needs it, making it private is the safest option.


Cache updated before DB commit — error path leaves state inconsistent (store.rs:2257–2269)

let _guard = self.fcu_lock.lock().await;

// Cache updated FIRST
self.latest_block_header.update(new_head);

// DB write happens SECOND — if this fails, the cache is stale
self.forkchoice_update_inner(...).await?;

If forkchoice_update_inner returns an error, latest_block_header now points to a block that the DB does not consider the canonical head. Subsequent nonce/state reads will be wrong until the next successful FCU.

The PR description cites issue #3927 as the reason for this ordering (cache-first avoids nonce inconsistencies during sync), but that rationale should be evaluated against the error case. At minimum, a // SAFETY: comment explaining why the stale-on-error case is acceptable would help future readers.


unwrap_or(0) in forkchoice_update_inner (store.rs:1020)

let latest = self.load_latest_block_number().await?.unwrap_or(0);

When latest is None (empty store), head_number + 1 is typically ≥ 1, so the delete range (head_number+1)..=0 is empty — safe. But calling FCU when head_number > 0 on an uninitialized store returns 0 and skips all cleanup. This edge case appears impossible in practice (genesis is always block 0), but a debug_assert!(head_number == 0 || latest.is_some()) would make the invariant explicit.


Design

tokio::sync::Mutex choice is correct. The lock is held across multiple .await points (load_latest_block_number().await, spawn_blocking(...).await). Using std::sync::Mutex here would be unsound (cannot hold across await). Good.

let _guard naming is correct. A named binding like _guard keeps the guard alive until the end of the scope, whereas a bare _ would drop it immediately and release the lock. No issue here.
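A quick illustration of that distinction with a std mutex (the behavior is the same for the tokio guard):

use std::sync::Mutex;

fn main() {
    let lock = Mutex::new(());
    let _ = lock.lock().unwrap(); // bound to `_`: the guard is dropped right away
    assert!(lock.try_lock().is_ok(), "lock was already released");

    let _guard = lock.lock().unwrap(); // named binding: held until the end of scope
    assert!(lock.try_lock().is_err(), "lock is still held");
}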

Single constructor path. fcu_lock is initialized only in from_backend (line 1514), which is the single path that all public constructors flow through. No risk of a Store instance missing the field.


LE-encoding note in the diff

The added comment explains why delete_range is not used:

// `delete_range` is not safe here: keys are `u64::to_le_bytes()`, and
// RocksDB's lexicographic comparator does not match LE numeric order
// (e.g. block 256 = [0x00, 0x01, ..] sorts before block 11 = [0x0B, ..]),
// so a range-delete would silently miss blocks whose LE first byte is
// smaller than `head+1`'s first byte.

This is technically correct and is a valuable warning for future maintainers. The explanation clearly conveys the subtle LE vs. lexicographic mismatch. Worth keeping.


Test

test/tests/storage/fcu_race_tests.rs — well-structured.

  • #[tokio::test(flavor = "multi_thread", worker_threads = 4)] correctly exercises real concurrent scheduling.
  • The seed → race → cleanup → assert pattern is clear.
  • BASE=250, EXT=10 spans block 256, the boundary where LE byte representations overflow a single byte. This guards against someone silently breaking things by swapping the loop for a delete_range in a future refactor. The choice is clever and worth the inline explanation.
  • 100 iterations is a reasonable signal given the pre-fix hit rate of ~100%.

One minor gap: the test verifies the DB consistency (no orphan canonical entries) but does not assert cache consistency — i.e., that latest_block_header in memory matches the DB after concurrent FCUs. Adding that assertion would cover the second issue cited in the PR description ("latest_block_header could end up in an inconsistent state").


Summary

Finding | Severity | Pre-existing?
forkchoice_update_inner is pub, bypasses lock | Medium | No
Cache updated before DB commit; error leaves state dirty | Low–Medium | Yes, but PR touches this code
Test doesn't cover cache consistency | Low | No
unwrap_or(0) edge case undocumented | Very low | Yes

The core fix (tokio Mutex serialization) is correct and the regression test is solid. The main actionable item is restricting forkchoice_update_inner visibility so the lock invariant is enforced rather than documented.


Automated review by Claude (Anthropic) · sonnet · custom prompt

@github-actions

🤖 Codex Code Review

Findings

  1. Cache/DB inconsistency still exists on the error path in store.rs line 2257. forkchoice_update() updates latest_block_header before awaiting the DB transaction, and if forkchoice_update_inner() fails, the cache is left pointing at a head that never committed. That is externally observable through store.rs line 985 and the fast-path in store.rs line 1168. The new fcu_lock only serializes callers; it does not make cache and DB state atomic. I’d either move the cache update after a successful commit, or restore the previous cached header on error while still holding the lock.

  2. The new regression test can pass spuriously because it ignores both spawned task results at fcu_race_tests.rs line 69, fcu_race_tests.rs line 73, and fcu_race_tests.rs line 77. If either forkchoice_update() returns Err or the task panics, the test keeps going and may report a false negative. This should assert both JoinHandles completed successfully and both inner results are Ok(()).
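A runnable sketch of the tightened assertions (the two async blocks stand in for the real forkchoice_update calls; this is not the test's actual code):

#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    let extend = tokio::spawn(async { Ok::<(), String>(()) }); // stands in for the extend FCU
    let noop = tokio::spawn(async { Ok::<(), String>(()) });   // stands in for the no-op FCU
    // First layer: the task must not have panicked. Second layer: the FCU itself must return Ok.
    extend.await.expect("extend FCU task panicked").expect("extend FCU returned Err");
    noop.await.expect("no-op FCU task panicked").expect("no-op FCU returned Err");
}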

Notes

The per-key delete in forkchoice_update_inner looks like the right correctness fix for LE-encoded block numbers; replacing delete_range there is justified.

I wasn’t able to run the tests in this environment because cargo failed trying to update the Rust toolchain in a read-only rustup temp dir.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

Contributor

@ElFantasma ElFantasma left a comment


Solid TOCTOU fix. The mutex serializes the cache update + DB write so they can't interleave. The cache rollback on DB-write failure is the right complement — without it, a failed FCU would leave the cache pointing at a non-canonical head.

The regression test is excellent: the BASE=250, EXT=10 choice deliberately spans block 256 to catch any future refactor that tries to replace the per-block delete loop with a delete_range, since LE-encoded keys aren't lexicographically monotone. The threshold calibration (4999/5000 hits pre-fix) gives confidence the test would actually catch a regression.

The inline comment about why delete_range isn't safe is gold — exactly the kind of trap that bites people six months later.

Comment thread crates/storage/store.rs
}

// Delete canonical entries above the new head by enumerating each key.
// `delete_range` is not safe here: keys are `u64::to_le_bytes()`, and

This comment is gold — exactly the kind of trap that bites people six months later.

@github-project-automation github-project-automation Bot moved this to In Review in ethrex_l1 May 6, 2026
@iovoid iovoid added this pull request to the merge queue May 7, 2026
Merged via the queue into main with commit 063b8be May 7, 2026
73 of 75 checks passed
@iovoid iovoid deleted the fix-fcu-race branch May 7, 2026 21:27
@github-project-automation github-project-automation Bot moved this from In Review to Done in ethrex_l1 May 7, 2026
@github-project-automation github-project-automation Bot moved this to Done in ethrex_l2 May 7, 2026

Labels

L1 Ethereum client L2 Rollup client

Projects

Status: Done (ethrex_l1)
Status: Done (ethrex_l2)
