
[server][dvc] Fix post-blob RTS regression in reportIfCatchUpVersionTopicOffset (cache eviction + leader-complete gate)#2779

Open
sushantmane wants to merge 2 commits into linkedin:main from sushantmane:sumane/catch-up-vt-offset-fresh-fetch-and-gate

Conversation

Contributor

@sushantmane sushantmane commented May 7, 2026

Summary

Two narrow fixes for a regression observed on a hybrid AA store after Helix LOAD_REBALANCE + blob-transfer bootstrap, where post-blob followers were marked READY_TO_SERVE before any leader-complete heartbeat arrived. The replicas then served reads while the per-record Venice.Server.Ingestion.Replication.Record.Delay OTel metric, tagged ready_to_serve, recorded stale producer timestamps from the donor's pre-bootstrap RT replay window, polluting the SLI dashboard.

Relationship to the companion PR #2773

This PR fixes what actually fired in production — the reportIfCatchUpVersionTopicOffset path inside the running JVM, observable as the Reported CATCH_UP_BASE_TOPIC_OFFSET_LAG log entry on every affected replica.

The companion PR #2773 is a defense-in-depth fix for an adjacent path — the fast-RTS code (checkFastReadyToServeForReplica → checkFastReadyToServeWithPreviousTimeLag). That path was not the trigger in the captured incident (the previouslyReadyToServe clear from #2715 successfully short-circuited it), but the donor's heartbeatTimestamp / lastCheckpointTimestamp are still persisted to disk verbatim during blob transfer. If a JVM crashes between disk-persist and the SIT-side in-memory clear, restart re-reads the donor record and the heartbeat-INVALID early-return is the only remaining barrier. #2773 hardens that crash-restart window by clearing the donor-clock fields at every observable persistence point.

Both PRs are independent — they touch different methods, fix different classes of artifact, and can land in either order. This PR is the load-bearing production fix; #2773 is belt-and-suspenders for the persistence-window variant.

Root cause

Bug location: LeaderFollowerStoreIngestionTask.reportIfCatchUpVersionTopicOffset.

  1. It reads the live VT end-position via TopicManager.getLatestPositionCached, which can return a value up to server.source.topic.offset.check.interval.ms stale (default 60s). On a fresh post-blob replica that has not seen any read traffic, the cache often had no entry or a pre-bootstrap value, so lag <= 0 evaluated as "caught up" while records were actually pending on the wire.

  2. After confirming lag <= 0, the method calls pcs.lagHasCaughtUp() + reportCompleted(force=true) without any leader-complete gating on hybrid followers.
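The stale-cache arithmetic in (1) can be illustrated with a minimal sketch. This is illustrative only — the offsets, the `lag` helper, and its signature are assumptions for this example, not Venice's actual API — but it shows how a cached end-position that predates bootstrap makes the computed lag non-positive while records are still pending:

```java
// Illustrative model only: a stale cached VT end-position yields a false
// "caught up" verdict. Offsets and names are assumptions, not Venice code.
public class StaleCacheLagSketch {
    /** Lag as the catch-up check computes it: end-position minus consumed position. */
    public static long lag(long endPosition, long consumedPosition) {
        return endPosition - consumedPosition;
    }

    public static void main(String[] args) {
        long consumed = 100;    // post-blob replica's consumed VT position
        long cachedEnd = 100;   // stale cached end-position (pre-bootstrap value)
        long brokerEnd = 5_000; // true end-position on the broker

        // Stale cache: lag <= 0, replica is wrongly judged "caught up".
        System.out.println("cached lag = " + lag(cachedEnd, consumed)); // 0
        // Fresh broker fetch after eviction: the real lag is visible.
        System.out.println("fresh lag  = " + lag(brokerEnd, consumed)); // 4900
    }
}
```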

Captured timeline on one partition (sub-second precision, single contiguous JVM lifetime — no crash):

T+0      blob-transfer-bootstrap completes
T+10ms   PCS reinitialized; lagCaughtUp=false, leaderCompleteState=LEADER_NOT_COMPLETED
T+38ms   Reported CATCH_UP_BASE_TOPIC_OFFSET_LAG               <- reportIfCatchUpVersionTopicOffset fires
T+59ms   Reported COMPLETED                                    <- replica marked RTS,
                                                                   leaderCompleteState STILL LEADER_NOT_COMPLETED
T+431ms  LeaderCompleteState changed NOT_COMPLETED -> COMPLETED  (372ms after the replica was marked RTS)

Fixes in this PR

1. Cache eviction before lag measurement (SUBSCRIBE-time path only)

  • Expose TopicManager.invalidatePartitionPositionCache(PubSubTopicPartition). Previously this was only reachable as the heavy invalidateCache(PubSubTopic) which evicts every partition of the topic.
  • reportIfCatchUpVersionTopicOffset gains a forceCacheRefresh parameter:
    • SUBSCRIBE-time call (validateAndSubscribePartition) passes true → invalidates the partition's cache entry before measureLagWithCallToPubSub so the next read is a fresh broker round-trip. One-time-per-partition; cost is bounded.
    • Per-record call (processConsumerRecord) passes false → reuses the cached value. The latch-held catch-up window can hold thousands of records; forcing a broker round-trip per record would be a serious regression.
  • Closes the stale-cache false-positive case, the primary contributor to the observed spike.
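The narrow-versus-heavy eviction distinction reduces to the following sketch. The flat map keyed by a topic-partition string is an assumption for illustration; TopicManager's real cache layout will differ:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of topic-wide vs. per-partition eviction. The string-keyed map is an
// assumption for illustration, not TopicManager's real internal structure.
public class PositionCacheSketch {
    private final Map<String, Long> latestPositionByPartition = new ConcurrentHashMap<>();

    public void put(String topicPartition, long position) {
        latestPositionByPartition.put(topicPartition, position);
    }

    /** Heavy pre-existing path: evicts every partition of the topic. */
    public void invalidateCache(String topic) {
        latestPositionByPartition.keySet().removeIf(tp -> tp.startsWith(topic + "-"));
    }

    /** New narrow path: evicts only the one partition's entry. */
    public void invalidatePartitionPositionCache(String topicPartition) {
        latestPositionByPartition.remove(topicPartition);
    }

    public int size() {
        return latestPositionByPartition.size();
    }

    public static void main(String[] args) {
        PositionCacheSketch cache = new PositionCacheSketch();
        cache.put("store_v1-0", 100L);
        cache.put("store_v1-1", 200L);
        cache.put("store_v1-2", 300L);
        cache.invalidatePartitionPositionCache("store_v1-0"); // one entry evicted
        System.out.println(cache.size()); // 2
        cache.invalidateCache("store_v1");                    // whole topic evicted
        System.out.println(cache.size()); // 0
    }
}
```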

2. Optional leader-complete gate

  • New config server.require.leader.complete.for.catch.up.vt.rts (default false — opt-in).
  • When enabled: reportIfCatchUpVersionTopicOffset on a hybrid follower additionally requires the most recent leader-complete heartbeat header to be observed within server.leader.complete.state.check.in.follower.valid.interval.ms (default 5 min) — same window the existing checkAndLogIfLagIsAcceptableForHybridStore uses.
  • When disabled (default): pre-existing relax-completion behavior is preserved. The original Helix-rebalance edge case the comment in this method describes (no leader exists to send leader-complete heartbeats; cluster could end up with zero online replicas) is unaffected.
  • Closes the genuinely-quiet-VT case where fresh-fetch confirms lag <= 0 but no leader-complete signal has arrived yet. Operators who hit the post-blob-transfer regression flip this on per cluster.
  • Why default OFF: many existing tests exercise this RTS path without simulating leader-complete heartbeats; default ON would silently wedge them. Opting in matches how this safeguard should be rolled out — start on cert clusters, validate, expand.
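Under the assumption that the follower tracks the wall-clock time at which it last observed a leader-complete heartbeat header, the gate's decision boils down to a recency predicate like this (method and parameter names are invented for illustration, not the PR's actual helper):

```java
// Sketch of the opt-in gate's predicate. Names are hypothetical; the real
// helper lives in LeaderFollowerStoreIngestionTask and reads PCS state.
public class LeaderCompleteGateSketch {
    public static boolean gateAllowsRts(
            boolean gateEnabled,            // server.require.leader.complete.for.catch.up.vt.rts
            boolean leaderCompleted,        // last observed header says LEADER_COMPLETED
            long lastHeaderTimestampMs,     // when that header was observed
            long nowMs,
            long validIntervalMs) {         // ...leader.complete.state.check.in.follower.valid.interval.ms
        if (!gateEnabled) {
            return true; // default OFF: pre-existing relax-completion behavior
        }
        return leaderCompleted && (nowMs - lastHeaderTimestampMs) <= validIntervalMs;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        long fiveMin = 300_000L;
        System.out.println(gateAllowsRts(false, false, 0L, now, fiveMin));          // true: gate off
        System.out.println(gateAllowsRts(true, true, now - 1_000L, now, fiveMin));  // true: recent
        System.out.println(gateAllowsRts(true, true, now - 400_000L, now, fiveMin)); // false: stale
    }
}
```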

Why this is a different bug than #2715

#2715 (already merged) clears the previouslyReadyToServe flag in-memory in completePostTransferPSCUpdated, which gates the fast-RTS path (checkFastReadyToServeForReplica → checkFastReadyToServeWithPreviousTimeLag). That fix works correctly — verified empirically on the running version (no "will mark the replica ready-to-serve directly" log fires post-blob).

reportIfCatchUpVersionTopicOffset is a separate RTS-promotion path that fires in validateAndSubscribePartition immediately after checkConsumptionStateWhenStart. The Reported CATCH_UP_BASE_TOPIC_OFFSET_LAG log is the unique signature that confirms this method (and not fast-RTS) was the path that fired in the captured incident.

Testing Done

Five new unit tests in LeaderFollowerStoreIngestionTaskTest:

  • testReportIfCatchUpVersionTopicOffsetEvictsCacheBeforeLagCheckOnSubscribe — uses Mockito InOrder to assert invalidatePartitionPositionCache is called before measureLagWithCallToPubSub on the SUBSCRIBE-time path.
  • testReportIfCatchUpVersionTopicOffsetSkipsCacheEvictionOnPerRecordPath — asserts invalidatePartitionPositionCache is never called when forceCacheRefresh=false (the per-record path). Locks down the perf protection.
  • testReportIfCatchUpVersionTopicOffsetGateBlocksWhenLeaderNotCompleted — gate ON, hybrid follower, LEADER_NOT_COMPLETED → lagHasCaughtUp NOT called.
  • testReportIfCatchUpVersionTopicOffsetGateAllowsWhenLeaderCompleteRecent — gate ON, hybrid follower, recent LEADER_COMPLETED → lagHasCaughtUp called.
  • testReportIfCatchUpVersionTopicOffsetGateDisabledRestoresOldBehavior — gate OFF (default) → pre-existing behavior, lagHasCaughtUp called even with LEADER_NOT_COMPLETED.
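For readers unfamiliar with Mockito's InOrder, the eviction-ordering assertion in the first test amounts to this pattern (hand-rolled call recording for the sketch; the real test uses Mockito's InOrder verification):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the invalidate-before-measureLag ordering check. The real test
// verifies this with Mockito InOrder on the mocked TopicManager and task.
public class CallOrderSketch {
    public static boolean evictionBeforeMeasure(List<String> calls) {
        int evict = calls.indexOf("invalidatePartitionPositionCache");
        int measure = calls.indexOf("measureLagWithCallToPubSub");
        return evict >= 0 && measure >= 0 && evict < measure;
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        calls.add("invalidatePartitionPositionCache"); // SUBSCRIBE-time path evicts first
        calls.add("measureLagWithCallToPubSub");       // then measures with a fresh fetch
        System.out.println(evictionBeforeMeasure(calls)); // true
    }
}
```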

Existing testReportIfCatchUpVersionTopicOffset in StoreIngestionTaskTest continues to pass — it asserts the latch-state preconditions, which this fix does not change.

Verified locally: :clients:da-vinci-client:test --tests testReportIfCatchUpVersionTopicOffset* passes; testRecordLevelMetricForCurrentVersion (the canary that broke when the gate defaulted ON) also passes with the gate default-OFF.

Out of scope

Copilot AI review requested due to automatic review settings May 7, 2026 19:05

Copilot AI left a comment


Pull request overview

Fixes a post-blob-transfer READY_TO_SERVE (RTS) promotion regression in the server ingestion path by (a) avoiding stale version-topic end-position cache reads and (b) optionally requiring a recent leader-complete signal before allowing the “caught-up VT” RTS shortcut on hybrid followers.

Changes:

  • Added a TopicManager API to invalidate cached latest/earliest positions for a single topic-partition and used it before measuring VT lag.
  • Introduced server.require.leader.complete.for.catch.up.vt.rts (default true) and gated the RTS shortcut on a “leader-complete observed recently” check for hybrid followers.
  • Added unit tests covering cache eviction and leader-complete gating behaviors.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Files changed:

  • internal/venice-common/src/main/java/com/linkedin/venice/pubsub/manager/TopicManager.java: adds a per-partition cache invalidation method to force a fresh broker position fetch on the next read.
  • internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java: adds the config key and Javadoc for the new leader-complete gate.
  • clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java: evicts the per-partition position cache before lag measurement; adds the optional leader-complete recency gate and a helper method.
  • clients/da-vinci-client/src/main/java/com/linkedin/davinci/config/VeniceServerConfig.java: wires the new boolean config with default true and exposes an accessor.
  • clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTaskTest.java: adds unit tests for cache eviction and leader-complete gate enable/disable scenarios.


…der-complete gate default off

Three review-driven changes on top of the initial fix:

1. Cache invalidation is now scoped to the SUBSCRIBE-time path only.
   reportIfCatchUpVersionTopicOffset gains a forceCacheRefresh flag —
   true at validateAndSubscribePartition (one-time-per-partition, broker
   round-trip is bounded), false at the per-record call site (the latch-
   held catch-up window can hold thousands of records and a forced
   refresh on each would be a serious regression). Closes Copilot's
   per-record performance comment.

2. Leader-complete gate default flipped to OFF (opt-in). Many existing
   tests exercise the catch-up VT RTS path without simulating a leader-
   complete heartbeat; with the gate ON by default they hang at the
   metric-tier handoff (testRecordLevelMetricForCurrentVersion[0](false)
   was the canary — same wedge across all four SIT subclasses). The
   pre-existing un-gated behavior is the long-standing production
   default, so default-off preserves it; operators who hit the post-
   blob-transfer regression should turn the gate on per cluster.

3. Tests cover both branches now: a new
   testReportIfCatchUpVersionTopicOffsetSkipsCacheEvictionOnPerRecordPath
   asserts no invalidate fires when forceCacheRefresh=false, and the
   existing eviction test now uses Mockito InOrder to assert
   invalidate-before-measureLag explicitly (Copilot's ordering comment).
   Javadoc on SERVER_REQUIRE_LEADER_COMPLETE_FOR_CATCH_UP_VT_RTS
   reworded to refer to the Replica.State *dimension* on the metric
   Venice.Server.Ingestion.Replication.Record.Delay rather than the
   metric itself (Copilot's wording comment).
