
[server][dvc] Fix post-blob RTS regression in reportIfCatchUpVersionTopicOffset (cache eviction + leader-complete gate)#2779

Open
sushantmane wants to merge 2 commits into linkedin:main from sushantmane:sumane/catch-up-vt-offset-fresh-fetch-and-gate

Conversation

Contributor

@sushantmane sushantmane commented May 7, 2026

Summary

Two narrow fixes for a regression observed on a hybrid AA store after Helix LOAD_REBALANCE + blob-transfer bootstrap, where post-blob followers were marked READY_TO_SERVE before any leader-complete heartbeat arrived. The replicas then served reads while the per-record Venice.Server.Ingestion.Replication.Record.Delay OTel metric, tagged ready_to_serve, recorded stale producer timestamps from the donor's pre-bootstrap RT replay window, polluting the SLI dashboard.

Relationship to the companion PR #2773

This PR fixes what actually fired in production — the reportIfCatchUpVersionTopicOffset path inside the running JVM, observable as the Reported CATCH_UP_BASE_TOPIC_OFFSET_LAG log entry on every affected replica.

The companion PR #2773 is a defense-in-depth fix for an adjacent path — the fast-RTS code (checkFastReadyToServeForReplica → checkFastReadyToServeWithPreviousTimeLag). That path was not the trigger in the captured incident (the previouslyReadyToServe clear from #2715 successfully short-circuited it), but the donor's heartbeatTimestamp / lastCheckpointTimestamp are still persisted to disk verbatim during blob transfer. If a JVM crashes between disk-persist and the SIT-side in-memory clear, restart re-reads the donor record and the heartbeat-INVALID early-return is the only remaining barrier. #2773 hardens that crash-restart window by clearing the donor-clock fields at every observable persistence point.

Both PRs are independent — they touch different methods, fix different classes of artifact, and can land in either order. This PR is the load-bearing production fix; #2773 is belt-and-suspenders for the persistence-window variant.

Root cause

Bug location: LeaderFollowerStoreIngestionTask.reportIfCatchUpVersionTopicOffset.

  1. It reads the live VT end-position via TopicManager.getLatestPositionCached, which can return a value up to server.source.topic.offset.check.interval.ms stale (default 60s). On a fresh post-blob replica that has not seen any read traffic, the cache often had no entry or a pre-bootstrap value, so lag <= 0 evaluated as "caught up" while records were actually pending on the wire.

  2. After confirming lag <= 0, the method calls pcs.lagHasCaughtUp() + reportCompleted(force=true) without any leader-complete gating on hybrid followers.
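The stale-cache arithmetic in (1) can be illustrated with a minimal sketch. This is illustrative only — the offsets, the `lag` helper, and its signature are assumptions for this example, not Venice's actual API — but it shows how a cached end-position that predates bootstrap makes the computed lag non-positive while records are still pending:

```java
// Illustrative model only: a stale cached VT end-position yields a false
// "caught up" verdict. Offsets and names are assumptions, not Venice code.
public class StaleCacheLagSketch {
    /** Lag as the catch-up check computes it: end-position minus consumed position. */
    public static long lag(long endPosition, long consumedPosition) {
        return endPosition - consumedPosition;
    }

    public static void main(String[] args) {
        long consumed = 100;    // post-blob replica's consumed VT position
        long cachedEnd = 100;   // stale cached end-position (pre-bootstrap value)
        long brokerEnd = 5_000; // true end-position on the broker

        // Stale cache: lag <= 0, replica is wrongly judged "caught up".
        System.out.println("cached lag = " + lag(cachedEnd, consumed)); // 0
        // Fresh broker fetch after eviction: the real lag is visible.
        System.out.println("fresh lag  = " + lag(brokerEnd, consumed)); // 4900
    }
}
```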

Captured timeline on one partition (sub-second precision, single contiguous JVM lifetime — no crash):

T+0      blob-transfer-bootstrap completes
T+10ms   PCS reinitialized; lagCaughtUp=false, leaderCompleteState=LEADER_NOT_COMPLETED
T+38ms   Reported CATCH_UP_BASE_TOPIC_OFFSET_LAG               <- reportIfCatchUpVersionTopicOffset fires
T+59ms   Reported COMPLETED                                    <- replica marked RTS,
                                                                   leaderCompleteState STILL LEADER_NOT_COMPLETED
T+431ms  LeaderCompleteState changed NOT_COMPLETED -> COMPLETED  (372ms after the replica was marked RTS)

Fixes in this PR

1. Cache eviction before lag measurement (SUBSCRIBE-time path only)

  • Expose TopicManager.invalidatePartitionPositionCache(PubSubTopicPartition). Previously this was only reachable as the heavy invalidateCache(PubSubTopic) which evicts every partition of the topic.
  • reportIfCatchUpVersionTopicOffset gains a forceCacheRefresh parameter:
    • SUBSCRIBE-time call (validateAndSubscribePartition) passes true → invalidates the partition's cache entry before measureLagWithCallToPubSub so the next read is a fresh broker round-trip. One-time-per-partition; cost is bounded.
    • Per-record call (processConsumerRecord) passes false → reuses the cached value. The latch-held catch-up window can hold thousands of records; forcing a broker round-trip per record would be a serious regression.
  • Closes the stale-cache false-positive case, the primary contributor to the observed spike.
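The narrow-versus-heavy eviction distinction reduces to the following sketch. The flat map keyed by a topic-partition string is an assumption for illustration; TopicManager's real cache layout will differ:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of topic-wide vs. per-partition eviction. The string-keyed map is an
// assumption for illustration, not TopicManager's real internal structure.
public class PositionCacheSketch {
    private final Map<String, Long> latestPositionByPartition = new ConcurrentHashMap<>();

    public void put(String topicPartition, long position) {
        latestPositionByPartition.put(topicPartition, position);
    }

    /** Heavy pre-existing path: evicts every partition of the topic. */
    public void invalidateCache(String topic) {
        latestPositionByPartition.keySet().removeIf(tp -> tp.startsWith(topic + "-"));
    }

    /** New narrow path: evicts only the one partition's entry. */
    public void invalidatePartitionPositionCache(String topicPartition) {
        latestPositionByPartition.remove(topicPartition);
    }

    public int size() {
        return latestPositionByPartition.size();
    }

    public static void main(String[] args) {
        PositionCacheSketch cache = new PositionCacheSketch();
        cache.put("store_v1-0", 100L);
        cache.put("store_v1-1", 200L);
        cache.put("store_v1-2", 300L);
        cache.invalidatePartitionPositionCache("store_v1-0"); // one entry evicted
        System.out.println(cache.size()); // 2
        cache.invalidateCache("store_v1");                    // whole topic evicted
        System.out.println(cache.size()); // 0
    }
}
```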

2. Optional leader-complete gate

  • New config server.require.leader.complete.for.catch.up.vt.rts (default false — opt-in).
  • When enabled: reportIfCatchUpVersionTopicOffset on a hybrid follower additionally requires the most recent leader-complete heartbeat header to be observed within server.leader.complete.state.check.in.follower.valid.interval.ms (default 5 min) — same window the existing checkAndLogIfLagIsAcceptableForHybridStore uses.
  • When disabled (default): pre-existing relax-completion behavior is preserved. The original Helix-rebalance edge case the comment in this method describes (no leader exists to send leader-complete heartbeats; cluster could end up with zero online replicas) is unaffected.
  • Closes the genuinely-quiet-VT case where fresh-fetch confirms lag <= 0 but no leader-complete signal has arrived yet. Operators who hit the post-blob-transfer regression flip this on per cluster.
  • Why default OFF: many existing tests exercise this RTS path without simulating leader-complete heartbeats; default ON would silently wedge them. Opting in matches how this safeguard should be rolled out — start on cert clusters, validate, expand.
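Under the assumption that the follower tracks the wall-clock time at which it last observed a leader-complete heartbeat header, the gate's decision boils down to a recency predicate like this (method and parameter names are invented for illustration, not the PR's actual helper):

```java
// Sketch of the opt-in gate's predicate. Names are hypothetical; the real
// helper lives in LeaderFollowerStoreIngestionTask and reads PCS state.
public class LeaderCompleteGateSketch {
    public static boolean gateAllowsRts(
            boolean gateEnabled,            // server.require.leader.complete.for.catch.up.vt.rts
            boolean leaderCompleted,        // last observed header says LEADER_COMPLETED
            long lastHeaderTimestampMs,     // when that header was observed
            long nowMs,
            long validIntervalMs) {         // ...leader.complete.state.check.in.follower.valid.interval.ms
        if (!gateEnabled) {
            return true; // default OFF: pre-existing relax-completion behavior
        }
        return leaderCompleted && (nowMs - lastHeaderTimestampMs) <= validIntervalMs;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        long fiveMin = 300_000L;
        System.out.println(gateAllowsRts(false, false, 0L, now, fiveMin));          // true: gate off
        System.out.println(gateAllowsRts(true, true, now - 1_000L, now, fiveMin));  // true: recent
        System.out.println(gateAllowsRts(true, true, now - 400_000L, now, fiveMin)); // false: stale
    }
}
```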

Why this is a different bug than #2715

#2715 (already merged) clears the previouslyReadyToServe flag in-memory in completePostTransferPSCUpdated, which gates the fast-RTS path (checkFastReadyToServeForReplica → checkFastReadyToServeWithPreviousTimeLag). That fix works correctly — verified empirically on the running version (no "will mark the replica ready-to-serve directly" log fires post-blob).

reportIfCatchUpVersionTopicOffset is a separate RTS-promotion path that fires in validateAndSubscribePartition immediately after checkConsumptionStateWhenStart. The Reported CATCH_UP_BASE_TOPIC_OFFSET_LAG log is the unique signature that confirms this method (and not fast-RTS) was the path that fired in the captured incident.

Testing Done

Five new unit tests in LeaderFollowerStoreIngestionTaskTest:

  • testReportIfCatchUpVersionTopicOffsetEvictsCacheBeforeLagCheckOnSubscribe — uses Mockito InOrder to assert invalidatePartitionPositionCache is called before measureLagWithCallToPubSub on the SUBSCRIBE-time path.
  • testReportIfCatchUpVersionTopicOffsetSkipsCacheEvictionOnPerRecordPath — asserts invalidatePartitionPositionCache is never called when forceCacheRefresh=false (the per-record path). Locks down the perf protection.
  • testReportIfCatchUpVersionTopicOffsetGateBlocksWhenLeaderNotCompleted — gate ON, hybrid follower, LEADER_NOT_COMPLETED → lagHasCaughtUp NOT called.
  • testReportIfCatchUpVersionTopicOffsetGateAllowsWhenLeaderCompleteRecent — gate ON, hybrid follower, recent LEADER_COMPLETED → lagHasCaughtUp called.
  • testReportIfCatchUpVersionTopicOffsetGateDisabledRestoresOldBehavior — gate OFF (default) → pre-existing behavior, lagHasCaughtUp called even with LEADER_NOT_COMPLETED.
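For readers unfamiliar with Mockito's InOrder, the eviction-ordering assertion in the first test amounts to this pattern (hand-rolled call recording for the sketch; the real test uses Mockito's InOrder verification):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the invalidate-before-measureLag ordering check. The real test
// verifies this with Mockito InOrder on the mocked TopicManager and task.
public class CallOrderSketch {
    public static boolean evictionBeforeMeasure(List<String> calls) {
        int evict = calls.indexOf("invalidatePartitionPositionCache");
        int measure = calls.indexOf("measureLagWithCallToPubSub");
        return evict >= 0 && measure >= 0 && evict < measure;
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        calls.add("invalidatePartitionPositionCache"); // SUBSCRIBE-time path evicts first
        calls.add("measureLagWithCallToPubSub");       // then measures with a fresh fetch
        System.out.println(evictionBeforeMeasure(calls)); // true
    }
}
```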

Existing testReportIfCatchUpVersionTopicOffset in StoreIngestionTaskTest continues to pass — it asserts the latch-state preconditions, which this fix does not change.

Verified locally: :clients:da-vinci-client:test --tests testReportIfCatchUpVersionTopicOffset* passes; testRecordLevelMetricForCurrentVersion (the canary that broke when the gate defaulted ON) also passes with the gate default-OFF.

Out of scope

Copilot AI review requested due to automatic review settings May 7, 2026 19:05

Copilot AI left a comment


Pull request overview

Fixes a post-blob-transfer READY_TO_SERVE (RTS) promotion regression in the server ingestion path by (a) avoiding stale version-topic end-position cache reads and (b) optionally requiring a recent leader-complete signal before allowing the “caught-up VT” RTS shortcut on hybrid followers.

Changes:

  • Added a TopicManager API to invalidate cached latest/earliest positions for a single topic-partition and used it before measuring VT lag.
  • Introduced server.require.leader.complete.for.catch.up.vt.rts (default true) and gated the RTS shortcut on a “leader-complete observed recently” check for hybrid followers.
  • Added unit tests covering cache eviction and leader-complete gating behaviors.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Files changed:

  • internal/venice-common/src/main/java/com/linkedin/venice/pubsub/manager/TopicManager.java: adds a per-partition cache invalidation method to force a fresh broker position fetch on the next read.
  • internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java: adds the config key and Javadoc for the new leader-complete gate.
  • clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java: evicts the per-partition position cache before lag measurement; adds the optional leader-complete recency gate and a helper method.
  • clients/da-vinci-client/src/main/java/com/linkedin/davinci/config/VeniceServerConfig.java: wires the new boolean config with default true and exposes an accessor.
  • clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTaskTest.java: adds unit tests for cache eviction and leader-complete gate enable/disable scenarios.


…der-complete gate default off

Three review-driven changes on top of the initial fix:

1. Cache invalidation is now scoped to the SUBSCRIBE-time path only.
   reportIfCatchUpVersionTopicOffset gains a forceCacheRefresh flag —
   true at validateAndSubscribePartition (one-time-per-partition, broker
   round-trip is bounded), false at the per-record call site (the latch-
   held catch-up window can hold thousands of records and a forced
   refresh on each would be a serious regression). Closes Copilot's
   per-record performance comment.

2. Leader-complete gate default flipped to OFF (opt-in). Many existing
   tests exercise the catch-up VT RTS path without simulating a leader-
   complete heartbeat; with the gate ON by default they hang at the
   metric-tier handoff (testRecordLevelMetricForCurrentVersion[0](false)
   was the canary — same wedge across all four SIT subclasses). The
   pre-existing un-gated behavior is the long-standing production
   default, so default-off preserves it; operators who hit the post-
   blob-transfer regression should turn the gate on per cluster.

3. Tests cover both branches now: a new
   testReportIfCatchUpVersionTopicOffsetSkipsCacheEvictionOnPerRecordPath
   asserts no invalidate fires when forceCacheRefresh=false, and the
   existing eviction test now uses Mockito InOrder to assert
   invalidate-before-measureLag explicitly (Copilot's ordering comment).
   Javadoc on SERVER_REQUIRE_LEADER_COMPLETE_FOR_CATCH_UP_VT_RTS
   reworded to refer to the Replica.State *dimension* on the metric
   Venice.Server.Ingestion.Replication.Record.Delay rather than the
   metric itself (Copilot's wording comment).
