[server][dvc] Fix post-blob RTS regression in reportIfCatchUpVersionTopicOffset (cache eviction + leader-complete gate)#2779
Open
sushantmane wants to merge 2 commits intolinkedin:mainfrom
Conversation
reportIfCatchUpVersionTopicOffset
Two narrow fixes for a regression observed on a hybrid AA store after Helix
LOAD_REBALANCE
+ blob-transfer bootstrap, where post-blob followers were marked
READY_TO_SERVE before any
leader-complete heartbeat arrived. The replicas then served reads while the
per-record
Venice.Server.Ingestion.Replication.Record.Delay OTel metric tagged records
ready_to_serve
with stale producer timestamps from the donor's pre-bootstrap RT replay
window, polluting
the SLI dashboard.
Root cause was reportIfCatchUpVersionTopicOffset
(LeaderFollowerStoreIngestionTask):
- Reads the live VT end-position via TopicManager.getLatestPositionCached,
which can return
a value up to server.source.topic.offset.check.interval.ms stale (default
60s). On a fresh
post-blob replica that has not seen any read traffic, the cache often had
no entry or a
pre-bootstrap value, so lag <= 0 evaluated as "caught up" while records
were actually
pending on the wire.
- After confirming lag <= 0, the method calls lagHasCaughtUp +
reportCompleted(force=true)
WITHOUT any leader-complete gating on hybrid followers.
Fixes in this PR:
1. Cache eviction in reportIfCatchUpVersionTopicOffset
- Expose
TopicManager.invalidatePartitionPositionCache(PubSubTopicPartition);
previously
this was only reachable as the heavy invalidateCache(PubSubTopic) which
evicts every
partition of the topic.
- reportIfCatchUpVersionTopicOffset invalidates the partition's cache
entry before
measureLagWithCallToPubSub. The next read does a fresh broker
round-trip; cost is
bounded because this method only runs while the latch is held
(one-time-per-partition
transition window).
- Closes the stale-cache false-positive case, which was the primary
contributor to the
observed dashboard spike.
2. Optional leader-complete gate
- New config server.require.leader.complete.for.catch.up.vt.rts (default
true).
- When enabled (default): reportIfCatchUpVersionTopicOffset on a hybrid
follower
additionally requires the most recent leader-complete heartbeat header
to be observed
within server.leader.complete.state.check.in.follower.valid.interval.ms
(default 5min)
— same window the existing checkAndLogIfLagIsAcceptableForHybridStore us
es.
- When disabled: pre-existing relax-completion behavior is restored.
Operators who hit
the original Helix-rebalance edge case the comment in this method
describes (no leader
exists to send leader-complete heartbeats; cluster ends up with zero
online replicas)
can flip the flag to revert.
- Closes the genuinely-quiet-VT case where fresh-fetch confirms lag <= 0
but no leader-
complete signal has arrived yet.
Tests: four new LeaderFollowerStoreIngestionTaskTest cases:
- testReportIfCatchUpVersionTopicOffsetEvictsCacheBeforeLagCheck
- testReportIfCatchUpVersionTopicOffsetGateBlocksWhenLeaderNotCompleted
- testReportIfCatchUpVersionTopicOffsetGateAllowsWhenLeaderCompleteRecent
- testReportIfCatchUpVersionTopicOffsetGateDisabledRestoresOldBehavior
Existing testReportIfCatchUpVersionTopicOffset in StoreIngestionTaskTest
continues to pass
(it asserts the latch-state preconditions that this fix does not change).
There was a problem hiding this comment.
Pull request overview
Fixes a post-blob-transfer READY_TO_SERVE (RTS) promotion regression in the server ingestion path by (a) avoiding stale version-topic end-position cache reads and (b) optionally requiring a recent leader-complete signal before allowing the “caught-up VT” RTS shortcut on hybrid followers.
Changes:
- Added a TopicManager API to invalidate cached latest/earliest positions for a single topic-partition and used it before measuring VT lag.
- Introduced
server.require.leader.complete.for.catch.up.vt.rts(defaulttrue) and gated the RTS shortcut on a “leader-complete observed recently” check for hybrid followers. - Added unit tests covering cache eviction and leader-complete gating behaviors.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/venice-common/src/main/java/com/linkedin/venice/pubsub/manager/TopicManager.java | Adds a per-partition cache invalidation method to force fresh broker position fetch on next read. |
| internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java | Adds config key + Javadoc for the new leader-complete gate. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java | Evicts per-partition position cache before lag measurement; adds optional leader-complete recency gate and helper method. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/config/VeniceServerConfig.java | Wires the new boolean config with default true and exposes an accessor. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTaskTest.java | Adds unit tests for cache eviction and leader-complete gate enable/disable scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…der-complete gate default off Three review-driven changes on top of the initial fix: 1. Cache invalidation is now scoped to the SUBSCRIBE-time path only. reportIfCatchUpVersionTopicOffset gains a forceCacheRefresh flag — true at validateAndSubscribePartition (one-time-per-partition, broker round-trip is bounded), false at the per-record call site (the latch- held catch-up window can hold thousands of records and a forced refresh on each would be a serious regression). Closes Copilot's per-record performance comment. 2. Leader-complete gate default flipped to OFF (opt-in). Many existing tests exercise the catch-up VT RTS path without simulating a leader- complete heartbeat; with the gate ON by default they hang at the metric-tier handoff (testRecordLevelMetricForCurrentVersion[0](false) was the canary — same wedge across all four SIT subclasses). The pre-existing un-gated behavior is the long-standing production default, so default-off preserves it; operators who hit the post- blob-transfer regression should turn the gate on per cluster. 3. Tests cover both branches now: a new testReportIfCatchUpVersionTopicOffsetSkipsCacheEvictionOnPerRecordPath asserts no invalidate fires when forceCacheRefresh=false, and the existing eviction test now uses Mockito InOrder to assert invalidate-before-measureLag explicitly (Copilot's ordering comment). Javadoc on SERVER_REQUIRE_LEADER_COMPLETE_FOR_CATCH_UP_VT_RTS reworded to refer to the Replica.State *dimension* on the metric Venice.Server.Ingestion.Replication.Record.Delay rather than the metric itself (Copilot's wording comment).
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two narrow fixes for a regression observed on a hybrid AA store after Helix
LOAD_REBALANCE+ blob-transfer bootstrap, where post-blob followers were marked READY_TO_SERVE before any leader-complete heartbeat arrived. The replicas then served reads while the per-recordVenice.Server.Ingestion.Replication.Record.DelayOTel metric tagged recordsready_to_servewith stale producer timestamps from the donor's pre-bootstrap RT replay window, polluting the SLI dashboard.Root cause
Bug location:
LeaderFollowerStoreIngestionTask.reportIfCatchUpVersionTopicOffset.It reads the live VT end-position via
TopicManager.getLatestPositionCached, which can return a value up toserver.source.topic.offset.check.interval.msstale (default 60s). On a fresh post-blob replica that has not seen any read traffic, the cache often had no entry or a pre-bootstrap value, solag <= 0evaluated as "caught up" while records were actually pending on the wire.After confirming
lag <= 0, the method callspcs.lagHasCaughtUp() + reportCompleted(force=true)without any leader-complete gating on hybrid followers.Captured timeline on one partition (sub-second precision, single contiguous JVM lifetime — no crash):
Fixes in this PR
1. Cache eviction before lag measurement (SUBSCRIBE-time path only)
TopicManager.invalidatePartitionPositionCache(PubSubTopicPartition). Previously this was only reachable as the heavyinvalidateCache(PubSubTopic)which evicts every partition of the topic.reportIfCatchUpVersionTopicOffsetgains aforceCacheRefreshparameter:validateAndSubscribePartition) passestrue→ invalidates the partition's cache entry beforemeasureLagWithCallToPubSubso the next read is a fresh broker round-trip. One-time-per-partition; cost is bounded.processConsumerRecord) passesfalse→ reuses the cached value. The latch-held catch-up window can hold thousands of records; forcing a broker round-trip per record would be a serious regression.2. Optional leader-complete gate
server.require.leader.complete.for.catch.up.vt.rts(defaultfalse— opt-in).reportIfCatchUpVersionTopicOffseton a hybrid follower additionally requires the most recent leader-complete heartbeat header to be observed withinserver.leader.complete.state.check.in.follower.valid.interval.ms(default 5 min) — same window the existingcheckAndLogIfLagIsAcceptableForHybridStoreuses.lag <= 0but no leader-complete signal has arrived yet. Operators who hit the post-blob-transfer regression flip this on per cluster.Why this is a different bug than #2715
#2715(already merged) clears thepreviouslyReadyToServeflag in-memory incompletePostTransferPSCUpdated, which gates the fast-RTS path (checkFastReadyToServeForReplica→checkFastReadyToServeWithPreviousTimeLag). That fix works correctly — verified empirically on the running version (no"will mark the replica ready-to-serve directly"log fires post-blob).reportIfCatchUpVersionTopicOffsetis a separate RTS-promotion path that fires invalidateAndSubscribePartitionimmediately aftercheckConsumptionStateWhenStart. TheReported CATCH_UP_BASE_TOPIC_OFFSET_LAGlog is the unique signature that confirms this method (and not fast-RTS) was the path that fired in the captured incident.Testing Done
Five new unit tests in
LeaderFollowerStoreIngestionTaskTest:testReportIfCatchUpVersionTopicOffsetEvictsCacheBeforeLagCheckOnSubscribe— uses MockitoInOrderto assertinvalidatePartitionPositionCacheis called beforemeasureLagWithCallToPubSubon the SUBSCRIBE-time path.testReportIfCatchUpVersionTopicOffsetSkipsCacheEvictionOnPerRecordPath— assertsinvalidatePartitionPositionCacheis never called whenforceCacheRefresh=false(the per-record path). Locks down the perf protection.testReportIfCatchUpVersionTopicOffsetGateBlocksWhenLeaderNotCompleted— gate ON, hybrid follower,LEADER_NOT_COMPLETED→lagHasCaughtUpNOT called.testReportIfCatchUpVersionTopicOffsetGateAllowsWhenLeaderCompleteRecent— gate ON, hybrid follower, recentLEADER_COMPLETED→lagHasCaughtUpcalled.testReportIfCatchUpVersionTopicOffsetGateDisabledRestoresOldBehavior— gate OFF (default) → pre-existing behavior,lagHasCaughtUpcalled even withLEADER_NOT_COMPLETED.Existing
testReportIfCatchUpVersionTopicOffsetinStoreIngestionTaskTestcontinues to pass — it asserts the latch-state preconditions, which this fix does not change.Verified locally:
:clients:da-vinci-client:test --tests testReportIfCatchUpVersionTopicOffset*passes;testRecordLevelMetricForCurrentVersion(the canary that broke when the gate defaulted ON) also passes with the gate default-OFF.Out of scope
clearInheritedDonorState, etc.) — covered separately by [server][dvc] Invalidate inherited fast-RTS inputs after blob transfer and persist before logging #2773.