[controller] Recheck topic truncation in skipMigratingVersion to absorb broker-metadata race#2764
Conversation
broker-metadata race
When the parent admin channel mirrors an addVersion message to the
destination cluster
during store migration, the destination-side child controller calls
skipMigratingVersion to decide whether to honor the addVersion. The prior
check
if (!topicExists || isTopicTruncated(versionTopic.getName())) { ... }
silently skips when isTopicTruncated returns true. isTopicTruncated catches
PubSubTopicDoesNotExistException and defensively returns true, but this
exception can be thrown transiently right after topic creation: containsTopic
already sees the topic (via listTopics metadata), while describeConfigs (used
by getTopicConfig) momentarily returns UNKNOWN_TOPIC_OR_PARTITION. The
destination-side mirror is then dropped and the destination cluster
permanently
lacks the version, even though the source cluster has the topic and the
version
is healthy.
This was observed in a real migration where a single
pub_sub_admin_op_failure_count was recorded on the destination child
controller
in the same minute as the addVersion processing, with no actual truncation log
firing for the version topic.
Fix: split the existing short-circuit so the topic-missing case is handled
explicitly, and recheck isTopicTruncated up to 3 attempts with exponential
backoff (500ms / 1000ms / max 2000ms) when it reports truncated. If
containsTopicWithRetries confirms the topic still exists across attempts, the
positive truncation result is treated as a transient race and we retry; if the
topic genuinely vanished we accept the result without further retries.
Add 4 unit tests covering: not-truncated short-circuit, topic-vanished early
acceptance, transient race recovery, and exhaustion of retries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Improves destination child controller behavior during store migration by making the “version topic truncated” check resilient to transient broker-metadata races that can occur immediately after topic creation, preventing erroneous skipping of mirrored addVersion events.
Changes:
- Split the “topic missing vs truncated” skip logic in
skipMigratingVersionand add a truncation re-check with bounded exponential backoff. - Add
isVersionTopicTruncatedWithRecheckhelper and test-visible constants to control attempts/backoff. - Add 4 unit tests covering non-truncated, vanished-topic, transient-race recovery, and max-attempts behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java | Adds truncation re-check logic (with backoff) and separates missing-topic logging to avoid false skips during migration mirroring. |
| services/venice-controller/src/test/java/com/linkedin/venice/controller/TestVeniceHelixAdminWithoutCluster.java | Adds unit tests validating the new truncation re-check behavior across key scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ranwang2024
left a comment
There was a problem hiding this comment.
LGTM to ship. Reviewed the lock context flagged by Copilot — skipMigratingVersion does run under the cluster read lock + store write lock at VeniceHelixAdmin.java:3193-3194, so the recheck loop can hold those for ~1.5–3s in the worst case. Two reasons this is acceptable:
- The path is double-gated by store.isMigrating() and isTopicTruncated == true on the first attempt — rare in absolute terms.
- The alternative is the silent permanent data divergence on the destination cluster that this PR is fixing, which is strictly worse than a few seconds of admin-channel back-pressure during a migration race.
One follow-up ask, non-blocking: please add a counter or WARN log when the recheck loop exhausts MIGRATION_TRUNCATION_RECHECK_MAX_ATTEMPTS with truncated == true && topic still exists.
That's the only way to detect in production if the broker-metadata race ever outlives the 3.5s budget — without it the original bug recurs silently and is just as hard to find as before. Happy for this to be a follow-up PR.
- Add WARN log when migration truncation recheck exhausts max attempts with topic still existing, to detect if the broker-metadata race ever outlives the recheck budget in production. - Replace `// Visible for testing.` comments with @VisibleForTesting annotations for consistency with the rest of VeniceHelixAdmin. - Fix typo: "slown" -> "slow". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Problem Statement
During Venice store migration, the parent admin channel mirrors
addVersionmessages to both the source and destination clusters so the destination can ingest in lockstep. The destination-side child controller callsVeniceHelixAdmin.skipMigratingVersionto decide whether to honor the addVersion. The current check short-circuits as follows:isTopicTruncateddefensively returnstruewhenever it catchesPubSubTopicDoesNotExistException(TopicManager.java~ line 527). That exception is thrown transiently right after topic creation:containsTopicalready sees the topic vialistTopicsmetadata, whiledescribeConfigs(used bygetTopicConfig/getTopicRetention) momentarily returnsUNKNOWN_TOPIC_OR_PARTITION. The destination-side mirror is then dropped silently (a singlepub_sub_admin_op_failure_countis recorded; the companion log is at DEBUG), and the destination cluster permanently lacks the version even though the source cluster has the topic and the version is healthy.This was observed in a real prod migration: the destination cluster was stuck at
current=4in one fabric while the source hadcurrent=5and the same fabric's destination cluster in the other two fabrics had successfully mirrored tov5. The skip log was:topicExists=truewhileisTopicTruncated=true— the contradiction is the signature of the race. NoUpdated topic ... with retention.mslog fired against the version topic in that window, confirming the topic was never actually truncated.Solution
Split the existing short-circuit so the topic-missing case is handled explicitly with a dedicated log, then re-check
isTopicTruncatedup to 3 attempts with exponential backoff (500ms / 1000ms / max 2000ms) when it reports truncated. On each attempt:isTopicTruncatedreturnsfalse→ not truncated, return immediately.isTopicTruncatedreturnstrueandcontainsTopicWithRetriesconfirms the topic vanished → accept the truncated result without further retries.isTopicTruncatedreturnstruebut the topic still exists → treat as a transient broker-metadata race, back off and retry.This preserves the original protective intent (genuinely truncated topics still skip) while making the destination-side mirror robust against the broker-metadata race. Worst case adds 1.5s on a rare race; typical case is unchanged. The fix is scoped to
skipMigratingVersionrather than the sharedTopicManager.isTopicTruncatedmethod to keep blast radius small (other callers ofisTopicTruncatedmay rely on the existing fast-fail semantics for already-truncated topics).Code changes
MIGRATION_TRUNCATION_RECHECK_MAX_ATTEMPTS=3,..._INITIAL_BACKOFF_MS=500,..._MAX_BACKOFF_MS=2000).Concurrency-Specific Checks
InterruptedExceptionfromThread.sleepis propagated by re-setting the interrupt flag and conservatively returningtrue.How was this PR tested?
TestVeniceHelixAdminWithoutCluster)isVersionTopicTruncatedWithRecheckReturnsFalseWhenNotTruncatedisVersionTopicTruncatedWithRecheckReturnsTrueWhenTopicVanishedisVersionTopicTruncatedWithRecheckReturnsFalseAfterTransientRaceisVersionTopicTruncatedWithRecheckReturnsTrueAfterMaxAttemptsLocal test run:
```
BUILD SUCCESSFUL
12 tests in TestVeniceHelixAdminWithoutCluster passed (4 new + 8 existing)
```
Does this PR introduce any user-facing or breaking changes?