[controller] Recheck topic truncation in skipMigratingVersion to absorb broker-metadata race by misyel · Pull Request #2764 · linkedin/venice

misyel · 2026-04-29T00:01:54Z

Problem Statement

During Venice store migration, the parent admin channel mirrors addVersion messages to both the source and destination clusters so the destination can ingest in lockstep. The destination-side child controller calls VeniceHelixAdmin.skipMigratingVersion to decide whether to honor the addVersion. The current check short-circuits as follows:

boolean topicExists = getTopicManager().containsTopicWithRetries(versionTopic, 5);
if (!topicExists || isTopicTruncated(versionTopic.getName())) {
  // ERROR log + return true (skip the addVersion)
}

isTopicTruncated defensively returns true whenever it catches PubSubTopicDoesNotExistException (TopicManager.java ~ line 527). That exception is thrown transiently right after topic creation: containsTopic already sees the topic via listTopics metadata, while describeConfigs (used by getTopicConfig / getTopicRetention) momentarily returns UNKNOWN_TOPIC_OR_PARTITION. The destination-side mirror is then dropped silently (a single pub_sub_admin_op_failure_count is recorded; the companion log is at DEBUG), and the destination cluster permanently lacks the version even though the source cluster has the topic and the version is healthy.

This was observed in a real prod migration: the destination cluster was stuck at current=4 in one fabric while the source had current=5 and the same fabric's destination cluster in the other two fabrics had successfully mirrored to v5. The skip log was:

ERROR Skip adding version: 5 for store: <store> in cluster: <dest> because the corresponding VT is truncated and version status is STARTED or topic doesn't exist : true

topicExists=true while isTopicTruncated=true — the contradiction is the signature of the race. No Updated topic ... with retention.ms log fired against the version topic in that window, confirming the topic was never actually truncated.

Solution

Split the existing short-circuit so the topic-missing case is handled explicitly with a dedicated log, then re-check isTopicTruncated up to 3 attempts with exponential backoff (500ms / 1000ms / max 2000ms) when it reports truncated. On each attempt:

If isTopicTruncated returns false → not truncated, return immediately.
If isTopicTruncated returns true and containsTopicWithRetries confirms the topic vanished → accept the truncated result without further retries.
If isTopicTruncated returns true but the topic still exists → treat as a transient broker-metadata race, back off and retry.
After max attempts with stable truncated == true → treat as genuinely truncated.

This preserves the original protective intent (genuinely truncated topics still skip) while making the destination-side mirror robust against the broker-metadata race. Worst case adds 1.5s on a rare race; typical case is unchanged. The fix is scoped to skipMigratingVersion rather than the shared TopicManager.isTopicTruncated method to keep blast radius small (other callers of isTopicTruncated may rely on the existing fast-fail semantics for already-truncated topics).

Code changes

Added new code behind a config. No config; new constants are class-private with sensible defaults (MIGRATION_TRUNCATION_RECHECK_MAX_ATTEMPTS=3, ..._INITIAL_BACKOFF_MS=500, ..._MAX_BACKOFF_MS=2000).
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging. Logs only fire on the skip path (existing rate); replaced one ERROR with two distinct ERROR/WARN, no new per-iteration logging.

Concurrency-Specific Checks

Code has no race conditions or thread safety issues. Method-local state only; no shared mutable state introduced.
Proper synchronization mechanisms are used where needed. Not applicable — no shared state.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation. Method does not hold locks; the existing caller path already tolerates blocking I/O against the broker.
Verified thread-safe collections are used. No new collections.
Validated proper exception handling in multi-threaded code. InterruptedException from Thread.sleep is propagated by re-setting the interrupt flag and conservatively returning true.

How was this PR tested?

New unit tests added. (4 new tests in TestVeniceHelixAdminWithoutCluster)
- isVersionTopicTruncatedWithRecheckReturnsFalseWhenNotTruncated
- isVersionTopicTruncatedWithRecheckReturnsTrueWhenTopicVanished
- isVersionTopicTruncatedWithRecheckReturnsFalseAfterTransientRace
- isVersionTopicTruncatedWithRecheckReturnsTrueAfterMaxAttempts
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility. Behavior is unchanged when the topic is genuinely truncated or genuinely missing; only the transient-race path now recovers instead of silently skipping.

Local test run:

```
BUILD SUCCESSFUL
12 tests in TestVeniceHelixAdminWithoutCluster passed (4 new + 8 existing)
```

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.

broker-metadata race When the parent admin channel mirrors an addVersion message to the destination cluster during store migration, the destination-side child controller calls skipMigratingVersion to decide whether to honor the addVersion. The prior check if (!topicExists || isTopicTruncated(versionTopic.getName())) { ... } silently skips when isTopicTruncated returns true. isTopicTruncated catches PubSubTopicDoesNotExistException and defensively returns true, but this exception can be thrown transiently right after topic creation: containsTopic already sees the topic (via listTopics metadata), while describeConfigs (used by getTopicConfig) momentarily returns UNKNOWN_TOPIC_OR_PARTITION. The destination-side mirror is then dropped and the destination cluster permanently lacks the version, even though the source cluster has the topic and the version is healthy. This was observed in a real migration where a single pub_sub_admin_op_failure_count was recorded on the destination child controller in the same minute as the addVersion processing, with no actual truncation log firing for the version topic. Fix: split the existing short-circuit so the topic-missing case is handled explicitly, and recheck isTopicTruncated up to 3 attempts with exponential backoff (500ms / 1000ms / max 2000ms) when it reports truncated. If containsTopicWithRetries confirms the topic still exists across attempts, the positive truncation result is treated as a transient race and we retry; if the topic genuinely vanished we accept the result without further retries. Add 4 unit tests covering: not-truncated short-circuit, topic-vanished early acceptance, transient race recovery, and exhaustion of retries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Improves destination child controller behavior during store migration by making the “version topic truncated” check resilient to transient broker-metadata races that can occur immediately after topic creation, preventing erroneous skipping of mirrored addVersion events.

Changes:

Split the “topic missing vs truncated” skip logic in skipMigratingVersion and add a truncation re-check with bounded exponential backoff.
Add isVersionTopicTruncatedWithRecheck helper and test-visible constants to control attempts/backoff.
Add 4 unit tests covering non-truncated, vanished-topic, transient-race recovery, and max-attempts behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java	Adds truncation re-check logic (with backoff) and separates missing-topic logging to avoid false skips during migration mirroring.
services/venice-controller/src/test/java/com/linkedin/venice/controller/TestVeniceHelixAdminWithoutCluster.java	Adds unit tests validating the new truncation re-check behavior across key scenarios.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ranwang2024

LGTM to ship. Reviewed the lock context flagged by Copilot — skipMigratingVersion does run under the cluster read lock + store write lock at VeniceHelixAdmin.java:3193-3194, so the recheck loop can hold those for ~1.5–3s in the worst case. Two reasons this is acceptable:

The path is double-gated by store.isMigrating() and isTopicTruncated == true on the first attempt — rare in absolute terms.
The alternative is the silent permanent data divergence on the destination cluster that this PR is fixing, which is strictly worse than a few seconds of admin-channel back-pressure during a migration race.

One follow-up ask, non-blocking: please add a counter or WARN log when the recheck loop exhausts MIGRATION_TRUNCATION_RECHECK_MAX_ATTEMPTS with truncated == true && topic still exists.
That's the only way to detect in production if the broker-metadata race ever outlives the 3.5s budget — without it the original bug recurs silently and is just as hard to find as before. Happy for this to be a follow-up PR.

@VisibleForTesting

- Add WARN log when migration truncation recheck exhausts max attempts with topic still existing, to detect if the broker-metadata race ever outlives the recheck budget in production. - Replace `// Visible for testing.` comments with @VisibleForTesting annotations for consistency with the rest of VeniceHelixAdmin. - Fix typo: "slown" -> "slow". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 29, 2026 00:01

Copilot started reviewing on behalf of misyel April 29, 2026 00:02 View session

Copilot AI reviewed Apr 29, 2026

View reviewed changes

ranwang2024 previously approved these changes Apr 29, 2026

View reviewed changes

misyel dismissed ranwang2024’s stale review via 4301b9c May 4, 2026 17:13

ranwang2024 approved these changes May 4, 2026

View reviewed changes

misyel merged commit bf13c9a into linkedin:main May 4, 2026
106 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[controller] Recheck topic truncation in skipMigratingVersion to absorb broker-metadata race#2764

[controller] Recheck topic truncation in skipMigratingVersion to absorb broker-metadata race#2764
misyel merged 2 commits intolinkedin:mainfrom
misyel:mkwong/fix-skip-migrating-version-truncation-race

misyel commented Apr 29, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ranwang2024 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

misyel commented Apr 29, 2026

Problem Statement

Solution

Code changes

Concurrency-Specific Checks

How was this PR tested?

Does this PR introduce any user-facing or breaking changes?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ranwang2024 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants