Skip to content

[controller] Recheck topic truncation in skipMigratingVersion to absorb broker-metadata race#2764

Merged
misyel merged 2 commits intolinkedin:mainfrom
misyel:mkwong/fix-skip-migrating-version-truncation-race
May 4, 2026
Merged

[controller] Recheck topic truncation in skipMigratingVersion to absorb broker-metadata race#2764
misyel merged 2 commits intolinkedin:mainfrom
misyel:mkwong/fix-skip-migrating-version-truncation-race

Conversation

@misyel
Copy link
Copy Markdown
Contributor

@misyel misyel commented Apr 29, 2026

Problem Statement

During Venice store migration, the parent admin channel mirrors addVersion messages to both the source and destination clusters so the destination can ingest in lockstep. The destination-side child controller calls VeniceHelixAdmin.skipMigratingVersion to decide whether to honor the addVersion. The current check short-circuits as follows:

boolean topicExists = getTopicManager().containsTopicWithRetries(versionTopic, 5);
if (!topicExists || isTopicTruncated(versionTopic.getName())) {
  // ERROR log + return true (skip the addVersion)
}

isTopicTruncated defensively returns true whenever it catches PubSubTopicDoesNotExistException (TopicManager.java ~ line 527). That exception is thrown transiently right after topic creation: containsTopic already sees the topic via listTopics metadata, while describeConfigs (used by getTopicConfig / getTopicRetention) momentarily returns UNKNOWN_TOPIC_OR_PARTITION. The destination-side mirror is then dropped silently (a single pub_sub_admin_op_failure_count is recorded; the companion log is at DEBUG), and the destination cluster permanently lacks the version even though the source cluster has the topic and the version is healthy.

This was observed in a real prod migration: the destination cluster was stuck at current=4 in one fabric while the source had current=5 and the same fabric's destination cluster in the other two fabrics had successfully mirrored to v5. The skip log was:

ERROR Skip adding version: 5 for store: <store> in cluster: <dest> because the corresponding VT is truncated and version status is STARTED or topic doesn't exist : true

topicExists=true while isTopicTruncated=true — the contradiction is the signature of the race. No Updated topic ... with retention.ms log fired against the version topic in that window, confirming the topic was never actually truncated.

Solution

Split the existing short-circuit so the topic-missing case is handled explicitly with a dedicated log, then re-check isTopicTruncated up to 3 attempts with exponential backoff (500ms / 1000ms / max 2000ms) when it reports truncated. On each attempt:

  • If isTopicTruncated returns false → not truncated, return immediately.
  • If isTopicTruncated returns true and containsTopicWithRetries confirms the topic vanished → accept the truncated result without further retries.
  • If isTopicTruncated returns true but the topic still exists → treat as a transient broker-metadata race, back off and retry.
  • After max attempts with stable truncated == true → treat as genuinely truncated.

This preserves the original protective intent (genuinely truncated topics still skip) while making the destination-side mirror robust against the broker-metadata race. Worst case adds 1.5s on a rare race; typical case is unchanged. The fix is scoped to skipMigratingVersion rather than the shared TopicManager.isTopicTruncated method to keep blast radius small (other callers of isTopicTruncated may rely on the existing fast-fail semantics for already-truncated topics).

Code changes

  • Added new code behind a config. No config; new constants are class-private with sensible defaults (MIGRATION_TRUNCATION_RECHECK_MAX_ATTEMPTS=3, ..._INITIAL_BACKOFF_MS=500, ..._MAX_BACKOFF_MS=2000).
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging. Logs only fire on the skip path (existing rate); replaced one ERROR with two distinct ERROR/WARN, no new per-iteration logging.

Concurrency-Specific Checks

  • Code has no race conditions or thread safety issues. Method-local state only; no shared mutable state introduced.
  • Proper synchronization mechanisms are used where needed. Not applicable — no shared state.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation. Method does not hold locks; the existing caller path already tolerates blocking I/O against the broker.
  • Verified thread-safe collections are used. No new collections.
  • Validated proper exception handling in multi-threaded code. InterruptedException from Thread.sleep is propagated by re-setting the interrupt flag and conservatively returning true.

How was this PR tested?

  • New unit tests added. (4 new tests in TestVeniceHelixAdminWithoutCluster)
    • isVersionTopicTruncatedWithRecheckReturnsFalseWhenNotTruncated
    • isVersionTopicTruncatedWithRecheckReturnsTrueWhenTopicVanished
    • isVersionTopicTruncatedWithRecheckReturnsFalseAfterTransientRace
    • isVersionTopicTruncatedWithRecheckReturnsTrueAfterMaxAttempts
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility. Behavior is unchanged when the topic is genuinely truncated or genuinely missing; only the transient-race path now recovers instead of silently skipping.

Local test run:

```
BUILD SUCCESSFUL
12 tests in TestVeniceHelixAdminWithoutCluster passed (4 new + 8 existing)
```

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.

broker-metadata race

When the parent admin channel mirrors an addVersion message to the
destination cluster
during store migration, the destination-side child controller calls
skipMigratingVersion to decide whether to honor the addVersion. The prior
check

  if (!topicExists || isTopicTruncated(versionTopic.getName())) { ... }

silently skips when isTopicTruncated returns true. isTopicTruncated catches
PubSubTopicDoesNotExistException and defensively returns true, but this
exception can be thrown transiently right after topic creation: containsTopic
already sees the topic (via listTopics metadata), while describeConfigs (used
by getTopicConfig) momentarily returns UNKNOWN_TOPIC_OR_PARTITION. The
destination-side mirror is then dropped and the destination cluster
permanently
lacks the version, even though the source cluster has the topic and the
version
is healthy.

This was observed in a real migration where a single
pub_sub_admin_op_failure_count was recorded on the destination child
controller
in the same minute as the addVersion processing, with no actual truncation log
firing for the version topic.

Fix: split the existing short-circuit so the topic-missing case is handled
explicitly, and recheck isTopicTruncated up to 3 attempts with exponential
backoff (500ms / 1000ms / max 2000ms) when it reports truncated. If
containsTopicWithRetries confirms the topic still exists across attempts, the
positive truncation result is treated as a transient race and we retry; if the
topic genuinely vanished we accept the result without further retries.

Add 4 unit tests covering: not-truncated short-circuit, topic-vanished early
acceptance, transient race recovery, and exhaustion of retries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 29, 2026 00:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves destination child controller behavior during store migration by making the “version topic truncated” check resilient to transient broker-metadata races that can occur immediately after topic creation, preventing erroneous skipping of mirrored addVersion events.

Changes:

  • Split the “topic missing vs truncated” skip logic in skipMigratingVersion and add a truncation re-check with bounded exponential backoff.
  • Add isVersionTopicTruncatedWithRecheck helper and test-visible constants to control attempts/backoff.
  • Add 4 unit tests covering non-truncated, vanished-topic, transient-race recovery, and max-attempts behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceHelixAdmin.java Adds truncation re-check logic (with backoff) and separates missing-topic logging to avoid false skips during migration mirroring.
services/venice-controller/src/test/java/com/linkedin/venice/controller/TestVeniceHelixAdminWithoutCluster.java Adds unit tests validating the new truncation re-check behavior across key scenarios.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ranwang2024
ranwang2024 previously approved these changes Apr 29, 2026
Copy link
Copy Markdown
Contributor

@ranwang2024 ranwang2024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to ship. Reviewed the lock context flagged by Copilot — skipMigratingVersion does run under the cluster read lock + store write lock at VeniceHelixAdmin.java:3193-3194, so the recheck loop can hold those for ~1.5–3s in the worst case. Two reasons this is acceptable:

  1. The path is double-gated by store.isMigrating() and isTopicTruncated == true on the first attempt — rare in absolute terms.
  2. The alternative is the silent permanent data divergence on the destination cluster that this PR is fixing, which is strictly worse than a few seconds of admin-channel back-pressure during a migration race.

One follow-up ask, non-blocking: please add a counter or WARN log when the recheck loop exhausts MIGRATION_TRUNCATION_RECHECK_MAX_ATTEMPTS with truncated == true && topic still exists.
That's the only way to detect in production if the broker-metadata race ever outlives the 3.5s budget — without it the original bug recurs silently and is just as hard to find as before. Happy for this to be a follow-up PR.

- Add WARN log when migration truncation recheck exhausts max attempts
  with topic still existing, to detect if the broker-metadata race ever
  outlives the recheck budget in production.
- Replace `// Visible for testing.` comments with @VisibleForTesting
  annotations for consistency with the rest of VeniceHelixAdmin.
- Fix typo: "slown" -> "slow".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@misyel misyel merged commit bf13c9a into linkedin:main May 4, 2026
106 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants