Skip to content

[controller] Auto-initialize ZK admin protocol version and enable detection service by default#2769

Open
minhmo1620 wants to merge 3 commits intolinkedin:mainfrom
minhmo1620:mn/controller/auto-detect-protocol-version-undefined-zk
Open

[controller] Auto-initialize ZK admin protocol version and enable detection service by default#2769
minhmo1620 wants to merge 3 commits intolinkedin:mainfrom
minhmo1620:mn/controller/auto-detect-protocol-version-undefined-zk

Conversation

@minhmo1620
Copy link
Copy Markdown
Contributor

Problem Statement

ProtocolVersionAutoDetectionService is a parent-controller background task that reconciles the admin operation protocol version stored in ZK with the smallest version reported by all controllers (parent + child) in a cluster.

Two issues today:

  1. The service is off by default. controller.protocol.version.auto.detection.service.enabled defaulted to false, so parent controllers needed an explicit opt-in to run the service at all. The service has been running for 6 months+ without issue, it's safe to enable it by default.

  2. Remove the required step to init version for cluster The previous predicate skipped the update whenever the upstream value in ZK was -1 (undefined).

if (upstreamVersion != -1 && currentGoodVersion != Long.MAX_VALUE
    && !Objects.equals(currentGoodVersion, upstreamVersion)) {
  admin.updateAdminOperationProtocolVersion(clusterName, currentGoodVersion);
}

We will need to manually init the version by using admin-tool. With this new change, the service will auto register the version.

Solution

  1. [controller] Enable protocol version auto-detection service by default

    • Flip CONTROLLER_PROTOCOL_VERSION_AUTO_DETECTION_SERVICE_ENABLED default from false to true in VeniceControllerClusterConfig.
    • Updated the config-key Javadoc in ConfigKeys.java to match the new default.
  2. [controller] Initialize ZK protocol version when undefined (-1)

    • Drop the upstreamVersion != -1 short-circuit in ProtocolVersionAutoDetectionService. The remaining currentGoodVersion != Long.MAX_VALUE guard still handles the empty-controllers edge case, and the !Objects.equals(currentGoodVersion, upstreamVersion) check now distinguishes "needs update" from "already in sync". When ZK is -1, the inequality is true and the service initializes ZK to the current good version on the next tick.
    • Updated the class-level Javadoc to remove the obsolete "set ZK to -1 to disable step 3" note.
    • Fixed a stale Javadoc reference in VeniceParentHelixAdmin#getWriterSchemaIdFromZK (/adminTopicMetadata -> /adminTopicMetadataV2) noticed while tracing this code path.

Code changes

  • Added new code behind a config. — Default of controller.protocol.version.auto.detection.service.enabled flipped from false -> true. Listed in the description above.
  • Introduced new log lines. — N/A; existing log lines retained.

Concurrency-Specific Checks

  • Code has no race conditions or thread safety issues. — Single-threaded scheduled executor; only the predicate and a default config value change, no new shared state.
  • Proper synchronization mechanisms are used where needed. — N/A.
  • No blocking calls inside critical sections.
  • Verified thread-safe collections are used.
  • Validated proper exception handling in multi-threaded code.

How was this PR tested?

  • Modified or extended existing tests.
    • Renamed testProtocolVersionDetectionWithNoUpdate -> testProtocolVersionDetectionWithMatchingUpstreamVersion and adjusted the setup to use a matching upstream (1L) so the no-update assertion remains meaningful under the new behavior.
    • Added testProtocolVersionDetectionUpdatesWhenUpstreamIsUndefined to cover the new behavior: when ZK has no recorded version, the service must write currentGoodVersion to ZK.
  • Verified with: ./gradlew :services:venice-controller:test --tests "com.linkedin.venice.controller.TestProtocolVersionAutoDetectionService" — 5 tests pass.

Does this PR introduce any user-facing or breaking changes?

  • No. The usage has been running for a while. This is a clean up PR.

Minh Nguyen added 3 commits April 30, 2026 15:35
ProtocolVersionAutoDetectionService previously skipped updating the
admin operation protocol version in ZK when the upstream value was the
sentinel -1 (undefined). That left fresh clusters stuck without a
recorded version, since ZK is only ever initialized when this service
writes the smallest good version observed across controllers. Drop the
-1 short-circuit so the service initializes ZK on first detection and
keeps it in sync thereafter.

Also fix a stale Javadoc reference (/adminTopicMetadata ->
/adminTopicMetadataV2) in VeniceParentHelixAdmin#getWriterSchemaIdFromZK.
Now that the auto-detection service correctly initializes ZK from the
undefined sentinel (-1), flip the default of
controller.protocol.version.auto.detection.service.enabled from false
to true so parent controllers run it without requiring an explicit
opt-in. Updated the Javadoc on the config key accordingly.
… test

CONTROLLER_PROTOCOL_VERSION_AUTO_DETECTION_SERVICE_ENABLED now defaults
to true, so TestProtocolVersionAutoDetection no longer needs to set it
explicitly. Keep the SLEEP_MS override so the test still observes
detection within its 30s timeout.
@minhmo1620 minhmo1620 changed the title [controller] Initialize ZK protocol version when undefined (-1) [controller] Auto-initialize ZK admin protocol version and enable detection service by default Apr 30, 2026
@minhmo1620 minhmo1620 marked this pull request as ready for review May 1, 2026 19:41
Copilot AI review requested due to automatic review settings May 1, 2026 19:41
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the Venice parent controller’s protocol version auto-detection behavior to be enabled by default and to auto-initialize the admin operation protocol version in ZK when it is currently undefined.

Changes:

  • Enable controller.protocol.version.auto.detection.service.enabled by default in controller cluster config (and update the config key Javadoc accordingly).
  • Update ProtocolVersionAutoDetectionService to write the detected “current good” protocol version to ZK even when the upstream ZK value is the undefined sentinel -1.
  • Update/extend unit + integration tests and fix a stale ZK path reference in VeniceParentHelixAdmin Javadoc.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceControllerClusterConfig.java Flips the default to enable the auto-detection service.
internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java Updates config-key Javadoc to reflect the new default behavior.
services/venice-controller/src/main/java/com/linkedin/venice/controller/ProtocolVersionAutoDetectionService.java Removes the upstreamVersion != -1 short-circuit so ZK can be initialized from undefined.
services/venice-controller/src/test/java/com/linkedin/venice/controller/TestProtocolVersionAutoDetectionService.java Renames/adjusts the “no update” test and adds coverage for initializing from upstream -1.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/TestProtocolVersionAutoDetection.java Removes explicit enablement since the service is now enabled by default.
services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java Fixes a stale Javadoc reference to the admin topic metadata ZK path.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants