Skip to content

[da-vinci][server] Add OTel store.version ASYNC_GAUGE for current/future version numbers#2722

Merged
m-nagarajan merged 8 commits intolinkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForVeniceVersionedStats
May 5, 2026
Merged

[da-vinci][server] Add OTel store.version ASYNC_GAUGE for current/future version numbers#2722
m-nagarajan merged 8 commits intolinkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForVeniceVersionedStats

Conversation

@m-nagarajan
Copy link
Copy Markdown
Contributor

@m-nagarajan m-nagarajan commented Apr 11, 2026

Problem Statement

Venice stores track which version number is current and which is future for each store via Tehuti current_version and future_version AsyncGauge sensors in VeniceVersionedStatsReporter. These metrics have no OTel counterpart, making them unavailable in OTel-based monitoring dashboards.

Solution

Add StoreVersionOtelStats — a standalone StoreDataChangedListener that registers OTel ASYNC_GAUGE callbacks per store for current and future version numbers, with VERSION_ROLE dimension (CURRENT/FUTURE). Created via StoreVersionOtelStats.create() factory method in both VeniceServer and DaVinciBackend.

Design choice: A standalone listener was chosen over adding OTel to VeniceVersionedStatsReporter because each store has 6 VeniceVersionedStatsReporter instances (one per AbstractVeniceAggVersionedStats subclass). Putting OTel in the reporter would create 6 duplicate ASYNC_GAUGE registrations per store. The standalone listener ensures exactly one registration per store.

Key design decisions:

  • create() factory method combines construction + registration to eliminate temporal coupling (cannot forget to call register()).
  • register() both adds the listener and initializes gauges for pre-existing stores via getAllStores() — the metadata repository does not replay existing stores to newly registered listeners.
  • initializeStoreIfAbsent() uses computeIfAbsent only (no unconditional .set()) to avoid overwriting concurrent ZK events with stale snapshot data during startup initialization.
  • Store deletion resets AtomicReference<VersionInfo> to NON_EXISTING rather than removing the map entry — OTel async gauge callbacks cannot be deregistered, so this avoids duplicate callback registration on store re-creation.
  • Early return in handleStoreChanged/handleStoreDeleted when otelRepository == null skips all computation for non-OTel deployments.
  • Implements Closeable so DaVinciBackend/VeniceServer can unregister the StoreDataChangedListener from the metadata repository on shutdown. OTel async gauge callbacks are intentionally not deregistered because the SDK does not support it; the versionInfoMap remains as the source of truth and continues returning NON_EXISTING for closed stores.

Additional cleanup:

  • Extract computeFutureVersion into OtelVersionedStatsUtils — shared by AbstractVeniceAggVersionedStats.applyVersionInfo() and StoreVersionOtelStats, eliminating duplicated STARTED/PUSHED version logic. Caller is responsible for passing a non-null list (both current callers — Store.getVersions() and the existing-versions list in applyVersionInfo — are non-null by contract).
  • Add VersionInfo.NON_EXISTING sentinel constant, replacing 5 inline new VersionInfo(NON_EXISTING_VERSION, NON_EXISTING_VERSION) duplicates across OTel stats classes.
  • Make VersionInfo a final class to prevent subclassing that could break the shared sentinel pattern.
  • Clean up 3 unused NON_EXISTING_VERSION imports.

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

New tests:

  • StoreVersionOtelStatsTest (17 tests): CURRENT/FUTURE gauge values, version change updates, multi-store isolation, store delete/recreate lifecycle, handleStoreCreated delegation, future version computation (STARTED/PUSHED), no future when all ONLINE, NPE prevention (OTel disabled + plain MetricsRepository), register() initializes pre-existing stores, register() does not overwrite existing entries, latched-thread race with concurrent ZK events, delete for never-seen store is no-op, create() factory wires listener + initializes pre-existing stores, create() does not register the listener when OTel is disabled, close() unregisters the metadata listener, close() is a no-op when OTel is disabled.
  • VeniceVersionedStatsOtelMetricEntityTest: metric entity validation via ModuleMetricEntityTestFixture.
  • OtelVersionedStatsUtilsTest (6 new tests): computeFutureVersion — empty list, all ONLINE, STARTED only, PUSHED only, mixed returns max, ERROR ignored.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Copilot AI review requested due to automatic review settings April 11, 2026 22:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an OpenTelemetry counterpart to existing per-store Tehuti current_version / future_version gauges by introducing a dedicated store metadata listener that emits a single store.version ASYNC_GAUGE per store with a VERSION_ROLE dimension.

Changes:

  • Register a new StoreVersionOtelStats StoreDataChangedListener in both VeniceServer and DaVinciBackend to publish store.version (CURRENT/FUTURE).
  • Introduce VeniceVersionedStatsOtelMetricEntity and wire it into the server metric entity set (and associated entity-count test).
  • Extract shared “future version” computation into OtelVersionedStatsUtils.computeFutureVersion and add a VersionInfo.NON_EXISTING sentinel reused across OTel stats classes (with new/updated unit tests).

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
services/venice-server/src/main/java/com/linkedin/venice/server/VeniceServer.java Registers the new per-store OTel version gauge listener on the server metadata repo.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/DaVinciBackend.java Registers the same listener for DaVinci backend store repo events.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StoreVersionOtelStats.java New listener emitting store.version ASYNC_GAUGE per store with CURRENT/FUTURE role.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/VeniceVersionedStatsOtelMetricEntity.java Defines the new OTel metric entity (store.version) and its required dimensions.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java Adds the new metric entity module to the server’s exported metric entity list.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/OtelVersionedStatsUtils.java Adds VersionInfo.NON_EXISTING and shared computeFutureVersion(...) helper.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/AbstractVeniceAggVersionedStats.java Switches future-version computation to the extracted helper.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StorageEngineOtelStats.java Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/IngestionOtelStats.java Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatOtelStats.java Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStats.java Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/BlobTransferOtelStats.java Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/StoreVersionOtelStatsTest.java New unit tests covering gauge values/updates, lifecycle, and OTel-disabled safety.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/VeniceVersionedStatsOtelMetricEntityTest.java Validates the new metric entity definition.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/OtelVersionedStatsUtilsTest.java Adds unit tests for computeFutureVersion(...).
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java Updates expected server metric entity count to include the new entity.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan m-nagarajan added the needs-reviewer Looking for a reviewer to pick this up label Apr 20, 2026
Copilot AI review requested due to automatic review settings April 30, 2026 20:03
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelMetricsForVeniceVersionedStats branch from ef877e9 to ebe759e Compare April 30, 2026 20:03
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

version numbers

Add StoreVersionOtelStats — a standalone StoreDataChangedListener that emits
per-store ASYNC_GAUGE for version numbers with VERSION_ROLE dimension
(CURRENT/FUTURE).
Registered once per process in both VeniceServer and DaVinciBackend.

Also:
- Extract computeFutureVersion into OtelVersionedStatsUtils (shared by
  AbstractVeniceAggVersionedStats and StoreVersionOtelStats)
- Add VersionInfo.NON_EXISTING sentinel constant, replace 5 inline duplicates
- Clean up unused NON_EXISTING_VERSION imports in 3 OTel stats classes
Copilot AI review requested due to automatic review settings May 2, 2026 03:40
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelMetricsForVeniceVersionedStats branch from 914cd39 to ee6e5a6 Compare May 2, 2026 03:40
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup — sentinel + factory + extracted util are all good moves, and the test suite covers the lifecycle thoroughly. A few inline notes below; the lifecycle/leak items (#2, #3) are the most material. Nothing blocking.

Comment thread services/venice-server/src/main/java/com/linkedin/venice/server/VeniceServer.java Outdated
@sushantmane sushantmane added requested-changes Reviewer requested changes and removed needs-reviewer Looking for a reviewer to pick this up labels May 4, 2026
Copilot AI review requested due to automatic review settings May 5, 2026 00:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# Conflicts:
#	clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java
Copilot AI review requested due to automatic review settings May 5, 2026 19:20
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan m-nagarajan merged commit 328f1a3 into linkedin:main May 5, 2026
129 of 131 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

requested-changes Reviewer requested changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants