[da-vinci][server] Add OTel store.version ASYNC_GAUGE for current/future version numbers#2722
Conversation
There was a problem hiding this comment.
Pull request overview
Adds an OpenTelemetry counterpart to existing per-store Tehuti current_version / future_version gauges by introducing a dedicated store metadata listener that emits a single store.version ASYNC_GAUGE per store with a VERSION_ROLE dimension.
Changes:
- Register a new
StoreVersionOtelStatsStoreDataChangedListenerin bothVeniceServerandDaVinciBackendto publishstore.version(CURRENT/FUTURE). - Introduce
VeniceVersionedStatsOtelMetricEntityand wire it into the server metric entity set (and associated entity-count test). - Extract shared “future version” computation into
OtelVersionedStatsUtils.computeFutureVersionand add aVersionInfo.NON_EXISTINGsentinel reused across OTel stats classes (with new/updated unit tests).
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| services/venice-server/src/main/java/com/linkedin/venice/server/VeniceServer.java | Registers the new per-store OTel version gauge listener on the server metadata repo. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/DaVinciBackend.java | Registers the same listener for DaVinci backend store repo events. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StoreVersionOtelStats.java | New listener emitting store.version ASYNC_GAUGE per store with CURRENT/FUTURE role. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/VeniceVersionedStatsOtelMetricEntity.java | Defines the new OTel metric entity (store.version) and its required dimensions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Adds the new metric entity module to the server’s exported metric entity list. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/OtelVersionedStatsUtils.java | Adds VersionInfo.NON_EXISTING and shared computeFutureVersion(...) helper. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/AbstractVeniceAggVersionedStats.java | Switches future-version computation to the extracted helper. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StorageEngineOtelStats.java | Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/IngestionOtelStats.java | Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatOtelStats.java | Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStats.java | Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/BlobTransferOtelStats.java | Replaces inline non-existing sentinel with VersionInfo.NON_EXISTING and removes unused import. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/StoreVersionOtelStatsTest.java | New unit tests covering gauge values/updates, lifecycle, and OTel-disabled safety. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/VeniceVersionedStatsOtelMetricEntityTest.java | Validates the new metric entity definition. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/OtelVersionedStatsUtilsTest.java | Adds unit tests for computeFutureVersion(...). |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates expected server metric entity count to include the new entity. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ef877e9 to
ebe759e
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
version numbers Add StoreVersionOtelStats — a standalone StoreDataChangedListener that emits per-store ASYNC_GAUGE for version numbers with VERSION_ROLE dimension (CURRENT/FUTURE). Registered once per process in both VeniceServer and DaVinciBackend. Also: - Extract computeFutureVersion into OtelVersionedStatsUtils (shared by AbstractVeniceAggVersionedStats and StoreVersionOtelStats) - Add VersionInfo.NON_EXISTING sentinel constant, replace 5 inline duplicates - Clean up unused NON_EXISTING_VERSION imports in 3 OTel stats classes
914cd39 to
ee6e5a6
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
mnagaraj/addOtelMetricsForVeniceVersionedStats
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
# Conflicts: # clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Problem Statement
Venice stores track which version number is current and which is future for each store via Tehuti
current_versionandfuture_versionAsyncGauge sensors inVeniceVersionedStatsReporter. These metrics have no OTel counterpart, making them unavailable in OTel-based monitoring dashboards.Solution
Add
StoreVersionOtelStats— a standaloneStoreDataChangedListenerthat registers OTel ASYNC_GAUGE callbacks per store for current and future version numbers, withVERSION_ROLEdimension (CURRENT/FUTURE). Created viaStoreVersionOtelStats.create()factory method in bothVeniceServerandDaVinciBackend.Design choice: A standalone listener was chosen over adding OTel to
VeniceVersionedStatsReporterbecause each store has 6VeniceVersionedStatsReporterinstances (one perAbstractVeniceAggVersionedStatssubclass). Putting OTel in the reporter would create 6 duplicate ASYNC_GAUGE registrations per store. The standalone listener ensures exactly one registration per store.Key design decisions:
create()factory method combines construction + registration to eliminate temporal coupling (cannot forget to callregister()).register()both adds the listener and initializes gauges for pre-existing stores viagetAllStores()— the metadata repository does not replay existing stores to newly registered listeners.initializeStoreIfAbsent()usescomputeIfAbsentonly (no unconditional.set()) to avoid overwriting concurrent ZK events with stale snapshot data during startup initialization.AtomicReference<VersionInfo>toNON_EXISTINGrather than removing the map entry — OTel async gauge callbacks cannot be deregistered, so this avoids duplicate callback registration on store re-creation.handleStoreChanged/handleStoreDeletedwhenotelRepository == nullskips all computation for non-OTel deployments.CloseablesoDaVinciBackend/VeniceServercan unregister theStoreDataChangedListenerfrom the metadata repository on shutdown. OTel async gauge callbacks are intentionally not deregistered because the SDK does not support it; theversionInfoMapremains as the source of truth and continues returningNON_EXISTINGfor closed stores.Additional cleanup:
computeFutureVersionintoOtelVersionedStatsUtils— shared byAbstractVeniceAggVersionedStats.applyVersionInfo()andStoreVersionOtelStats, eliminating duplicated STARTED/PUSHED version logic. Caller is responsible for passing a non-null list (both current callers —Store.getVersions()and the existing-versions list inapplyVersionInfo— are non-null by contract).VersionInfo.NON_EXISTINGsentinel constant, replacing 5 inlinenew VersionInfo(NON_EXISTING_VERSION, NON_EXISTING_VERSION)duplicates across OTel stats classes.VersionInfoafinalclass to prevent subclassing that could break the shared sentinel pattern.NON_EXISTING_VERSIONimports.Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
New tests:
StoreVersionOtelStatsTest(17 tests): CURRENT/FUTURE gauge values, version change updates, multi-store isolation, store delete/recreate lifecycle,handleStoreCreateddelegation, future version computation (STARTED/PUSHED), no future when all ONLINE, NPE prevention (OTel disabled + plain MetricsRepository),register()initializes pre-existing stores,register()does not overwrite existing entries, latched-thread race with concurrent ZK events, delete for never-seen store is no-op,create()factory wires listener + initializes pre-existing stores,create()does not register the listener when OTel is disabled,close()unregisters the metadata listener,close()is a no-op when OTel is disabled.VeniceVersionedStatsOtelMetricEntityTest: metric entity validation viaModuleMetricEntityTestFixture.OtelVersionedStatsUtilsTest(6 new tests):computeFutureVersion— empty list, all ONLINE, STARTED only, PUSHED only, mixed returns max, ERROR ignored.Does this PR introduce any user-facing or breaking changes?