[da-vinci] Add OTel metrics to NativeMetadataRepositoryStats#2729
Conversation
There was a problem hiding this comment.
Pull request overview
Adds an OpenTelemetry (OTel) counterpart to NativeMetadataRepositoryStats so metadata staleness is available in OTel-based dashboards, aligning with existing Tehuti and fast-client metric naming patterns.
Changes:
- Introduced a new per-store OTel async double gauge (
metadata.staleness_duration) keyed bySTORE_NAME, emitted fromNativeMetadataRepositoryStats. - Registered the new metric entity in the server metric entity aggregation and updated the entity count test.
- Added unit tests validating per-store staleness behavior, removal behavior (NaN → absent point), and safety when OTel is disabled / using a plain
MetricsRepository.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/NativeMetadataRepositoryStats.java | Adds lazy per-store OTel gauge registration and per-store staleness computation. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/NativeMetadataRepositoryOtelMetricEntity.java | Defines the new OTel metric entity (metadata.staleness_duration). |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Registers the new metric entity enum for server metrics. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/NativeMetadataRepositoryStatsOtelTest.java | New tests for OTel emission semantics and edge cases. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/NativeMetadataRepositoryOtelMetricEntityTest.java | Validates metric entity definition (name/type/unit/dimensions/description). |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates expected metric entity count after adding the new entity. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
502671f to
7aaecd0
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
368e916 to
d7db065
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add per-store ASYNC_DOUBLE_GAUGE metric: metadata.staleness_duration with STORE_NAME dimension. Returns time in ms since the store metadata was last fetched from the meta system store. Returns NaN for removed stores (OTel SDK drops the data point — store disappears from dashboards). Tehuti unchanged: single high-watermark gauge across all stores. OTel: per-store staleness — backends compute max at query time.
- Fix inline comment: ASYNC_GAUGE -> ASYNC_DOUBLE_GAUGE - Add Javadoc to updateCacheTimestamp documenting clusterName is only used on first call per store
dd817ef to
c256e6f
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Problem Statement
NativeMetadataRepositoryStatshas a single Tehuti AsyncGauge for metadata staleness (high watermark across all stores) but no OTel counterpart, making this metric unavailable in OTel-based monitoring dashboards.Solution
Add a per-store
ASYNC_DOUBLE_GAUGEOTel metricmetadata.staleness_durationwithSTORE_NAMEdimension. Returns time in milliseconds since the store metadata was last fetched from the meta system store. UsesDoubleSuppliercallback so removed stores returnDouble.NaN— the OTel SDK drops NaN data points entirely, so removed stores disappear from dashboards (no stale 0 values).Design:
NaNwhen emptyASYNC_DOUBLE_GAUGE— backends can compute max at query time via aggregationmetadata.staleness_durationmatches the fast client's existing metric (different prefix:venice.server.*vsvenice.fast_client.*)computeIfAbsenton firstupdateCacheTimestampcall, bounded by number of subscribed storesCode changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
New tests:
NativeMetadataRepositoryStatsOtelTest(7 tests): per-store staleness, clock advance, cache refresh, store removal (NaN → absent data point), multi-store isolation, NPE prevention with OTel disabled and plain MetricsRepository.NativeMetadataRepositoryOtelMetricEntityTest: metric entity validation.Does this PR introduce any user-facing or breaking changes?