Skip to content

[da-vinci] Add OTel metrics to NativeMetadataRepositoryStats#2729

Merged
m-nagarajan merged 4 commits intolinkedin:mainfrom
m-nagarajan:mnagaraj/addOtelForNativeMetadataRepositoryStats
May 4, 2026
Merged

[da-vinci] Add OTel metrics to NativeMetadataRepositoryStats#2729
m-nagarajan merged 4 commits intolinkedin:mainfrom
m-nagarajan:mnagaraj/addOtelForNativeMetadataRepositoryStats

Conversation

@m-nagarajan
Copy link
Copy Markdown
Contributor

Problem Statement

NativeMetadataRepositoryStats has a single Tehuti AsyncGauge for metadata staleness (high watermark across all stores) but no OTel counterpart, making this metric unavailable in OTel-based monitoring dashboards.

Solution

Add a per-store ASYNC_DOUBLE_GAUGE OTel metric metadata.staleness_duration with STORE_NAME dimension. Returns time in milliseconds since the store metadata was last fetched from the meta system store. Uses DoubleSupplier callback so removed stores return Double.NaN — the OTel SDK drops NaN data points entirely, so removed stores disappear from dashboards (no stale 0 values).

Design:

  • Tehuti (unchanged): single high-watermark gauge — oldest staleness across all stores, returns NaN when empty
  • OTel (new): per-store ASYNC_DOUBLE_GAUGE — backends can compute max at query time via aggregation
  • Metric name metadata.staleness_duration matches the fast client's existing metric (different prefix: venice.server.* vs venice.fast_client.*)
  • Per-store callbacks registered lazily in computeIfAbsent on first updateCacheTimestamp call, bounded by number of subscribed stores

Code changes

  • Added new code behind a config.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

New tests:

  • NativeMetadataRepositoryStatsOtelTest (7 tests): per-store staleness, clock advance, cache refresh, store removal (NaN → absent data point), multi-store isolation, NPE prevention with OTel disabled and plain MetricsRepository.
  • NativeMetadataRepositoryOtelMetricEntityTest: metric entity validation.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Copilot AI review requested due to automatic review settings April 13, 2026 01:45
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an OpenTelemetry (OTel) counterpart to NativeMetadataRepositoryStats so metadata staleness is available in OTel-based dashboards, aligning with existing Tehuti and fast-client metric naming patterns.

Changes:

  • Introduced a new per-store OTel async double gauge (metadata.staleness_duration) keyed by STORE_NAME, emitted from NativeMetadataRepositoryStats.
  • Registered the new metric entity in the server metric entity aggregation and updated the entity count test.
  • Added unit tests validating per-store staleness behavior, removal behavior (NaN → absent point), and safety when OTel is disabled / using a plain MetricsRepository.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/NativeMetadataRepositoryStats.java Adds lazy per-store OTel gauge registration and per-store staleness computation.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/NativeMetadataRepositoryOtelMetricEntity.java Defines the new OTel metric entity (metadata.staleness_duration).
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java Registers the new metric entity enum for server metrics.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/NativeMetadataRepositoryStatsOtelTest.java New tests for OTel emission semantics and edge cases.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/NativeMetadataRepositoryOtelMetricEntityTest.java Validates metric entity definition (name/type/unit/dimensions/description).
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java Updates expected metric entity count after adding the new entity.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelForNativeMetadataRepositoryStats branch from 502671f to 7aaecd0 Compare April 13, 2026 01:50
Copilot AI review requested due to automatic review settings April 16, 2026 20:27
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan m-nagarajan added the needs-reviewer Looking for a reviewer to pick this up label Apr 20, 2026
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelForNativeMetadataRepositoryStats branch from 368e916 to d7db065 Compare April 30, 2026 21:22
Copilot AI review requested due to automatic review settings April 30, 2026 21:34
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Add per-store ASYNC_DOUBLE_GAUGE metric: metadata.staleness_duration
with STORE_NAME dimension. Returns time in ms since the store metadata
was last fetched from the meta system store. Returns NaN for removed
stores (OTel SDK drops the data point — store disappears from dashboards).

Tehuti unchanged: single high-watermark gauge across all stores.
OTel: per-store staleness — backends compute max at query time.
- Fix inline comment: ASYNC_GAUGE -> ASYNC_DOUBLE_GAUGE
- Add Javadoc to updateCacheTimestamp documenting clusterName is only used on
first call per store
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelForNativeMetadataRepositoryStats branch from dd817ef to c256e6f Compare May 1, 2026 00:02
Copilot AI review requested due to automatic review settings May 1, 2026 00:08
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sushantmane sushantmane added review-in-progress Reviewer is actively reviewing and removed needs-reviewer Looking for a reviewer to pick this up labels May 4, 2026
Copy link
Copy Markdown
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@sushantmane sushantmane added approved PR has been approved and removed review-in-progress Reviewer is actively reviewing labels May 4, 2026
@m-nagarajan m-nagarajan merged commit c2038ec into linkedin:main May 4, 2026
110 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved PR has been approved

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants