[server] Add OTel metrics to BackupVersionOptimizationServiceStats by m-nagarajan · Pull Request #2726 · linkedin/venice

m-nagarajan · 2026-04-12T03:43:34Z

Problem Statement

BackupVersionOptimizationServiceStats has 2 Tehuti OccurrenceRate sensors for tracking backup version database optimization (success + error) but no OTel counterparts, making these metrics unavailable in OTel-based monitoring dashboards.

Solution

Add 1 OTel COUNTER metric version.backup.optimization.reopen_count with STORE_NAME + OPERATION_OUTCOME (SUCCESS/FAIL) dimensions using the joint Tehuti+OTel API.

Design: Two per-store VeniceConcurrentHashMap maps (successPerStore, errorPerStore) because they bind to different Tehuti sensors (success vs error) while sharing the same OTel instrument differentiated by VeniceOperationOutcome. Single getOrCreateMetric(map, storeName, tehutiName) helper shared by both maps. Tehuti sensor is registered once (first store) and shared by all subsequent stores via registerSensorIfAbsent. When OTel is disabled, otelRepository is null and OTel recording is a no-op — the maps still populate for Tehuti.

Behavioral change 1 (affects both Tehuti and OTel): Error recording moved from per-store (once after the partition loop when errored == true) to per-partition (inside the catch block). This gives finer granularity and matches the success path, which already records per-partition. If 50 of 100 partitions fail, the old code recorded 1 error; the new code records 50. This is intentional — the old per-store granularity made success count (partitions) and error count (stores) an apples-to-oranges comparison.

Behavioral change 2 (Tehuti registration timing): Pre-PR the constructor eagerly registered both Tehuti sensors (backup_version_database_optimization, backup_version_data_optimization_error) at server startup, so metricsRepository.getMetric(...) returned a non-null handle from process start. Post-PR these sensors are registered lazily on the first per-store recording (the standard pattern used by every other joint Tehuti+OTel stats class — keeps the per-store maps as the single source of truth). Impact: between server startup and the first backup-version optimization, getMetric(...) returns null and the sensor is absent from JMX. Per-scrape consumers (Grafana / InGraphs / dashboards that look up by metric name each poll cycle) are unaffected — once the first event fires, the metric appears and reports normally. Consumers that capture the metric handle at startup and reuse it across a wait loop must move the lookup inside the lambda; the in-tree integration test (TestBackupVersionDatabaseOptimization) is updated accordingly.

Code changes

Added new code behind a config.
Introduced new log lines.
- Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

Code has no race conditions or thread safety issues.
Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

New unit tests added.
New integration tests added.
Modified or extended existing tests.
Verified backward compatibility (if applicable).

New tests:

BackupVersionOptimizationServiceStatsTest (9 tests): OTel counter accumulation for success/error, outcome dimension isolation, multi-store isolation, Tehuti lazy sensor registration + recording, two-map split no-cross-contamination invariant, cluster-name argument propagation into OTel attributes (catches dimension hardcoding regressions), NPE prevention with OTel disabled and plain MetricsRepository.
BackupVersionOptimizationServiceTest.testPerPartitionErrorRecordingScalesWithFailingPartitions: pins per-partition error scaling — mocks 4 partitions with all reopenStoragePartition calls throwing, asserts recordBackupVersionDatabaseOptimizationError was called 4 times and recordBackupVersionDatabaseOptimization was never called. Catches a regression that hoists the error recording back outside the partition loop.
BackupVersionOptimizationOtelMetricEntityTest: metric entity validation.
BackupVersionOptimizationTehutiMetricNameTest: Tehuti name validation.
VeniceOperationOutcomeTest: dimension value validation (SUCCESS/FAIL).

Modified:

TestBackupVersionDatabaseOptimization (integration test): moved metricsRepository.getMetric(...) for the optimization Tehuti sensor inside the waitForNonDeterministicAssertion lambda + added an assertNotNull guard, since Tehuti sensors are now registered lazily on first per-store recording (see Behavioral change 2).

Does this PR introduce any user-facing or breaking changes?

No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.
The Tehuti backup_version_data_optimization_error OccurrenceRate now records per-partition instead of per-store. Dashboards showing error rate will see higher values when multiple partitions fail in the same optimization cycle. This aligns error counting with the existing success counting granularity (already per-partition).
Tehuti sensors backup_version_database_optimization and backup_version_data_optimization_error are now registered lazily on first per-store recording rather than eagerly at server startup. Per-scrape monitoring consumers (Grafana / InGraphs) are unaffected. Consumers that capture the Tehuti metric handle once at startup must do the lookup per-scrape (or post-first-event) — the in-tree integration test was updated accordingly.

Copilot

Pull request overview

Adds OpenTelemetry visibility for backup-version DB optimization by introducing a new server OTel counter and wiring per-store + outcome dimensioned recording through the joint Tehuti+OTel stats framework.

Changes:

Introduce OTel metric entity version.backup.optimization.reopen_count (COUNTER) with VENICE_STORE_NAME + VENICE_OPERATION_OUTCOME dimensions.
Update BackupVersionOptimizationServiceStats and BackupVersionOptimizationService to record success/error per store (and errors per partition).
Add new VENICE_OPERATION_OUTCOME dimension enum + naming tests, and add unit tests covering OTel/Tehuti behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
services/venice-server/src/main/java/com/linkedin/venice/stats/BackupVersionOptimizationServiceStats.java	Adds joint Tehuti+OTel per-store metric state and records success/fail via outcome dimension.
services/venice-server/src/main/java/com/linkedin/venice/cleaner/BackupVersionOptimizationService.java	Records optimization success/error with store context; errors now recorded per failing partition.
services/venice-server/src/main/java/com/linkedin/venice/server/VeniceServer.java	Passes cluster name into the updated stats constructor.
services/venice-server/src/test/java/com/linkedin/venice/stats/BackupVersionOptimizationServiceStatsTest.java	Unit tests for OTel counter accumulation, dimension isolation, Tehuti sensor registration, and OTel-disabled safety.
services/venice-server/src/test/java/com/linkedin/venice/stats/BackupVersionOptimizationTehutiMetricNameTest.java	Validates Tehuti metric name enum mappings for this stats class.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceOperationOutcome.java	Adds generic SUCCESS/FAIL outcome dimension enum.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java	Adds `VENICE_OPERATION_OUTCOME` dimension key definition.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceOperationOutcomeTest.java	Validates outcome dimension values and dimension-name binding.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java	Extends dimension naming-format tests to cover `VENICE_OPERATION_OUTCOME`.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/BackupVersionOptimizationOtelMetricEntity.java	Defines the new OTel metric entity and required dimensions.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java	Registers the new metric entity in the server metric entity set.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/BackupVersionOptimizationOtelMetricEntityTest.java	Validates metric entity definition (name/type/unit/description/dimensions).
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java	Updates expected server metric entity count after adding the new entity.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Add 1 OTel COUNTER metric: version.backup.optimization.reopen_count with STORE_NAME + OPERATION_OUTCOME (SUCCESS/FAIL) dimensions. Joint Tehuti+OTel API with per-store maps. Two maps (successPerStore, errorPerStore) because they bind to different Tehuti sensors while sharing the same OTel instrument. Tehuti sensor registered once (first store) and shared by all subsequent stores via registerSensorIfAbsent. Also moves error recording from per-store (once after partition loop) to per-partition (inside catch block) for finer granularity.

- Fix Javadoc grammar: "record" -> "records" - Merge getOrCreateSuccessMetric/getOrCreateErrorMetric into single getOrCreateMetric

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sushantmane

Reviewed with help from a few specialized review agents. Four actionable items below. The two-map split, +1 OTel + Tehuti recording, and concurrency look correct. One important regression worth blocking on: the lazy Tehuti registration breaks the existing integration test.

Separate from the inline comments: the PR is currently CONFLICTING with main and will need a rebase before merge.

mnagaraj/addOtelMetricsForBackupVersionOptimizationStats # Conflicts: # clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java # clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java # internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java # internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java

Copilot

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mnagaraj/addOtelMetricsForBackupVersionOptimizationStats # Conflicts: # clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java # clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java # internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java # internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java

sushantmane

LGTM

Copilot AI review requested due to automatic review settings April 12, 2026 03:43

Copilot started reviewing on behalf of m-nagarajan April 12, 2026 03:44 View session

Copilot AI reviewed Apr 12, 2026

View reviewed changes

Comment thread ...ce-server/src/main/java/com/linkedin/venice/stats/BackupVersionOptimizationServiceStats.java Outdated

m-nagarajan commented Apr 13, 2026

View reviewed changes

Comment thread ...enice-server/src/main/java/com/linkedin/venice/cleaner/BackupVersionOptimizationService.java

m-nagarajan commented Apr 13, 2026

View reviewed changes

Comment thread ...ce-server/src/main/java/com/linkedin/venice/stats/BackupVersionOptimizationServiceStats.java Outdated

m-nagarajan commented Apr 13, 2026

View reviewed changes

Comment thread ...ce-server/src/main/java/com/linkedin/venice/stats/BackupVersionOptimizationServiceStats.java

m-nagarajan added the needs-reviewer Looking for a reviewer to pick this up label Apr 20, 2026

m-nagarajan added 2 commits April 30, 2026 14:49

address review comments

b348f3a

- Fix Javadoc grammar: "record" -> "records" - Merge getOrCreateSuccessMetric/getOrCreateErrorMetric into single getOrCreateMetric

Copilot AI review requested due to automatic review settings April 30, 2026 21:52

m-nagarajan force-pushed the mnagaraj/addOtelMetricsForBackupVersionOptimizationStats branch from ffa7973 to b348f3a Compare April 30, 2026 21:52

Copilot started reviewing on behalf of m-nagarajan April 30, 2026 21:53 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

Comment thread ...ce-server/src/main/java/com/linkedin/venice/stats/BackupVersionOptimizationServiceStats.java

address review commetns

b2cc7c9

sushantmane added review-in-progress Reviewer is actively reviewing and removed needs-reviewer Looking for a reviewer to pick this up labels May 4, 2026

sushantmane reviewed May 4, 2026

View reviewed changes

m-nagarajan added 2 commits May 4, 2026 20:52

address review comments

e195e90

Copilot AI review requested due to automatic review settings May 5, 2026 05:13

Copilot started reviewing on behalf of m-nagarajan May 5, 2026 05:14 View session

Copilot AI reviewed May 5, 2026

View reviewed changes

m-nagarajan added 2 commits May 5, 2026 10:32

address review comments

0e9da27

sushantmane approved these changes May 5, 2026

View reviewed changes

sushantmane added approved PR has been approved and removed review-in-progress Reviewer is actively reviewing labels May 5, 2026

m-nagarajan merged commit 3248bf9 into linkedin:main May 5, 2026
106 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[server] Add OTel metrics to BackupVersionOptimizationServiceStats#2726

[server] Add OTel metrics to BackupVersionOptimizationServiceStats#2726
m-nagarajan merged 7 commits into
linkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForBackupVersionOptimizationStats

m-nagarajan commented Apr 12, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

sushantmane left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sushantmane left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

m-nagarajan commented Apr 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem Statement

Solution

Code changes

Concurrency-Specific Checks

How was this PR tested?

Does this PR introduce any user-facing or breaking changes?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

sushantmane left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sushantmane left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

m-nagarajan commented Apr 12, 2026 •

edited

Loading