Skip to content

[server] Add OTel metrics to BackupVersionOptimizationServiceStats#2726

Merged
m-nagarajan merged 7 commits into
linkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForBackupVersionOptimizationStats
May 5, 2026
Merged

[server] Add OTel metrics to BackupVersionOptimizationServiceStats#2726
m-nagarajan merged 7 commits into
linkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForBackupVersionOptimizationStats

Conversation

@m-nagarajan
Copy link
Copy Markdown
Contributor

@m-nagarajan m-nagarajan commented Apr 12, 2026

Problem Statement

BackupVersionOptimizationServiceStats has 2 Tehuti OccurrenceRate sensors for tracking backup version database optimization (success + error) but no OTel counterparts, making these metrics unavailable in OTel-based monitoring dashboards.

Solution

Add 1 OTel COUNTER metric version.backup.optimization.reopen_count with STORE_NAME + OPERATION_OUTCOME (SUCCESS/FAIL) dimensions using the joint Tehuti+OTel API.

Design: Two per-store VeniceConcurrentHashMap maps (successPerStore, errorPerStore) because they bind to different Tehuti sensors (success vs error) while sharing the same OTel instrument differentiated by VeniceOperationOutcome. Single getOrCreateMetric(map, storeName, tehutiName) helper shared by both maps. Tehuti sensor is registered once (first store) and shared by all subsequent stores via registerSensorIfAbsent. When OTel is disabled, otelRepository is null and OTel recording is a no-op — the maps still populate for Tehuti.

Behavioral change 1 (affects both Tehuti and OTel): Error recording moved from per-store (once after the partition loop when errored == true) to per-partition (inside the catch block). This gives finer granularity and matches the success path, which already records per-partition. If 50 of 100 partitions fail, the old code recorded 1 error; the new code records 50. This is intentional — the old per-store granularity made success count (partitions) and error count (stores) an apples-to-oranges comparison.

Behavioral change 2 (Tehuti registration timing): Pre-PR the constructor eagerly registered both Tehuti sensors (backup_version_database_optimization, backup_version_data_optimization_error) at server startup, so metricsRepository.getMetric(...) returned a non-null handle from process start. Post-PR these sensors are registered lazily on the first per-store recording (the standard pattern used by every other joint Tehuti+OTel stats class — keeps the per-store maps as the single source of truth). Impact: between server startup and the first backup-version optimization, getMetric(...) returns null and the sensor is absent from JMX. Per-scrape consumers (Grafana / InGraphs / dashboards that look up by metric name each poll cycle) are unaffected — once the first event fires, the metric appears and reports normally. Consumers that capture the metric handle at startup and reuse it across a wait loop must move the lookup inside the lambda; the in-tree integration test (TestBackupVersionDatabaseOptimization) is updated accordingly.

Code changes

  • Added new code behind a config.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

New tests:

  • BackupVersionOptimizationServiceStatsTest (9 tests): OTel counter accumulation for success/error, outcome dimension isolation, multi-store isolation, Tehuti lazy sensor registration + recording, two-map split no-cross-contamination invariant, cluster-name argument propagation into OTel attributes (catches dimension hardcoding regressions), NPE prevention with OTel disabled and plain MetricsRepository.
  • BackupVersionOptimizationServiceTest.testPerPartitionErrorRecordingScalesWithFailingPartitions: pins per-partition error scaling — mocks 4 partitions with all reopenStoragePartition calls throwing, asserts recordBackupVersionDatabaseOptimizationError was called 4 times and recordBackupVersionDatabaseOptimization was never called. Catches a regression that hoists the error recording back outside the partition loop.
  • BackupVersionOptimizationOtelMetricEntityTest: metric entity validation.
  • BackupVersionOptimizationTehutiMetricNameTest: Tehuti name validation.
  • VeniceOperationOutcomeTest: dimension value validation (SUCCESS/FAIL).

Modified:

  • TestBackupVersionDatabaseOptimization (integration test): moved metricsRepository.getMetric(...) for the optimization Tehuti sensor inside the waitForNonDeterministicAssertion lambda + added an assertNotNull guard, since Tehuti sensors are now registered lazily on first per-store recording (see Behavioral change 2).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.

  • Yes. Clearly explain the behavior change and its impact.

  • The Tehuti backup_version_data_optimization_error OccurrenceRate now records per-partition instead of per-store. Dashboards showing error rate will see higher values when multiple partitions fail in the same optimization cycle. This aligns error counting with the existing success counting granularity (already per-partition).

  • Tehuti sensors backup_version_database_optimization and backup_version_data_optimization_error are now registered lazily on first per-store recording rather than eagerly at server startup. Per-scrape monitoring consumers (Grafana / InGraphs) are unaffected. Consumers that capture the Tehuti metric handle once at startup must do the lookup per-scrape (or post-first-event) — the in-tree integration test was updated accordingly.

Copilot AI review requested due to automatic review settings April 12, 2026 03:43
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds OpenTelemetry visibility for backup-version DB optimization by introducing a new server OTel counter and wiring per-store + outcome dimensioned recording through the joint Tehuti+OTel stats framework.

Changes:

  • Introduce OTel metric entity version.backup.optimization.reopen_count (COUNTER) with VENICE_STORE_NAME + VENICE_OPERATION_OUTCOME dimensions.
  • Update BackupVersionOptimizationServiceStats and BackupVersionOptimizationService to record success/error per store (and errors per partition).
  • Add new VENICE_OPERATION_OUTCOME dimension enum + naming tests, and add unit tests covering OTel/Tehuti behavior.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
services/venice-server/src/main/java/com/linkedin/venice/stats/BackupVersionOptimizationServiceStats.java Adds joint Tehuti+OTel per-store metric state and records success/fail via outcome dimension.
services/venice-server/src/main/java/com/linkedin/venice/cleaner/BackupVersionOptimizationService.java Records optimization success/error with store context; errors now recorded per failing partition.
services/venice-server/src/main/java/com/linkedin/venice/server/VeniceServer.java Passes cluster name into the updated stats constructor.
services/venice-server/src/test/java/com/linkedin/venice/stats/BackupVersionOptimizationServiceStatsTest.java Unit tests for OTel counter accumulation, dimension isolation, Tehuti sensor registration, and OTel-disabled safety.
services/venice-server/src/test/java/com/linkedin/venice/stats/BackupVersionOptimizationTehutiMetricNameTest.java Validates Tehuti metric name enum mappings for this stats class.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceOperationOutcome.java Adds generic SUCCESS/FAIL outcome dimension enum.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java Adds VENICE_OPERATION_OUTCOME dimension key definition.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceOperationOutcomeTest.java Validates outcome dimension values and dimension-name binding.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java Extends dimension naming-format tests to cover VENICE_OPERATION_OUTCOME.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/BackupVersionOptimizationOtelMetricEntity.java Defines the new OTel metric entity and required dimensions.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java Registers the new metric entity in the server metric entity set.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/BackupVersionOptimizationOtelMetricEntityTest.java Validates metric entity definition (name/type/unit/description/dimensions).
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java Updates expected server metric entity count after adding the new entity.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan m-nagarajan added the needs-reviewer Looking for a reviewer to pick this up label Apr 20, 2026
Add 1 OTel COUNTER metric: version.backup.optimization.reopen_count
with STORE_NAME + OPERATION_OUTCOME (SUCCESS/FAIL) dimensions.

Joint Tehuti+OTel API with per-store maps. Two maps (successPerStore,
errorPerStore) because they bind to different Tehuti sensors while
sharing the same OTel instrument. Tehuti sensor registered once (first
store) and shared by all subsequent stores via registerSensorIfAbsent.

Also moves error recording from per-store (once after partition loop)
to per-partition (inside catch block) for finer granularity.
- Fix Javadoc grammar: "record" -> "records"
- Merge getOrCreateSuccessMetric/getOrCreateErrorMetric into single
getOrCreateMetric
Copilot AI review requested due to automatic review settings April 30, 2026 21:52
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelMetricsForBackupVersionOptimizationStats branch from ffa7973 to b348f3a Compare April 30, 2026 21:52
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sushantmane sushantmane added review-in-progress Reviewer is actively reviewing and removed needs-reviewer Looking for a reviewer to pick this up labels May 4, 2026
Copy link
Copy Markdown
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with help from a few specialized review agents. Four actionable items below. The two-map split, +1 OTel + Tehuti recording, and concurrency look correct. One important regression worth blocking on: the lazy Tehuti registration breaks the existing integration test.

Separate from the inline comments: the PR is currently CONFLICTING with main and will need a rebase before merge.

mnagaraj/addOtelMetricsForBackupVersionOptimizationStats

# Conflicts:
#	clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java
#	clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java
#	internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java
#	internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java
Copilot AI review requested due to automatic review settings May 5, 2026 05:13
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mnagaraj/addOtelMetricsForBackupVersionOptimizationStats

# Conflicts:
#	clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java
#	clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java
#	internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java
#	internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java
Copy link
Copy Markdown
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sushantmane sushantmane added approved PR has been approved and removed review-in-progress Reviewer is actively reviewing labels May 5, 2026
@m-nagarajan m-nagarajan merged commit 3248bf9 into linkedin:main May 5, 2026
106 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved PR has been approved

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants