[server] Add OTel metrics to BackupVersionOptimizationServiceStats#2726
Conversation
There was a problem hiding this comment.
Pull request overview
Adds OpenTelemetry visibility for backup-version DB optimization by introducing a new server OTel counter and wiring per-store + outcome dimensioned recording through the joint Tehuti+OTel stats framework.
Changes:
- Introduce OTel metric entity
version.backup.optimization.reopen_count(COUNTER) withVENICE_STORE_NAME+VENICE_OPERATION_OUTCOMEdimensions. - Update
BackupVersionOptimizationServiceStatsandBackupVersionOptimizationServiceto record success/error per store (and errors per partition). - Add new
VENICE_OPERATION_OUTCOMEdimension enum + naming tests, and add unit tests covering OTel/Tehuti behavior.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| services/venice-server/src/main/java/com/linkedin/venice/stats/BackupVersionOptimizationServiceStats.java | Adds joint Tehuti+OTel per-store metric state and records success/fail via outcome dimension. |
| services/venice-server/src/main/java/com/linkedin/venice/cleaner/BackupVersionOptimizationService.java | Records optimization success/error with store context; errors now recorded per failing partition. |
| services/venice-server/src/main/java/com/linkedin/venice/server/VeniceServer.java | Passes cluster name into the updated stats constructor. |
| services/venice-server/src/test/java/com/linkedin/venice/stats/BackupVersionOptimizationServiceStatsTest.java | Unit tests for OTel counter accumulation, dimension isolation, Tehuti sensor registration, and OTel-disabled safety. |
| services/venice-server/src/test/java/com/linkedin/venice/stats/BackupVersionOptimizationTehutiMetricNameTest.java | Validates Tehuti metric name enum mappings for this stats class. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceOperationOutcome.java | Adds generic SUCCESS/FAIL outcome dimension enum. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java | Adds VENICE_OPERATION_OUTCOME dimension key definition. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceOperationOutcomeTest.java | Validates outcome dimension values and dimension-name binding. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java | Extends dimension naming-format tests to cover VENICE_OPERATION_OUTCOME. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/BackupVersionOptimizationOtelMetricEntity.java | Defines the new OTel metric entity and required dimensions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Registers the new metric entity in the server metric entity set. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/BackupVersionOptimizationOtelMetricEntityTest.java | Validates metric entity definition (name/type/unit/description/dimensions). |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates expected server metric entity count after adding the new entity. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add 1 OTel COUNTER metric: version.backup.optimization.reopen_count with STORE_NAME + OPERATION_OUTCOME (SUCCESS/FAIL) dimensions. Joint Tehuti+OTel API with per-store maps. Two maps (successPerStore, errorPerStore) because they bind to different Tehuti sensors while sharing the same OTel instrument. Tehuti sensor registered once (first store) and shared by all subsequent stores via registerSensorIfAbsent. Also moves error recording from per-store (once after partition loop) to per-partition (inside catch block) for finer granularity.
- Fix Javadoc grammar: "record" -> "records" - Merge getOrCreateSuccessMetric/getOrCreateErrorMetric into single getOrCreateMetric
ffa7973 to
b348f3a
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sushantmane
left a comment
There was a problem hiding this comment.
Reviewed with help from a few specialized review agents. Four actionable items below. The two-map split, +1 OTel + Tehuti recording, and concurrency look correct. One important regression worth blocking on: the lazy Tehuti registration breaks the existing integration test.
Separate from the inline comments: the PR is currently CONFLICTING with main and will need a rebase before merge.
mnagaraj/addOtelMetricsForBackupVersionOptimizationStats # Conflicts: # clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java # clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java # internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java # internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
mnagaraj/addOtelMetricsForBackupVersionOptimizationStats # Conflicts: # clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java # clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java # internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java # internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java
Problem Statement
BackupVersionOptimizationServiceStatshas 2 Tehuti OccurrenceRate sensors for tracking backup version database optimization (success + error) but no OTel counterparts, making these metrics unavailable in OTel-based monitoring dashboards.Solution
Add 1 OTel COUNTER metric
version.backup.optimization.reopen_countwithSTORE_NAME+OPERATION_OUTCOME(SUCCESS/FAIL) dimensions using the joint Tehuti+OTel API.Design: Two per-store
VeniceConcurrentHashMapmaps (successPerStore,errorPerStore) because they bind to different Tehuti sensors (success vs error) while sharing the same OTel instrument differentiated byVeniceOperationOutcome. SinglegetOrCreateMetric(map, storeName, tehutiName)helper shared by both maps. Tehuti sensor is registered once (first store) and shared by all subsequent stores viaregisterSensorIfAbsent. When OTel is disabled,otelRepositoryis null and OTel recording is a no-op — the maps still populate for Tehuti.Behavioral change 1 (affects both Tehuti and OTel): Error recording moved from per-store (once after the partition loop when
errored == true) to per-partition (inside the catch block). This gives finer granularity and matches the success path, which already records per-partition. If 50 of 100 partitions fail, the old code recorded 1 error; the new code records 50. This is intentional — the old per-store granularity made success count (partitions) and error count (stores) an apples-to-oranges comparison.Behavioral change 2 (Tehuti registration timing): Pre-PR the constructor eagerly registered both Tehuti sensors (
backup_version_database_optimization,backup_version_data_optimization_error) at server startup, sometricsRepository.getMetric(...)returned a non-null handle from process start. Post-PR these sensors are registered lazily on the first per-store recording (the standard pattern used by every other joint Tehuti+OTel stats class — keeps the per-store maps as the single source of truth). Impact: between server startup and the first backup-version optimization,getMetric(...)returns null and the sensor is absent from JMX. Per-scrape consumers (Grafana / InGraphs / dashboards that look up by metric name each poll cycle) are unaffected — once the first event fires, the metric appears and reports normally. Consumers that capture the metric handle at startup and reuse it across a wait loop must move the lookup inside the lambda; the in-tree integration test (TestBackupVersionDatabaseOptimization) is updated accordingly.Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
New tests:
BackupVersionOptimizationServiceStatsTest(9 tests): OTel counter accumulation for success/error, outcome dimension isolation, multi-store isolation, Tehuti lazy sensor registration + recording, two-map split no-cross-contamination invariant, cluster-name argument propagation into OTel attributes (catches dimension hardcoding regressions), NPE prevention with OTel disabled and plain MetricsRepository.BackupVersionOptimizationServiceTest.testPerPartitionErrorRecordingScalesWithFailingPartitions: pins per-partition error scaling — mocks 4 partitions with allreopenStoragePartitioncalls throwing, assertsrecordBackupVersionDatabaseOptimizationErrorwas called 4 times andrecordBackupVersionDatabaseOptimizationwas never called. Catches a regression that hoists the error recording back outside the partition loop.BackupVersionOptimizationOtelMetricEntityTest: metric entity validation.BackupVersionOptimizationTehutiMetricNameTest: Tehuti name validation.VeniceOperationOutcomeTest: dimension value validation (SUCCESS/FAIL).Modified:
TestBackupVersionDatabaseOptimization(integration test): movedmetricsRepository.getMetric(...)for the optimization Tehuti sensor inside thewaitForNonDeterministicAssertionlambda + added anassertNotNullguard, since Tehuti sensors are now registered lazily on first per-store recording (see Behavioral change 2).Does this PR introduce any user-facing or breaking changes?
No. You can skip the rest of this section.
Yes. Clearly explain the behavior change and its impact.
The Tehuti
backup_version_data_optimization_errorOccurrenceRate now records per-partition instead of per-store. Dashboards showing error rate will see higher values when multiple partitions fail in the same optimization cycle. This aligns error counting with the existing success counting granularity (already per-partition).Tehuti sensors
backup_version_database_optimizationandbackup_version_data_optimization_errorare now registered lazily on first per-store recording rather than eagerly at server startup. Per-scrape monitoring consumers (Grafana / InGraphs) are unaffected. Consumers that capture the Tehuti metric handle once at startup must do the lookup per-scrape (or post-first-event) — the in-tree integration test was updated accordingly.