[da-vinci] Add OTel metrics to AggVersionedDaVinciRecordTransformerStats#2725
Conversation
There was a problem hiding this comment.
Pull request overview
Adds OpenTelemetry (OTel) coverage for DaVinci record transformer operation stats so they’re available in OTel-based dashboards, aligning this component with the existing metric-entity + dimension patterns used across DaVinci/server stats.
Changes:
- Introduces a new OTel dimension
VENICE_RECORD_TRANSFORMER_OPERATION(PUT/DELETE) and its enum value type. - Adds two new OTel metric entities:
record_transformer.latency(histogram) andrecord_transformer.error_count(counter), and wires them intoAggVersionedDaVinciRecordTransformerStats. - Registers the new metric entity enum in server metric aggregation and adds unit tests validating dimension naming and metric entity definitions/recording.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java | Adds the new VENICE_RECORD_TRANSFORMER_OPERATION dimension definition. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceRecordTransformerOperation.java | Introduces PUT/DELETE dimension enum implementing VeniceDimensionInterface. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java | Extends dimension-name format tests to cover the new dimension. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceRecordTransformerOperationTest.java | Validates dimension value mappings for PUT/DELETE. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/DaVinciRecordTransformerOtelMetricEntity.java | Defines OTel metric entities for record transformer latency and error count. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Registers the new metric entity enum so it’s part of server metric aggregation. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/AggVersionedDaVinciRecordTransformerStats.java | Records OTel histogram/counter at the call site with per-store cached metric state. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates expected aggregated metric entity count after adding new entities. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/DaVinciRecordTransformerOtelMetricEntityTest.java | Verifies metric entity definitions (name/type/unit/dimensions) for the new entities. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/AggVersionedDaVinciRecordTransformerStatsOtelTest.java | Adds OTel recording tests for latency/errors and validates operation-dimension isolation + NPE prevention. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add 2 OTel metrics for DaVinci record transformer: - record_transformer.latency (HISTOGRAM with PUT/DELETE operation dimension) - record_transformer.error_count (COUNTER with PUT/DELETE operation dimension) Separate Tehuti+OTel API: Tehuti uses the Reporter layer (AsyncGauge polling), OTel records at the call point in AggVersionedDaVinciRecordTransformerStats. Per-store VeniceConcurrentHashMap maps for latency and error count metrics, cleaned up in handleStoreDeleted.
- Add emitOtelMetrics guard to skip OTel recording when disabled - Fix Javadoc: reference AggVersionedDaVinciRecordTransformerStats (not Stats class) - Merge getOrCreateLatencyMetric/getOrCreateErrorCountMetric into single getOrCreateMetric
1ff7662 to
2818990
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sushantmane
left a comment
There was a problem hiding this comment.
Reviewed with help from a few specialized review agents. Four actionable items below — two important (matching the established AggVersionedDIVStats short-circuit pattern, and a multi-store mock that silently collapses), two suggestions. The HISTOGRAM+COUNTER split, separate Tehuti+OTel API choice, and handleStoreDeleted cleanup all look correct.
Separate from the inline comments: the PR is currently CONFLICTING with main and will need a rebase before merge.
mnagaraj/addOtelMetricsForAggVersionedDaVinciRecordTransformerStats # Conflicts: # clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java # clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java # internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java # internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java
Problem Statement
DaVinciRecordTransformerStatshas 4 Tehuti sensors for record transformer operations (put/delete latency and error counts) but no OTel counterparts, making these metrics unavailable in OTel-based monitoring dashboards.Solution
Add 2 OTel metrics to
AggVersionedDaVinciRecordTransformerStatswith aVENICE_RECORD_TRANSFORMER_OPERATIONdimension (PUT/DELETE):record_transformer.latency(HISTOGRAM) — DaVinci record transformer operation latencyrecord_transformer.error_count(COUNTER) — DaVinci record transformer operation error countUses the separate Tehuti+OTel API: Tehuti metrics are managed by the Reporter layer (
DaVinciRecordTransformerStatsReporter) with AsyncGauge polling, while OTel records directly at the call point inAggVersionedDaVinciRecordTransformerStats. This follows the established pattern for versioned stats where Tehuti and OTel have fundamentally different recording paths.Key design decisions:
VeniceConcurrentHashMapmaps for latency and error count metrics, cleaned up inhandleStoreDeleted.private final boolean emitOtelMetricscached at construction;recordOtelLatencyandrecordOtelErrorCountearly-return when OTel is disabled, so the per-store maps are never populated in non-OTel deployments. Mirrors the establishedAggVersionedDIVStatspattern.@VisibleForTestingaccessors (hasLatencyMetricFor,latencyStoreCount, etc.) let tests pin the cleanup contract onhandleStoreDeletedwithout exposing the internal maps.Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
New tests:
AggVersionedDaVinciRecordTransformerStatsOtelTest(9 tests): OTel histogram for put/delete latency, OTel counter for put/delete errors, operation dimension isolation, multi-store isolation (validates per-store entries stay distinct via thedoAnswerStore mock soAbstractVeniceAggVersionedStatsdoesn't collapse them),handleStoreDeletedcleanup contract (asserts per-store maps go from 1 entry → 0 → 1 across record/delete/re-record), NPE prevention with OTel disabled and plain MetricsRepository.DaVinciRecordTransformerOtelMetricEntityTest: metric entity validation for both metrics.VeniceRecordTransformerOperationTest: dimension value validation (PUT/DELETE).Does this PR introduce any user-facing or breaking changes?