[da-vinci] Add OTel metrics to ParticipantStateTransitionStats#2728
Conversation
There was a problem hiding this comment.
Pull request overview
Adds OpenTelemetry (OTel) parity for ParticipantStateTransitionStats so Helix partition state transition lifecycle metrics are available in OTel-based monitoring, while keeping existing Tehuti AsyncGauge metrics intact.
Changes:
- Introduced 3 new OTel
UP_DOWN_COUNTERmetric entities for blocked threads, in-progress transitions, and steady-state active partition counts. - Added new Helix-state dimension enums (
from,to,steady) and registered corresponding dimension keys inVeniceMetricsDimensions. - Added unit tests covering metric entity definitions, lifecycle (+1/-1) behavior, enum/dimension mappings, and OTel-disabled safety.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java | Adds new Helix-related dimension keys for OTel attributes. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceHelixFromState.java | New FROM-state dimension enum for transition metrics. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceHelixToState.java | New TO-state dimension enum for transition metrics. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceHelixSteadyState.java | New steady-state dimension enum for active partition counts. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java | Extends dimension-name format coverage for the new Helix dimensions. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceHelixFromStateTest.java | Validates dimension enum values and mappings for FROM-state. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceHelixToStateTest.java | Validates dimension enum values and mappings for TO-state. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceHelixSteadyStateTest.java | Validates dimension enum values and mappings for steady-state. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ParticipantStateTransitionOtelMetricEntity.java | Defines the 3 new OTel metric entities and their required dimensions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ParticipantStateTransitionStats.java | Records new OTel up-down counters alongside existing Tehuti trackers/sensors. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Registers the new module metric entity enum in the server metric entity aggregation. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ParticipantStateTransitionOtelMetricEntityTest.java | Asserts metric entity definitions (name/type/unit/description/dimensions). |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ParticipantStateTransitionStatsOtelTest.java | Validates OTel recording semantics (+1/-1) and OTel-disabled / non-Venice repo safety. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates expected aggregated metric entity count to include the 3 new metrics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
2272357 to
67d405c
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add 3 UP_DOWN_COUNTER metrics for partition state transition lifecycle: - partition.state.transition.blocked_thread_count (FROM/TO state dimensions) - partition.state.transition.in_progress_count (FROM/TO state dimensions) - partition.state.active_count (STATE dimension) All three use UP_DOWN_COUNTER (+1/-1 at call sites). Tehuti AsyncGauge sensors remain backed by AtomicInteger trackers (Tehuti-only artifacts, removable when Tehuti is retired). New dimension enums: VeniceHelixFromState, VeniceHelixToState (5 values each for transition metrics), VeniceHelixSteadyState (ERROR/STANDBY/LEADER for active partition count, synced with ENABLED_STEADY_STATES via guard test).
c89c2a4 to
5fda736
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
sushantmane
left a comment
There was a problem hiding this comment.
Reviewed with help from a few specialized review agents. Six actionable items below — mostly tightening tests and small contract clarifications. The OTel split, +1/-1 ordering, and thread safety all look correct.
Separate from the inline comments: the PR is currently CONFLICTING with main and will need a rebase before merge.
…nStats # Conflicts: # clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java # internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java # internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…nStats # Conflicts: # clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java # clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java # internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java # internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Problem Statement
ParticipantStateTransitionStatshas dynamic Tehuti AsyncGauge sensors for tracking partition state transitions (blocked threads, in-progress transitions, steady-state counts) but no OTel counterparts, making these metrics unavailable in OTel-based monitoring dashboards.Solution
Add 3 OTel UP_DOWN_COUNTER metrics for partition state transition lifecycle:
partition.state.transition.blocked_thread_count— threads currently blocked on a state transition (FROM/TO state dimensions). Currently only OFFLINE→DROPPED blocks threads; the FROM/TO dimensions allow future transitions to use this metric without schema changes.partition.state.transition.in_progress_count— partitions currently transitioning between Helix states (FROM/TO state dimensions). Records +1 on transition start, -1 on completion.partition.state.active_count— partitions currently in a given Helix state: ERROR, STANDBY, LEADER (STATE dimension). Records -1 when a partition leaves a state, +1 when it enters.All three use UP_DOWN_COUNTER (+1/-1 at call sites). Tehuti AsyncGauge sensors remain backed by AtomicInteger trackers — these are Tehuti-only artifacts that will be removed when Tehuti is retired.
Exception-safety fix in
AbstractPartitionStateModel.executeStateTransition: wrapshandler.run()in a try/catch — on exception calls a newtrackStateTransitionFailed(from, to)method (decrements in-progress, leaves the already-decrementedfromsteady-state alone, does NOT increment thetosteady-state since the partition never reached it) and rethrows. Without this, a throwing transition handler would leak +1 onpartition.state.transition.in_progress_countforever and drift thefromsteady-state monotonically down. Pre-existing bug in the Tehuti AsyncGauges; this PR widens it into OTel dashboards, so fixing it as part of the same change.New dimension enums:
VeniceHelixFromState/VeniceHelixToState— 5 values each (OFFLINE, STANDBY, LEADER, ERROR, DROPPED) for transition metrics. Same values but different dimension keys.VeniceHelixSteadyState— 3 values (ERROR, STANDBY, LEADER) for active partition count. Synced withENABLED_STEADY_STATESvia a guard test.Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
New tests:
ParticipantStateTransitionStatsOtelTest(8 tests): blocked thread UP_DOWN_COUNTER, in-progress transition lifecycle, steady-state lifecycle (STANDBY→LEADER) with mid-transition ordering assertion, ERROR steady state, guard test syncingVeniceHelixSteadyStatewithENABLED_STEADY_STATES, defensiveIllegalArgumentExceptioncatch inrecordInProgressOtelsurfaces viametric_record_failure, NPE prevention with OTel disabled and plainMetricsRepository.ParticipantStateTransitionStatsTest(2 tests): existing started/completed Tehuti gauge updates + newtrackStateTransitionFaileddecrements in-progress only (no steady-state side effect forto).ParticipantStateTransitionOtelMetricEntityTest: metric entity validation for all 3 metrics.LeaderFollowerPartitionStateModelTest.testStateTransitionFailureRecordedOnHandlerException: verifiesexecuteStateTransitioncallstrackStateTransitionFailedon handler exception (nottrackStateTransitionCompleted) and the original exception propagates.VeniceHelixFromStateTest,VeniceHelixToStateTest,VeniceHelixSteadyStateTest: dimension value validation.From/Toalso include a guard test asserting everyHelixStatevalue (exceptUNKNOWN) is representable, so the defensivevalueOfcatch stays unreachable in practice.Drive-by stability fix:
ServerLoadStatsTest.testTehutiSensorsNoCrossContamination: added1e-6delta to twoassertEqualscalls onOccurrenceRatesnapshots. Same flake pattern Yug fixed in [server][test] Fix flaky testTehutiCrossIsolation #2746 — clock advances between snapshot and re-read cause sub-microscopic drift on the rate value.Does this PR introduce any user-facing or breaking changes?