Skip to content

[da-vinci] Add OTel metrics to ParticipantStateTransitionStats#2728

Merged
m-nagarajan merged 9 commits intolinkedin:mainfrom
m-nagarajan:mnagaraj/addOtelForParticipantStateTransitionStats
May 6, 2026
Merged

[da-vinci] Add OTel metrics to ParticipantStateTransitionStats#2728
m-nagarajan merged 9 commits intolinkedin:mainfrom
m-nagarajan:mnagaraj/addOtelForParticipantStateTransitionStats

Conversation

@m-nagarajan
Copy link
Copy Markdown
Contributor

@m-nagarajan m-nagarajan commented Apr 12, 2026

Problem Statement

ParticipantStateTransitionStats has dynamic Tehuti AsyncGauge sensors for tracking partition state transitions (blocked threads, in-progress transitions, steady-state counts) but no OTel counterparts, making these metrics unavailable in OTel-based monitoring dashboards.

Solution

Add 3 OTel UP_DOWN_COUNTER metrics for partition state transition lifecycle:

  • partition.state.transition.blocked_thread_count — threads currently blocked on a state transition (FROM/TO state dimensions). Currently only OFFLINE→DROPPED blocks threads; the FROM/TO dimensions allow future transitions to use this metric without schema changes.
  • partition.state.transition.in_progress_count — partitions currently transitioning between Helix states (FROM/TO state dimensions). Records +1 on transition start, -1 on completion.
  • partition.state.active_count — partitions currently in a given Helix state: ERROR, STANDBY, LEADER (STATE dimension). Records -1 when a partition leaves a state, +1 when it enters.

All three use UP_DOWN_COUNTER (+1/-1 at call sites). Tehuti AsyncGauge sensors remain backed by AtomicInteger trackers — these are Tehuti-only artifacts that will be removed when Tehuti is retired.

Exception-safety fix in AbstractPartitionStateModel.executeStateTransition: wraps handler.run() in a try/catch — on exception calls a new trackStateTransitionFailed(from, to) method (decrements in-progress, leaves the already-decremented from steady-state alone, does NOT increment the to steady-state since the partition never reached it) and rethrows. Without this, a throwing transition handler would leak +1 on partition.state.transition.in_progress_count forever and drift the from steady-state monotonically down. Pre-existing bug in the Tehuti AsyncGauges; this PR widens it into OTel dashboards, so fixing it as part of the same change.

New dimension enums:

  • VeniceHelixFromState / VeniceHelixToState — 5 values each (OFFLINE, STANDBY, LEADER, ERROR, DROPPED) for transition metrics. Same values but different dimension keys.
  • VeniceHelixSteadyState — 3 values (ERROR, STANDBY, LEADER) for active partition count. Synced with ENABLED_STEADY_STATES via a guard test.

Code changes

  • Added new code behind a config.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

New tests:

  • ParticipantStateTransitionStatsOtelTest (8 tests): blocked thread UP_DOWN_COUNTER, in-progress transition lifecycle, steady-state lifecycle (STANDBY→LEADER) with mid-transition ordering assertion, ERROR steady state, guard test syncing VeniceHelixSteadyState with ENABLED_STEADY_STATES, defensive IllegalArgumentException catch in recordInProgressOtel surfaces via metric_record_failure, NPE prevention with OTel disabled and plain MetricsRepository.
  • ParticipantStateTransitionStatsTest (2 tests): existing started/completed Tehuti gauge updates + new trackStateTransitionFailed decrements in-progress only (no steady-state side effect for to).
  • ParticipantStateTransitionOtelMetricEntityTest: metric entity validation for all 3 metrics.
  • LeaderFollowerPartitionStateModelTest.testStateTransitionFailureRecordedOnHandlerException: verifies executeStateTransition calls trackStateTransitionFailed on handler exception (not trackStateTransitionCompleted) and the original exception propagates.
  • VeniceHelixFromStateTest, VeniceHelixToStateTest, VeniceHelixSteadyStateTest: dimension value validation. From/To also include a guard test asserting every HelixState value (except UNKNOWN) is representable, so the defensive valueOf catch stays unreachable in practice.

Drive-by stability fix:

  • ServerLoadStatsTest.testTehutiSensorsNoCrossContamination: added 1e-6 delta to two assertEquals calls on OccurrenceRate snapshots. Same flake pattern Yug fixed in [server][test] Fix flaky testTehutiCrossIsolation #2746 — clock advances between snapshot and re-read cause sub-microscopic drift on the rate value.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Copilot AI review requested due to automatic review settings April 12, 2026 20:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds OpenTelemetry (OTel) parity for ParticipantStateTransitionStats so Helix partition state transition lifecycle metrics are available in OTel-based monitoring, while keeping existing Tehuti AsyncGauge metrics intact.

Changes:

  • Introduced 3 new OTel UP_DOWN_COUNTER metric entities for blocked threads, in-progress transitions, and steady-state active partition counts.
  • Added new Helix-state dimension enums (from, to, steady) and registered corresponding dimension keys in VeniceMetricsDimensions.
  • Added unit tests covering metric entity definitions, lifecycle (+1/-1) behavior, enum/dimension mappings, and OTel-disabled safety.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

Show a summary per file
File Description
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java Adds new Helix-related dimension keys for OTel attributes.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceHelixFromState.java New FROM-state dimension enum for transition metrics.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceHelixToState.java New TO-state dimension enum for transition metrics.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceHelixSteadyState.java New steady-state dimension enum for active partition counts.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java Extends dimension-name format coverage for the new Helix dimensions.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceHelixFromStateTest.java Validates dimension enum values and mappings for FROM-state.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceHelixToStateTest.java Validates dimension enum values and mappings for TO-state.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceHelixSteadyStateTest.java Validates dimension enum values and mappings for steady-state.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ParticipantStateTransitionOtelMetricEntity.java Defines the 3 new OTel metric entities and their required dimensions.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ParticipantStateTransitionStats.java Records new OTel up-down counters alongside existing Tehuti trackers/sensors.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java Registers the new module metric entity enum in the server metric entity aggregation.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ParticipantStateTransitionOtelMetricEntityTest.java Asserts metric entity definitions (name/type/unit/description/dimensions).
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ParticipantStateTransitionStatsOtelTest.java Validates OTel recording semantics (+1/-1) and OTel-disabled / non-Venice repo safety.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java Updates expected aggregated metric entity count to include the 3 new metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan m-nagarajan added the needs-reviewer Looking for a reviewer to pick this up label Apr 20, 2026
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelForParticipantStateTransitionStats branch from 2272357 to 67d405c Compare April 30, 2026 21:38
Copilot AI review requested due to automatic review settings April 30, 2026 21:43
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Add 3 UP_DOWN_COUNTER metrics for partition state transition lifecycle:
- partition.state.transition.blocked_thread_count (FROM/TO state dimensions)
- partition.state.transition.in_progress_count (FROM/TO state dimensions)
- partition.state.active_count (STATE dimension)

All three use UP_DOWN_COUNTER (+1/-1 at call sites). Tehuti AsyncGauge
sensors remain backed by AtomicInteger trackers (Tehuti-only artifacts,
removable when Tehuti is retired).

New dimension enums: VeniceHelixFromState, VeniceHelixToState (5 values
each for transition metrics), VeniceHelixSteadyState (ERROR/STANDBY/LEADER
for active partition count, synced with ENABLED_STEADY_STATES via guard test).
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelForParticipantStateTransitionStats branch from c89c2a4 to 5fda736 Compare May 1, 2026 00:13
Copilot AI review requested due to automatic review settings May 1, 2026 00:19
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sushantmane sushantmane added requested-changes Reviewer requested changes and removed needs-reviewer Looking for a reviewer to pick this up labels May 4, 2026
Copy link
Copy Markdown
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with help from a few specialized review agents. Six actionable items below — mostly tightening tests and small contract clarifications. The OTel split, +1/-1 ordering, and thread safety all look correct.

Separate from the inline comments: the PR is currently CONFLICTING with main and will need a rebase before merge.

…nStats

# Conflicts:
#	clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java
#	internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java
#	internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java
Copilot AI review requested due to automatic review settings May 5, 2026 07:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…nStats

# Conflicts:
#	clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java
#	clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java
#	internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java
#	internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java
Copilot AI review requested due to automatic review settings May 5, 2026 22:21
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan m-nagarajan added review-addressed Author has addressed review comments and removed requested-changes Reviewer requested changes labels May 6, 2026
@m-nagarajan m-nagarajan merged commit a59a126 into linkedin:main May 6, 2026
106 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

review-addressed Author has addressed review comments

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants