Skip to content

[da-vinci] Add OTel metrics to StuckConsumerRepairStats#2723

Merged
m-nagarajan merged 3 commits into
linkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForStuckConsumerRepairStats
May 6, 2026
Merged

[da-vinci] Add OTel metrics to StuckConsumerRepairStats#2723
m-nagarajan merged 3 commits into
linkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForStuckConsumerRepairStats

Conversation

@m-nagarajan
Copy link
Copy Markdown
Contributor

Problem Statement

StuckConsumerRepairStats has 3 Tehuti OccurrenceRate sensors for the stuck consumer detection and repair lifecycle but no OTel counterparts, making these metrics unavailable in OTel-based monitoring dashboards.

Solution

Add 3 OTel COUNTER metrics under the ingestion.pubsub.consumer.stuck.* namespace using the joint Tehuti+OTel API (MetricEntityStateBase):

  • ingestion.pubsub.consumer.stuck.detected_count — scans that detected a stuck PubSub consumer
  • ingestion.pubsub.consumer.stuck.task_repaired_count — ingestion tasks killed to unblock a stuck consumer
  • ingestion.pubsub.consumer.stuck.unresolved_count — stuck consumers found with no fixable task identified

This is a singleton stats class with CLUSTER_NAME as the only dimension. clusterName was added to the constructor, passed from AggKafkaConsumerService via serverConfig.getClusterName().

Code changes

  • Added new code behind a config.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

New tests:

  • StuckConsumerRepairStatsTest (6 tests): OTel counter accumulation for all 3 metrics, Tehuti sensor registration + recording, NPE prevention with OTel disabled and plain MetricsRepository.
  • StuckConsumerRepairOtelMetricEntityTest: metric entity validation for all 3 metrics.
  • StuckConsumerRepairTehutiMetricNameTest: Tehuti name validation for all 3 enum values.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Copilot AI review requested due to automatic review settings April 11, 2026 22:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds OpenTelemetry (OTel) COUNTER metrics for the stuck PubSub consumer detection/repair lifecycle so these signals are available in OTel-based dashboards, while preserving existing Tehuti sensors.

Changes:

  • Introduces StuckConsumerRepairOtelMetricEntity with 3 new ingestion.pubsub.consumer.stuck.* counters (cluster-name dimensioned).
  • Refactors StuckConsumerRepairStats to use MetricEntityStateBase (joint Tehuti + OTel) and requires clusterName in the constructor.
  • Wires the new metric entity into ServerMetricEntity and adds unit tests validating OTel entities, Tehuti names, and counter accumulation.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StuckConsumerRepairStats.java Switches stuck-consumer stats recording to joint Tehuti+OTel metric state with cluster dimension setup.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StuckConsumerRepairOtelMetricEntity.java Defines 3 new OTel COUNTER metric entities under ingestion.pubsub.consumer.stuck.*.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java Registers the new metric entity enum so it’s included in aggregated server metric entities.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/AggKafkaConsumerService.java Passes serverConfig.getClusterName() into the stats constructor when stuck-consumer repair is enabled.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/StuckConsumerRepairStatsTest.java Adds unit tests for OTel counter accumulation, Tehuti sensor presence, and NPE safety when OTel is disabled.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/StuckConsumerRepairOtelMetricEntityTest.java Validates the OTel metric entity definitions (name/type/unit/description/dimensions).
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/StuckConsumerRepairTehutiMetricNameTest.java Validates Tehuti metric name enum mappings.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java Updates expected aggregated server metric entity count (+3).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan m-nagarajan added the needs-reviewer Looking for a reviewer to pick this up label Apr 20, 2026
Add 3 COUNTER metrics under ingestion.pubsub.consumer.stuck.* namespace:
- detected_count: scans that detected a stuck consumer
- task_repaired_count: ingestion tasks killed to unblock
- unresolved_count: stuck consumers found with no fixable task

Joint Tehuti+OTel API via MetricEntityStateBase. Singleton class with
CLUSTER_NAME dimension only. clusterName added to constructor, passed
from AggKafkaConsumerService via serverConfig.
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelMetricsForStuckConsumerRepairStats branch from 2730809 to d4e2673 Compare April 30, 2026 23:20
Copilot AI review requested due to automatic review settings April 30, 2026 23:25
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sushantmane
sushantmane previously approved these changes May 4, 2026
@sushantmane sushantmane added approved PR has been approved and removed needs-reviewer Looking for a reviewer to pick this up labels May 4, 2026
…rStats

# Conflicts:
#	clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java
#	clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java
@m-nagarajan m-nagarajan merged commit 07bb5d5 into linkedin:main May 6, 2026
106 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved PR has been approved

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants