[da-vinci] Add OTel metrics to StuckConsumerRepairStats #2723
Merged
m-nagarajan merged 3 commits on May 6, 2026
Pull request overview
Adds OpenTelemetry (OTel) COUNTER metrics for the stuck PubSub consumer detection/repair lifecycle so these signals are available in OTel-based dashboards, while preserving existing Tehuti sensors.
Changes:
- Introduces `StuckConsumerRepairOtelMetricEntity` with 3 new `ingestion.pubsub.consumer.stuck.*` counters (cluster-name dimensioned).
- Refactors `StuckConsumerRepairStats` to use `MetricEntityStateBase` (joint Tehuti + OTel) and requires `clusterName` in the constructor.
- Wires the new metric entity into `ServerMetricEntity` and adds unit tests validating OTel entities, Tehuti names, and counter accumulation.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StuckConsumerRepairStats.java | Switches stuck-consumer stats recording to joint Tehuti+OTel metric state with cluster dimension setup. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/StuckConsumerRepairOtelMetricEntity.java | Defines 3 new OTel COUNTER metric entities under ingestion.pubsub.consumer.stuck.*. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Registers the new metric entity enum so it’s included in aggregated server metric entities. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/AggKafkaConsumerService.java | Passes serverConfig.getClusterName() into the stats constructor when stuck-consumer repair is enabled. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/StuckConsumerRepairStatsTest.java | Adds unit tests for OTel counter accumulation, Tehuti sensor presence, and NPE safety when OTel is disabled. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/StuckConsumerRepairOtelMetricEntityTest.java | Validates the OTel metric entity definitions (name/type/unit/description/dimensions). |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/StuckConsumerRepairTehutiMetricNameTest.java | Validates Tehuti metric name enum mappings. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates expected aggregated server metric entity count (+3). |
Add 3 COUNTER metrics under ingestion.pubsub.consumer.stuck.* namespace:
- detected_count: scans that detected a stuck consumer
- task_repaired_count: ingestion tasks killed to unblock
- unresolved_count: stuck consumers found with no fixable task

Joint Tehuti+OTel API via MetricEntityStateBase. Singleton class with CLUSTER_NAME dimension only. clusterName added to constructor, passed from AggKafkaConsumerService via serverConfig.
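The joint-recording idea described in this commit can be sketched with plain JDK types. This is an illustration under stated assumptions, not the project's implementation: `JointCounterSketch` and `StuckConsumerRepairStatsSketch` are hypothetical names, a `LongAdder` stands in for both the Tehuti sensor and the OTel counter, and the `otelEnabled` flag is an assumed way to model running with OTel turned off. The real code goes through `MetricEntityStateBase`, whose API is not shown here.

```java
import java.util.Objects;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical joint recorder: one call updates the legacy sink and, if enabled, the OTel sink. */
final class JointCounterSketch {
  private final LongAdder legacyOccurrences = new LongAdder(); // stands in for a Tehuti sensor
  private final LongAdder otelCounter;                         // null when OTel is disabled

  JointCounterSketch(boolean otelEnabled) {
    this.otelCounter = otelEnabled ? new LongAdder() : null;
  }

  /** Records one occurrence; must not throw when the OTel side is absent. */
  void record() {
    legacyOccurrences.increment();
    if (otelCounter != null) {
      otelCounter.increment();
    }
  }

  long legacyCount() {
    return legacyOccurrences.sum();
  }

  long otelCount() {
    return otelCounter == null ? 0L : otelCounter.sum();
  }
}

/** Hypothetical stats holder: cluster name is required up front and is the only dimension. */
final class StuckConsumerRepairStatsSketch {
  private final String clusterName;
  private final JointCounterSketch detected;
  private final JointCounterSketch taskRepaired;
  private final JointCounterSketch unresolved;

  StuckConsumerRepairStatsSketch(String clusterName, boolean otelEnabled) {
    this.clusterName = Objects.requireNonNull(clusterName, "clusterName");
    this.detected = new JointCounterSketch(otelEnabled);
    this.taskRepaired = new JointCounterSketch(otelEnabled);
    this.unresolved = new JointCounterSketch(otelEnabled);
  }

  String clusterName() { return clusterName; }

  void recordDetected() { detected.record(); }
  void recordTaskRepaired() { taskRepaired.record(); }
  void recordUnresolved() { unresolved.record(); }

  JointCounterSketch detected() { return detected; }
  JointCounterSketch taskRepaired() { return taskRepaired; }
  JointCounterSketch unresolved() { return unresolved; }
}
```

In this sketch a caller would construct the stats object once with the cluster name and call the `record*` methods from the repair loop; with `otelEnabled` set to `false`, the legacy counts still advance and the OTel side is simply skipped.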
2730809 to d4e2673
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.
sushantmane previously approved these changes on May 4, 2026
…rStats
# Conflicts:
# clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java
# clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java
sushantmane approved these changes on May 6, 2026
Problem Statement
`StuckConsumerRepairStats` has 3 Tehuti OccurrenceRate sensors for the stuck consumer detection and repair lifecycle but no OTel counterparts, making these metrics unavailable in OTel-based monitoring dashboards.
Solution
Add 3 OTel COUNTER metrics under the `ingestion.pubsub.consumer.stuck.*` namespace using the joint Tehuti+OTel API (`MetricEntityStateBase`):
- `ingestion.pubsub.consumer.stuck.detected_count`: scans that detected a stuck PubSub consumer
- `ingestion.pubsub.consumer.stuck.task_repaired_count`: ingestion tasks killed to unblock a stuck consumer
- `ingestion.pubsub.consumer.stuck.unresolved_count`: stuck consumers found with no fixable task identified

This is a singleton stats class with `CLUSTER_NAME` as the only dimension. `clusterName` was added to the constructor, passed from `AggKafkaConsumerService` via `serverConfig.getClusterName()`.
Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
- … `synchronized`, `RWLock`) are used where needed.
- … `ConcurrentHashMap`, `CopyOnWriteArrayList`).
How was this PR tested?
New tests:
- `StuckConsumerRepairStatsTest` (6 tests): OTel counter accumulation for all 3 metrics, Tehuti sensor registration + recording, NPE prevention with OTel disabled and plain MetricsRepository.
- `StuckConsumerRepairOtelMetricEntityTest`: metric entity validation for all 3 metrics.
- `StuckConsumerRepairTehutiMetricNameTest`: Tehuti name validation for all 3 enum values.
Does this PR introduce any user-facing or breaking changes?