[server]heartbeat monitor version cleanup#2717
Conversation
There was a problem hiding this comment.
Pull request overview
Reduces noisy ERROR logs from HeartbeatMonitoringService during version cleanup by treating SET_FOLLOWER_MONITOR + missing version metadata as an expected race and emitting a WARN plus a counter metric instead.
Changes:
- Updates
HeartbeatMonitoringService.updateLagMonitor()to special-caseSET_FOLLOWER_MONITOR && version == nullwithWARN+ metric and early-return. - Adds a new Tehuti+OTel counter:
ingestion.heartbeat_monitoring.version_not_found_for_lag_monitor_count. - Adds a unit test covering the version-cleanup race scenario.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatMonitoringService.java | Special-cases null-version handling for follower monitor updates and records a new metric. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/HeartbeatMonitoringServiceStats.java | Adds a new metric state + recording method for “version not found for lag monitor”. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/HeartbeatMonitoringOtelMetricEntity.java | Defines the new OTel metric entity for the counter. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatMonitoringServiceTest.java | Adds a test validating the new behavior and metric emission. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "Version not found for resource: {} with trigger: {}. Likely version cleanup in progress, skipping lag monitor update.", | ||
| replicaId, | ||
| heartbeatLagMonitorAction.getTrigger()); | ||
| getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor(); |
There was a problem hiding this comment.
updateLagMonitor() now calls getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor() on the SET_FOLLOWER_MONITOR && version == null path. HeartbeatMonitoringService is constructed with a nullable HeartbeatMonitoringServiceStats in several unit tests (and potentially other call sites), so this can introduce a new NPE in a previously safe path. Consider either enforcing a non-null stats instance in the constructor (e.g., create a no-op implementation when null is passed) or adding a null-check before recording the metric so the service remains safe when stats are not wired.
| getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor(); | |
| if (getHeartbeatMonitoringServiceStats() != null) { | |
| getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor(); | |
| } |
ebbae3c to
4dcb3c4
Compare
…o WARN during version cleanup During version cleanup, the controller deletes the version from ZK as part of STANDBY->OFFLINE while SIT processes LEADER->STANDBY asynchronously from a single-threaded queue. By the time SIT calls updateLagMonitor, the version may already be gone, causing a noisy but harmless ERROR log. Downgrade to WARN for SET_FOLLOWER_MONITOR only; SET_LEADER_MONITOR null-version remains an ERROR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ps during version cleanup Track occurrences of the LEADER->STANDBY / version-cleanup race condition via a new OTel+Tehuti counter (heartbeat-monitor-version-not-found-for-lag-monitor). This ensures systematic ZK issues (e.g. connection loss causing widespread null versions) remain observable even after downgrading the log from ERROR to WARN. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etric entries HeartbeatMonitoringOtelMetricEntityTest and HeartbeatMonitoringTehutiMetricNameTest enumerate all metric entity/name enum values. Add the new HEARTBEAT_MONITORING_VERSION_NOT_FOUND_FOR_LAG_MONITOR_COUNT OTel entity and HEARTBEAT_MONITOR_VERSION_NOT_FOUND_FOR_LAG_MONITOR Tehuti name entries introduced in the previous commit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HeartbeatMonitoringService can be constructed with a null HeartbeatMonitoringServiceStats (as done in several unit tests). Add a null-check before calling recordVersionNotFoundForLagMonitor() on the SET_FOLLOWER_MONITOR + null-version path to prevent NPE in call sites that do not wire stats. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The new HEARTBEAT_MONITORING_VERSION_NOT_FOUND_FOR_LAG_MONITOR_COUNT metric entity added to HeartbeatMonitoringOtelMetricEntity (which is already registered in ServerMetricEntity.getMetricEntityEnumClasses()) brings the total unique metric entity count from 152 to 153. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
19f78a6 to
1227d43
Compare
| setOf(VENICE_CLUSTER_NAME, VENICE_HEARTBEAT_COMPONENT) | ||
| ), | ||
| HEARTBEAT_MONITORING_VERSION_NOT_FOUND_FOR_LAG_MONITOR_COUNT( | ||
| "ingestion.heartbeat_monitoring.version_not_found_for_lag_monitor_count", MetricType.COUNTER, MetricUnit.NUMBER, |
There was a problem hiding this comment.
Should we add this metric at all?
The PR description says the metric is there in case we want to alert later, with no current threshold ("recommend observing baseline after rollout before setting one"). That's adding observability speculatively. The WARN log introduced in the same change carries the same information (replicaId, trigger), and a Kusto query against InLogs can derive a count or rate-based alert without a new time series.
Counters that fire during expected behavior (the cleanup race is documented as normal) tend to clutter dashboards without driving action, and HEARTBEAT_MONITORING_EXCEPTION_COUNT already catches genuine breakages downstream.
Suggest dropping the metric and relying on the WARN log + log-based alerting if a concrete concern emerges.
If you'd rather keep it, please rename: version_not_found_for_lag_monitor_count reads as if it covers all actions, but the counter is only ever incremented for SET_FOLLOWER_MONITOR. Something like set_follower_monitor_version_not_found_count (and the matching Tehuti name) would be self-describing.
| "Version not found for resource: {} with trigger: {}. Likely version cleanup in progress, skipping lag monitor update.", | ||
| replicaId, | ||
| heartbeatLagMonitorAction.getTrigger()); | ||
| if (getHeartbeatMonitoringServiceStats() != null) { |
There was a problem hiding this comment.
This null check is unnecessary. heartbeatMonitoringServiceStats is declared final (line 80) and assigned unconditionally in the constructor (line 116) — it cannot be null in production. The other call sites in this file (lines 893, 903, 928, 939) all use the field directly without a null check.
Suggest matching that pattern:
getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor();
The getter is fine to keep — it's needed so the test can override it via Mockito.
| * SET_LEADER_MONITOR with a null version should still be treated as an unexpected error. | ||
| */ | ||
| @Test | ||
| public void testUpdateLagMonitorDuringVersionCleanup() { |
There was a problem hiding this comment.
The bulk of this test duplicates section 2 of testUpdateLagMonitor (lines 660-671), which already exercises SET_FOLLOWER_MONITOR with store != null && version == null and verifies addFollowerLagMonitor is never called. The only genuinely new coverage here is:
recordVersionNotFoundForLagMonitor()is called exactly once forSET_FOLLOWER_MONITOR- It is not called for
SET_LEADER_MONITOR
Consider folding that into the existing test by adding the HeartbeatMonitoringServiceStats mock and the two verify(stats, times(N)) assertions next to the cases already running. That keeps the SET_FOLLOWER_MONITOR + null-version coverage in one place — when behavior evolves there's only one test to update.
…move null-check, consolidate test - Drop the version_not_found_for_lag_monitor_count OTel counter and its Tehuti counterpart: the WARN log introduced in the same change carries equivalent information and log-based alerting is sufficient; a counter for expected race behavior would clutter dashboards without driving action. - Remove the null-check guard around the (now-deleted) metric call. heartbeatMonitoringServiceStats is final and assigned unconditionally in the constructor, matching the pattern of all other call sites in the file. - Delete testUpdateLagMonitorDuringVersionCleanup; its only unique coverage was the metric verify calls. The SET_FOLLOWER_MONITOR + null-version graceful-skip behavior is already exercised by section 2 of testUpdateLagMonitor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| return metadataRepository; | ||
| } | ||
|
|
||
| HeartbeatMonitoringServiceStats getHeartbeatMonitoringServiceStats() { |
Problem Statement
During version cleanup,
HeartbeatMonitoringServiceemits a high volume of noisy but harmlessERRORlog lines:ERROR [HeartbeatMonitoringService] Failed to get version for resource: _v11-115
with trigger: LEADER->STANDBY. Will not update lag monitor.
This is caused by a race condition between two parallel execution paths:
Path A — Helix state machine (fast, milliseconds):
LEADER→STANDBY→STANDBY→OFFLINE→OFFLINE→DROPPEDto tear down a version being deletedSTANDBY→OFFLINE, the controller deletes the version from ZooKeeperPath B — SIT consumer action queue (slow, single-threaded):
LEADER_TO_STANDBYactions and processes them one partition at a timeBy the time SIT processes
LEADER_TO_STANDBYfor the last partition and callsupdateLagMonitor()withSET_FOLLOWER_MONITOR, the version has already been deleted from ZK. The current code treats this as an unexpected error and logs atERRORlevel, even though the scenariois expected and harmless — the version is being deleted anyway.
Solution
Split the
version == nullhandling inupdateLagMonitor()into three distinct cases:SET_FOLLOWER_MONITORERROR+ returnWARN+ metric + returnSET_LEADER_MONITORERROR+ returnERROR+ return (unchanged)REMOVE_MONITORVersionImplSET_LEADER_MONITORis intentionally kept asERROR— a leader losing its version metadata is more serious and warrants investigation.A new OTel+Tehuti counter (
ingestion.heartbeat_monitoring.version_not_found_for_lag_monitor_count) is added to preserve observability. If this counter spikes beyond the expected baseline (e.g. due to ZK instability), it can be used to trigger an alert. No alert threshold isconfigured yet — recommend observing baseline after rollout before setting one.
Notice: please read
This change is introducing the logic that when SET_FOLLOWER_MONITOR, it is ignoring the null version at all. In production we should make sure that this is always safe to ignore HeartbeatMonitoringService without version
Observability Plan
The only observable changes are:
ERRORlog →WARNlog forSET_FOLLOWER_MONITOR+ null version during version cleanup