Skip to content

[server]heartbeat monitor version cleanup#2717

Open
yininzhou wants to merge 6 commits into
linkedin:mainfrom
yininzhou:yininzho/heartbeat-monitor-version-cleanup
Open

[server]heartbeat monitor version cleanup#2717
yininzhou wants to merge 6 commits into
linkedin:mainfrom
yininzhou:yininzho/heartbeat-monitor-version-cleanup

Conversation

@yininzhou
Copy link
Copy Markdown

Problem Statement

During version cleanup, HeartbeatMonitoringService emits a high volume of noisy but harmless ERROR log lines:

ERROR [HeartbeatMonitoringService] Failed to get version for resource: _v11-115
with trigger: LEADER->STANDBY. Will not update lag monitor.

This is caused by a race condition between two parallel execution paths:

Path A — Helix state machine (fast, milliseconds):

  • Helix sends LEADER→STANDBYSTANDBY→OFFLINEOFFLINE→DROPPED to tear down a version being deleted
  • As part of STANDBY→OFFLINE, the controller deletes the version from ZooKeeper

Path B — SIT consumer action queue (slow, single-threaded):

  • The Store Ingestion Task enqueues LEADER_TO_STANDBY actions and processes them one partition at a time
  • For a store with 300+ partitions on one host, it can take ~90 seconds to process all partitions

By the time SIT processes LEADER_TO_STANDBY for the last partition and calls updateLagMonitor() with SET_FOLLOWER_MONITOR, the version has already been deleted from ZK. The current code treats this as an unexpected error and logs at ERROR level, even though the scenario
is expected and harmless — the version is being deleted anyway.

Solution

Split the version == null handling in updateLagMonitor() into three distinct cases:

Action version == null Before After
SET_FOLLOWER_MONITOR Expected during version cleanup race ERROR + return WARN + metric + return
SET_LEADER_MONITOR Unexpected ERROR + return ERROR + return (unchanged)
REMOVE_MONITOR Expected during deletion synthetic VersionImpl unchanged

SET_LEADER_MONITOR is intentionally kept as ERROR — a leader losing its version metadata is more serious and warrants investigation.

A new OTel+Tehuti counter (ingestion.heartbeat_monitoring.version_not_found_for_lag_monitor_count) is added to preserve observability. If this counter spikes beyond the expected baseline (e.g. due to ZK instability), it can be used to trigger an alert. No alert threshold is
configured yet — recommend observing baseline after rollout before setting one.

Notice: please read

This change is introducing the logic that when SET_FOLLOWER_MONITOR, it is ignoring the null version at all. In production we should make sure that this is always safe to ignore HeartbeatMonitoringService without version

Observability Plan

  1. Merge and deploy the fix
  2. Let the counter run for a week with no alert
  3. Look at the baseline in the dashboard (Grafana/InGraphs)
  4. Set the alert threshold at ~3-5x the observed baseline

The only observable changes are:

  • ERROR log → WARN log for SET_FOLLOWER_MONITOR + null version during version cleanup
  • New metric counter emitted in the same scenario

@yininzhou yininzhou marked this pull request as ready for review April 10, 2026 00:58
@yininzhou yininzhou changed the title Yininzho/heartbeat monitor version cleanup [server]heartbeat monitor version cleanup Apr 10, 2026
@sixpluszero sixpluszero requested a review from Copilot April 13, 2026 23:07
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reduces noisy ERROR logs from HeartbeatMonitoringService during version cleanup by treating SET_FOLLOWER_MONITOR + missing version metadata as an expected race and emitting a WARN plus a counter metric instead.

Changes:

  • Updates HeartbeatMonitoringService.updateLagMonitor() to special-case SET_FOLLOWER_MONITOR && version == null with WARN + metric and early-return.
  • Adds a new Tehuti+OTel counter: ingestion.heartbeat_monitoring.version_not_found_for_lag_monitor_count.
  • Adds a unit test covering the version-cleanup race scenario.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatMonitoringService.java Special-cases null-version handling for follower monitor updates and records a new metric.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/HeartbeatMonitoringServiceStats.java Adds a new metric state + recording method for “version not found for lag monitor”.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/HeartbeatMonitoringOtelMetricEntity.java Defines the new OTel metric entity for the counter.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatMonitoringServiceTest.java Adds a test validating the new behavior and metric emission.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

"Version not found for resource: {} with trigger: {}. Likely version cleanup in progress, skipping lag monitor update.",
replicaId,
heartbeatLagMonitorAction.getTrigger());
getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor();
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateLagMonitor() now calls getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor() on the SET_FOLLOWER_MONITOR && version == null path. HeartbeatMonitoringService is constructed with a nullable HeartbeatMonitoringServiceStats in several unit tests (and potentially other call sites), so this can introduce a new NPE in a previously safe path. Consider either enforcing a non-null stats instance in the constructor (e.g., create a no-op implementation when null is passed) or adding a null-check before recording the metric so the service remains safe when stats are not wired.

Suggested change
getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor();
if (getHeartbeatMonitoringServiceStats() != null) {
getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor();
}

Copilot uses AI. Check for mistakes.
@yininzhou yininzhou force-pushed the yininzho/heartbeat-monitor-version-cleanup branch from ebbae3c to 4dcb3c4 Compare April 14, 2026 21:11
Yining Zhou and others added 5 commits April 28, 2026 15:59
…o WARN during version cleanup

During version cleanup, the controller deletes the version from ZK as part of
STANDBY->OFFLINE while SIT processes LEADER->STANDBY asynchronously from a
single-threaded queue. By the time SIT calls updateLagMonitor, the version may
already be gone, causing a noisy but harmless ERROR log. Downgrade to WARN for
SET_FOLLOWER_MONITOR only; SET_LEADER_MONITOR null-version remains an ERROR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ps during version cleanup

Track occurrences of the LEADER->STANDBY / version-cleanup race condition via
a new OTel+Tehuti counter (heartbeat-monitor-version-not-found-for-lag-monitor).
This ensures systematic ZK issues (e.g. connection loss causing widespread null
versions) remain observable even after downgrading the log from ERROR to WARN.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…etric

entries

HeartbeatMonitoringOtelMetricEntityTest and
HeartbeatMonitoringTehutiMetricNameTest
enumerate all metric entity/name enum values. Add the new
HEARTBEAT_MONITORING_VERSION_NOT_FOUND_FOR_LAG_MONITOR_COUNT OTel entity and
HEARTBEAT_MONITOR_VERSION_NOT_FOUND_FOR_LAG_MONITOR Tehuti name entries
introduced
in the previous commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
HeartbeatMonitoringService can be constructed with a null
HeartbeatMonitoringServiceStats (as done in several unit tests). Add a
null-check before calling recordVersionNotFoundForLagMonitor() on the
SET_FOLLOWER_MONITOR + null-version path to prevent NPE in call sites
that do not wire stats.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The new HEARTBEAT_MONITORING_VERSION_NOT_FOUND_FOR_LAG_MONITOR_COUNT
metric entity added to HeartbeatMonitoringOtelMetricEntity (which is
already registered in ServerMetricEntity.getMetricEntityEnumClasses())
brings the total unique metric entity count from 152 to 153.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@yininzhou yininzhou force-pushed the yininzho/heartbeat-monitor-version-cleanup branch from 19f78a6 to 1227d43 Compare April 28, 2026 23:03
setOf(VENICE_CLUSTER_NAME, VENICE_HEARTBEAT_COMPONENT)
),
HEARTBEAT_MONITORING_VERSION_NOT_FOUND_FOR_LAG_MONITOR_COUNT(
"ingestion.heartbeat_monitoring.version_not_found_for_lag_monitor_count", MetricType.COUNTER, MetricUnit.NUMBER,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add this metric at all?

The PR description says the metric is there in case we want to alert later, with no current threshold ("recommend observing baseline after rollout before setting one"). That's adding observability speculatively. The WARN log introduced in the same change carries the same information (replicaId, trigger), and a Kusto query against InLogs can derive a count or rate-based alert without a new time series.

Counters that fire during expected behavior (the cleanup race is documented as normal) tend to clutter dashboards without driving action, and HEARTBEAT_MONITORING_EXCEPTION_COUNT already catches genuine breakages downstream.

Suggest dropping the metric and relying on the WARN log + log-based alerting if a concrete concern emerges.

If you'd rather keep it, please rename: version_not_found_for_lag_monitor_count reads as if it covers all actions, but the counter is only ever incremented for SET_FOLLOWER_MONITOR. Something like set_follower_monitor_version_not_found_count (and the matching Tehuti name) would be self-describing.

"Version not found for resource: {} with trigger: {}. Likely version cleanup in progress, skipping lag monitor update.",
replicaId,
heartbeatLagMonitorAction.getTrigger());
if (getHeartbeatMonitoringServiceStats() != null) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This null check is unnecessary. heartbeatMonitoringServiceStats is declared final (line 80) and assigned unconditionally in the constructor (line 116) — it cannot be null in production. The other call sites in this file (lines 893, 903, 928, 939) all use the field directly without a null check.

Suggest matching that pattern:

getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor();

The getter is fine to keep — it's needed so the test can override it via Mockito.

* SET_LEADER_MONITOR with a null version should still be treated as an unexpected error.
*/
@Test
public void testUpdateLagMonitorDuringVersionCleanup() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bulk of this test duplicates section 2 of testUpdateLagMonitor (lines 660-671), which already exercises SET_FOLLOWER_MONITOR with store != null && version == null and verifies addFollowerLagMonitor is never called. The only genuinely new coverage here is:

  • recordVersionNotFoundForLagMonitor() is called exactly once for SET_FOLLOWER_MONITOR
  • It is not called for SET_LEADER_MONITOR

Consider folding that into the existing test by adding the HeartbeatMonitoringServiceStats mock and the two verify(stats, times(N)) assertions next to the cases already running. That keeps the SET_FOLLOWER_MONITOR + null-version coverage in one place — when behavior evolves there's only one test to update.

…move

null-check, consolidate test

- Drop the version_not_found_for_lag_monitor_count OTel counter and its
  Tehuti counterpart: the WARN log introduced in the same change carries
  equivalent information and log-based alerting is sufficient; a counter
  for expected race behavior would clutter dashboards without driving
  action.
- Remove the null-check guard around the (now-deleted) metric call.
  heartbeatMonitoringServiceStats is final and assigned unconditionally
  in the constructor, matching the pattern of all other call sites in
  the file.
- Delete testUpdateLagMonitorDuringVersionCleanup; its only unique
  coverage was the metric verify calls. The SET_FOLLOWER_MONITOR +
  null-version graceful-skip behavior is already exercised by section 2
  of testUpdateLagMonitor.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
return metadataRepository;
}

HeartbeatMonitoringServiceStats getHeartbeatMonitoringServiceStats() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line seems useless.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants