[server]heartbeat monitor version cleanup by yininzhou · Pull Request #2717 · linkedin/venice

yininzhou · 2026-04-10T00:54:06Z

Problem Statement

During version cleanup, HeartbeatMonitoringService emits a high volume of noisy but harmless ERROR log lines:

ERROR [HeartbeatMonitoringService] Failed to get version for resource: _v11-115
with trigger: LEADER->STANDBY. Will not update lag monitor.

This is caused by a race condition between two parallel execution paths:

Path A — Helix state machine (fast, milliseconds):

Helix sends LEADER→STANDBY → STANDBY→OFFLINE → OFFLINE→DROPPED to tear down a version being deleted
As part of STANDBY→OFFLINE, the controller deletes the version from ZooKeeper

Path B — SIT consumer action queue (slow, single-threaded):

The Store Ingestion Task enqueues LEADER_TO_STANDBY actions and processes them one partition at a time
For a store with 300+ partitions on one host, it can take ~90 seconds to process all partitions

By the time SIT processes LEADER_TO_STANDBY for the last partition and calls updateLagMonitor() with SET_FOLLOWER_MONITOR, the version has already been deleted from ZK. The current code treats this as an unexpected error and logs at ERROR level, even though the scenario
is expected and harmless — the version is being deleted anyway.

Solution

Split the version == null handling in updateLagMonitor() into three distinct cases:

Action	version == null	Before	After
`SET_FOLLOWER_MONITOR`	Expected during version cleanup race	`ERROR` + return	`WARN` + metric + return
`SET_LEADER_MONITOR`	Unexpected	`ERROR` + return	`ERROR` + return (unchanged)
`REMOVE_MONITOR`	Expected during deletion	synthetic `VersionImpl`	unchanged

SET_LEADER_MONITOR is intentionally kept as ERROR — a leader losing its version metadata is more serious and warrants investigation.

A new OTel+Tehuti counter (ingestion.heartbeat_monitoring.version_not_found_for_lag_monitor_count) is added to preserve observability. If this counter spikes beyond the expected baseline (e.g. due to ZK instability), it can be used to trigger an alert. No alert threshold is
configured yet — recommend observing baseline after rollout before setting one.

Notice: please read

This change is introducing the logic that when SET_FOLLOWER_MONITOR, it is ignoring the null version at all. In production we should make sure that this is always safe to ignore HeartbeatMonitoringService without version

Observability Plan

Merge and deploy the fix
Let the counter run for a week with no alert
Look at the baseline in the dashboard (Grafana/InGraphs)
Set the alert threshold at ~3-5x the observed baseline

The only observable changes are:

ERROR log → WARN log for SET_FOLLOWER_MONITOR + null version during version cleanup
New metric counter emitted in the same scenario

Copilot

Pull request overview

Reduces noisy ERROR logs from HeartbeatMonitoringService during version cleanup by treating SET_FOLLOWER_MONITOR + missing version metadata as an expected race and emitting a WARN plus a counter metric instead.

Changes:

Updates HeartbeatMonitoringService.updateLagMonitor() to special-case SET_FOLLOWER_MONITOR && version == null with WARN + metric and early-return.
Adds a new Tehuti+OTel counter: ingestion.heartbeat_monitoring.version_not_found_for_lag_monitor_count.
Adds a unit test covering the version-cleanup race scenario.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatMonitoringService.java	Special-cases null-version handling for follower monitor updates and records a new metric.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/HeartbeatMonitoringServiceStats.java	Adds a new metric state + recording method for “version not found for lag monitor”.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/HeartbeatMonitoringOtelMetricEntity.java	Defines the new OTel metric entity for the counter.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatMonitoringServiceTest.java	Adds a test validating the new behavior and metric emission.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-13T23:10:24Z

+              "Version not found for resource: {} with trigger: {}. Likely version cleanup in progress, skipping lag monitor update.",
+              replicaId,
+              heartbeatLagMonitorAction.getTrigger());
+          getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor();


updateLagMonitor() now calls getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor() on the SET_FOLLOWER_MONITOR && version == null path. HeartbeatMonitoringService is constructed with a nullable HeartbeatMonitoringServiceStats in several unit tests (and potentially other call sites), so this can introduce a new NPE in a previously safe path. Consider either enforcing a non-null stats instance in the constructor (e.g., create a no-op implementation when null is passed) or adding a null-check before recording the metric so the service remains safe when stats are not wired.

Suggested change

getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor();

if (getHeartbeatMonitoringServiceStats() != null) {

getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor();

}

…o WARN during version cleanup During version cleanup, the controller deletes the version from ZK as part of STANDBY->OFFLINE while SIT processes LEADER->STANDBY asynchronously from a single-threaded queue. By the time SIT calls updateLagMonitor, the version may already be gone, causing a noisy but harmless ERROR log. Downgrade to WARN for SET_FOLLOWER_MONITOR only; SET_LEADER_MONITOR null-version remains an ERROR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ps during version cleanup Track occurrences of the LEADER->STANDBY / version-cleanup race condition via a new OTel+Tehuti counter (heartbeat-monitor-version-not-found-for-lag-monitor). This ensures systematic ZK issues (e.g. connection loss causing widespread null versions) remain observable even after downgrading the log from ERROR to WARN. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…etric entries HeartbeatMonitoringOtelMetricEntityTest and HeartbeatMonitoringTehutiMetricNameTest enumerate all metric entity/name enum values. Add the new HEARTBEAT_MONITORING_VERSION_NOT_FOUND_FOR_LAG_MONITOR_COUNT OTel entity and HEARTBEAT_MONITOR_VERSION_NOT_FOUND_FOR_LAG_MONITOR Tehuti name entries introduced in the previous commit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

HeartbeatMonitoringService can be constructed with a null HeartbeatMonitoringServiceStats (as done in several unit tests). Add a null-check before calling recordVersionNotFoundForLagMonitor() on the SET_FOLLOWER_MONITOR + null-version path to prevent NPE in call sites that do not wire stats. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The new HEARTBEAT_MONITORING_VERSION_NOT_FOUND_FOR_LAG_MONITOR_COUNT metric entity added to HeartbeatMonitoringOtelMetricEntity (which is already registered in ServerMetricEntity.getMetricEntityEnumClasses()) brings the total unique metric entity count from 152 to 153. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

haoxu07 · 2026-04-28T23:17:54Z

      setOf(VENICE_CLUSTER_NAME, VENICE_HEARTBEAT_COMPONENT)
+  ),
+  HEARTBEAT_MONITORING_VERSION_NOT_FOUND_FOR_LAG_MONITOR_COUNT(
+      "ingestion.heartbeat_monitoring.version_not_found_for_lag_monitor_count", MetricType.COUNTER, MetricUnit.NUMBER,


Should we add this metric at all?

The PR description says the metric is there in case we want to alert later, with no current threshold ("recommend observing baseline after rollout before setting one"). That's adding observability speculatively. The WARN log introduced in the same change carries the same information (replicaId, trigger), and a Kusto query against InLogs can derive a count or rate-based alert without a new time series.

Counters that fire during expected behavior (the cleanup race is documented as normal) tend to clutter dashboards without driving action, and HEARTBEAT_MONITORING_EXCEPTION_COUNT already catches genuine breakages downstream.

Suggest dropping the metric and relying on the WARN log + log-based alerting if a concrete concern emerges.

If you'd rather keep it, please rename: version_not_found_for_lag_monitor_count reads as if it covers all actions, but the counter is only ever incremented for SET_FOLLOWER_MONITOR. Something like set_follower_monitor_version_not_found_count (and the matching Tehuti name) would be self-describing.

haoxu07 · 2026-04-28T23:17:58Z

+              "Version not found for resource: {} with trigger: {}. Likely version cleanup in progress, skipping lag monitor update.",
+              replicaId,
+              heartbeatLagMonitorAction.getTrigger());
+          if (getHeartbeatMonitoringServiceStats() != null) {


This null check is unnecessary. heartbeatMonitoringServiceStats is declared final (line 80) and assigned unconditionally in the constructor (line 116) — it cannot be null in production. The other call sites in this file (lines 893, 903, 928, 939) all use the field directly without a null check.

Suggest matching that pattern:

getHeartbeatMonitoringServiceStats().recordVersionNotFoundForLagMonitor();

The getter is fine to keep — it's needed so the test can override it via Mockito.

haoxu07 · 2026-04-28T23:18:02Z

+   * SET_LEADER_MONITOR with a null version should still be treated as an unexpected error.
+   */
+  @Test
+  public void testUpdateLagMonitorDuringVersionCleanup() {


The bulk of this test duplicates section 2 of testUpdateLagMonitor (lines 660-671), which already exercises SET_FOLLOWER_MONITOR with store != null && version == null and verifies addFollowerLagMonitor is never called. The only genuinely new coverage here is:

recordVersionNotFoundForLagMonitor() is called exactly once for SET_FOLLOWER_MONITOR

It is not called for SET_LEADER_MONITOR

Consider folding that into the existing test by adding the HeartbeatMonitoringServiceStats mock and the two verify(stats, times(N)) assertions next to the cases already running. That keeps the SET_FOLLOWER_MONITOR + null-version coverage in one place — when behavior evolves there's only one test to update.

…move null-check, consolidate test - Drop the version_not_found_for_lag_monitor_count OTel counter and its Tehuti counterpart: the WARN log introduced in the same change carries equivalent information and log-based alerting is sufficient; a counter for expected race behavior would clutter dashboards without driving action. - Remove the null-check guard around the (now-deleted) metric call. heartbeatMonitoringServiceStats is final and assigned unconditionally in the constructor, matching the pattern of all other call sites in the file. - Delete testUpdateLagMonitorDuringVersionCleanup; its only unique coverage was the metric verify calls. The SET_FOLLOWER_MONITOR + null-version graceful-skip behavior is already exercised by section 2 of testUpdateLagMonitor. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

haoxu07 · 2026-04-29T23:07:18Z

    return metadataRepository;
  }

+  HeartbeatMonitoringServiceStats getHeartbeatMonitoringServiceStats() {


This line seems useless.

yininzhou marked this pull request as ready for review April 10, 2026 00:58

yininzhou changed the title ~~Yininzho/heartbeat monitor version cleanup~~ [server]heartbeat monitor version cleanup Apr 10, 2026

sixpluszero requested a review from Copilot April 13, 2026 23:07

Copilot started reviewing on behalf of sixpluszero April 13, 2026 23:08 View session

Copilot AI reviewed Apr 13, 2026

View reviewed changes

yininzhou force-pushed the yininzho/heartbeat-monitor-version-cleanup branch from ebbae3c to 4dcb3c4 Compare April 14, 2026 21:11

Yining Zhou and others added 5 commits April 28, 2026 15:59

yininzhou force-pushed the yininzho/heartbeat-monitor-version-cleanup branch from 19f78a6 to 1227d43 Compare April 28, 2026 23:03

haoxu07 reviewed Apr 28, 2026

View reviewed changes

haoxu07 approved these changes Apr 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[server]heartbeat monitor version cleanup#2717

[server]heartbeat monitor version cleanup#2717
yininzhou wants to merge 6 commits into
linkedin:mainfrom
yininzhou:yininzho/heartbeat-monitor-version-cleanup

yininzhou commented Apr 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 13, 2026

Uh oh!

haoxu07 Apr 28, 2026

Uh oh!

haoxu07 Apr 28, 2026

Uh oh!

haoxu07 Apr 28, 2026

Uh oh!

haoxu07 Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yininzhou commented Apr 10, 2026

Problem Statement

Solution

Notice: please read

Observability Plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

haoxu07 Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

haoxu07 Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

haoxu07 Apr 28, 2026

Choose a reason for hiding this comment

Uh oh!

haoxu07 Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants