[server] Add SLO classification dimensions to record-level delay metric #2736
sushantmane wants to merge 11 commits into linkedin:main
Conversation
Pull request overview
Adds SLO-tier classification dimensions to the server-side OTel record-level delay metric (ingestion.replication.record.delay) so operators can slice latency/SLOs by locality, partial-update status, and chunking without joining against external store config.
Changes:
- Extend RecordLevelDelayOtelStats / RecordLevelDelayOtelMetricEntity to emit new dimensions: venice.region.locality, venice.partial_update.status, venice.chunking.status.
- Plumb the local region and store metadata into HeartbeatVersionedStats so the delay metric can be tagged at stats creation time.
- Add unit and integration/E2E tests validating that the new dimensions are present on emitted histograms.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Summary per file:
| File | Description |
|---|---|
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/NearlineE2ELatencyTest.java | Adds an E2E assertion that at least one server emits the record-delay histogram with expected SLO dimensions. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VenicePartialUpdateStatusTest.java | New unit test for the partial-update dimension enum wiring/value mapping. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VenicePartialUpdateStatus.java | Introduces VenicePartialUpdateStatus dimension enum. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java | Registers the new VENICE_PARTIAL_UPDATE_STATUS dimension name. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStatsTest.java | Updates/extends tests to validate added SLO dimensions on record-delay histograms. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelMetricEntityTest.java | Updates metric-entity dimension expectations to include the new dimensions. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatVersionedStatsTest.java | Adapts tests for new constructor args and validates record-delay dimension tagging. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStats.java | Adds locality/partial-update/chunking tagging via base dimensions + per-region base dimensions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelMetricEntity.java | Expands the metric entity’s required dimension set to include SLO dimensions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatVersionedStats.java | Supplies local region and store-derived flags into RecordLevelDelayOtelStats creation. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatMonitoringService.java | Passes localRegionName into HeartbeatVersionedStats to enable locality computation. |
| Store store = metadataRepository.getStore(storeName);
| boolean partialUpdateEnabled = store != null && store.isWriteComputationEnabled();
| boolean chunkingEnabled = store != null && store.isChunkingEnabled();
chunkingEnabled for the SLO dimension is derived from store.isChunkingEnabled(), but the PR description states the source should be version.isChunkingEnabled(). Since chunking is a version-level attribute (and can diverge from the store-level flag for existing versions), this may misclassify the metric for CURRENT/FUTURE roles. Consider deriving chunking status from the relevant Version object(s) (e.g., current/future version) or otherwise align the implementation/description so the emitted dimension reflects the intended semantics.
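A minimal sketch of the version-level derivation this comment suggests, assuming a `currentVersion` value is available in the surrounding method (a later revision of this PR does essentially this):

```java
// Prefer the version-level chunking flag; fall back to the store-level default when the
// version lookup fails. Assumes `currentVersion` is in scope in the surrounding method.
Store store = metadataRepository.getStore(storeName);
boolean partialUpdateEnabled = store != null && store.isWriteComputationEnabled();
Version version = store != null ? store.getVersion(currentVersion) : null;
boolean chunkingEnabled =
    version != null ? version.isChunkingEnabled() : (store != null && store.isChunkingEnabled());
```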
| */
| @Test
| public void testWriteComputeEnabledDimension() {
|   RecordLevelDelayOtelStats wcStats = new RecordLevelDelayOtelStats(
This test Javadoc mentions write_compute_status=write_compute_enabled, but the new dimension being asserted is venice.partial_update.status with values like partial_update_enabled/disabled. Please update the comment to match the actual dimension name/value semantics to avoid confusion when reading the test failures.
Force-pushed from 3fa3100 to a15ff88.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
| // because the metric also has version_role, replica_type, and replica_state dimensions.
| for (MetricData metric: reader.collectAllMetrics()) {
|   if (!metric.getName().contains("ingestion.replication.record.delay")) {
|     continue;
|   }
|   ExponentialHistogramData histData = metric.getExponentialHistogramData();
|   if (histData == null) {
|     continue;
|   }
|   for (ExponentialHistogramPointData point: histData.getPoints()) {
|     if (point.getCount() == 0) {
|       continue;
|     }
|     String locality = point.getAttributes().get(AttributeKey.stringKey(localityKey));
|     String writeType = point.getAttributes().get(AttributeKey.stringKey(writeTypeKey));
validateExponentialHistogramPointDataAtLeast matches histogram points by exact attribute equality (see OpenTelemetryDataTestUtils.getExponentialHistogramPointData), but expectedAttributes here only includes a subset of the metric dimensions. Since ingestion.replication.record.delay also includes version role, replica type, and replica state dimensions, this lookup will never find a point and the test will fail. Either include the full attribute set (and consider trying both LEADER/FOLLOWER + READY_TO_SERVE/CATCHING_UP as needed), or change the validation to search for any point whose attributes contain the expected SLO dimensions rather than requiring exact equality.
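One way to implement the suggested containment check is a small helper that accepts any histogram point whose attributes include the expected SLO dimensions. This is a sketch, not the test's actual code; it assumes the OpenTelemetry SDK types (`AttributeKey`, `ExponentialHistogramPointData`) and `java.util.Map` are already imported by the test class:

```java
// Accept a point when every expected SLO dimension is present with the expected value;
// extra dimensions (version_role, replica_type, replica_state, ...) are ignored.
private static boolean containsExpectedSloDimensions(
    ExponentialHistogramPointData point,
    Map<String, String> expectedSloDimensions) {
  for (Map.Entry<String, String> expected: expectedSloDimensions.entrySet()) {
    String actual = point.getAttributes().get(AttributeKey.stringKey(expected.getKey()));
    if (!expected.getValue().equals(actual)) {
      return false;
    }
  }
  return true;
}
```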
| }
|
| /**
|  * Verifies that write-compute-enabled stores emit write_compute_status=write_compute_enabled.
The Javadoc claims the dimension emitted is write_compute_status=write_compute_enabled, but the code/assertions below are actually validating VENICE_STORE_WRITE_TYPE=partial_update. Please update the comment to reflect the real dimension name/value so future readers don’t look for a non-existent label.
Suggested change:
| - * Verifies that write-compute-enabled stores emit write_compute_status=write_compute_enabled.
| + * Verifies that write-compute-enabled stores emit VENICE_STORE_WRITE_TYPE=partial_update.
| // Start with base dimensions (store name, cluster name) and add SLO classification dimensions
| this.baseDimensionsMap = new HashMap<>(otelSetup.getBaseDimensionsMap());
| VeniceStoreWriteType writeType =
|     partialUpdateEnabled ? VeniceStoreWriteType.PARTIAL_UPDATE : VeniceStoreWriteType.REGULAR_PUT;
| this.baseDimensionsMap.put(VeniceMetricsDimensions.VENICE_STORE_WRITE_TYPE, writeType.getDimensionValue());
| VeniceChunkingStatus chunkStatus = chunkingEnabled ? VeniceChunkingStatus.CHUNKED : VeniceChunkingStatus.UNCHUNKED;
| this.baseDimensionsMap.put(VeniceMetricsDimensions.VENICE_CHUNKING_STATUS, chunkStatus.getDimensionValue());
PR description says the new partial-update dimension is venice.partial_update.status with values like partial_update_enabled/disabled, but the implementation adds VENICE_STORE_WRITE_TYPE with values regular_put/partial_update. If the metric contract changed from the proposal, please update the PR description and/or rename the dimension/value to match the intended taxonomy so dashboards/queries don’t diverge from the documented labels.
|  * local-region ingestion. After streaming records and waiting for ingestion, verifies that at least
|  * one server emitted the histogram with the expected dimension values.
|  */
| @Test(timeOut = TEST_TIMEOUT)
| public void testRecordLevelDelaySloDimensions() {
|   String storeName = "test-slo-dims";
|   String parentControllerUrls = parentController.getControllerUrl();
This test’s Javadoc refers to a “partial update status” dimension, but the assertions below validate VENICE_STORE_WRITE_TYPE (regular_put/partial_update). Consider aligning the wording with the actual label/value emitted by the code to avoid confusion about which dimension is being verified.
Force-pushed from bf30d8f to d91cda3.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
| ExponentialHistogramData histData = metric.getExponentialHistogramData();
| if (histData == null) {
|   continue;
| }
| for (ExponentialHistogramPointData point: histData.getPoints()) {
ExponentialHistogramData and ExponentialHistogramPointData are used but not imported, so this test won’t compile. Add the appropriate io.opentelemetry.sdk.metrics.data.* imports (or fully qualify the types).
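For reference, both types live in the OpenTelemetry SDK metrics data package, so the missing imports would be:

```java
import io.opentelemetry.sdk.metrics.data.ExponentialHistogramData;
import io.opentelemetry.sdk.metrics.data.ExponentialHistogramPointData;
```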
| Store store = metadataRepository.getStore(storeName);
| boolean partialUpdateEnabled = store != null && store.isWriteComputationEnabled();
| boolean chunkingEnabled = store != null && store.isChunkingEnabled();
| return recordLevelDelayOtelStatsMap.computeIfAbsent(storeName, key -> {
| -   RecordLevelDelayOtelStats stats = new RecordLevelDelayOtelStats(getMetricsRepository(), storeName, clusterName);
| +   RecordLevelDelayOtelStats stats = new RecordLevelDelayOtelStats(
| +       getMetricsRepository(),
| +       storeName,
| +       clusterName,
| +       localRegionName,
| +       partialUpdateEnabled,
| +       chunkingEnabled);
Chunking status is derived from store.isChunkingEnabled(), but chunking is a version-level property as well (Version#isChunkingEnabled()), and versions can diverge from store defaults (e.g., cloned versions or later version-level changes). This can mislabel the record-delay metric when current/future versions differ in chunking. Consider deriving chunking from the relevant Version (e.g., current/future version) or making the chunking dimension part of the per-(region, versionRole) metric state instead of a static base dimension.
| - VENICE_DRAINER_TYPE("venice.drainer.type");
| + VENICE_DRAINER_TYPE("venice.drainer.type"),
| +
| + /** {@link VeniceStoreWriteType} Store write type: regular_put or partial_update. */
The Javadoc for VENICE_STORE_WRITE_TYPE describes values regular_put / partial_update, but the newly added enum emits regular / write_compute (via name().toLowerCase()). Please update the comment to reflect the actual emitted dimension values and ensure this aligns with the dimension contract described in the PR (which mentions a partial-update status dimension).
Suggested change:
| - /** {@link VeniceStoreWriteType} Store write type: regular_put or partial_update. */
| + /**
| +  * {@link VeniceStoreWriteType} Store write type: regular or write_compute.
| +  * Partial-update status is tracked separately from this dimension.
| +  */
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| Store store = metadataRepository.getStore(storeName);
| boolean partialUpdateEnabled = store != null && store.isWriteComputationEnabled();
| Version version = store != null ? store.getVersion(currentVersion) : null;
| boolean chunkingEnabled =
|     version != null ? version.isChunkingEnabled() : (store != null && store.isChunkingEnabled());
| return recordLevelDelayOtelStatsMap.computeIfAbsent(storeName, key -> {
| -   RecordLevelDelayOtelStats stats = new RecordLevelDelayOtelStats(getMetricsRepository(), storeName, clusterName);
| +   RecordLevelDelayOtelStats stats = new RecordLevelDelayOtelStats(
| +       getMetricsRepository(),
| +       storeName,
| +       clusterName,
| +       localRegionName,
| +       partialUpdateEnabled,
| +       chunkingEnabled);
RecordLevelDelayOtelStats is created once per store and the SLO base dimensions (partialUpdateEnabled/chunkingEnabled) are captured only at creation time (and chunkingEnabled is derived from the then-current version). If the store’s current version swaps or store configs change (e.g., chunking toggled), subsequent delay metrics will keep emitting the old dimension values. Consider refreshing/recreating the RecordLevelDelayOtelStats entry on handleStoreChanged/onVersionInfoUpdated when these classification inputs change, or otherwise recomputing the dimension values before recording so the tags stay accurate over time.
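A sketch of the invalidate-on-change option this comment proposes. The listener hook and map field are assumptions based on the surrounding class; the exact wiring in Venice may differ:

```java
// On store config changes, drop the cached per-store stats so the next lookup recreates
// RecordLevelDelayOtelStats with freshly computed partial-update/chunking dimensions.
@Override
public void handleStoreChanged(Store store) {
  recordLevelDelayOtelStatsMap.remove(store.getName());
}
```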
| /** {@link VeniceStoreWriteType} Store write type: regular_put or partial_update. */
| VENICE_STORE_WRITE_TYPE("venice.store.write_type");
The new dimension’s Javadoc/value expectations don’t match what the code actually emits (and what the PR description states). VeniceStoreWriteType defaults to name().toLowerCase() so the values will be regular / write_compute, not regular_put / partial_update; additionally the PR description calls this dimension venice.partial_update.status but the code adds venice.store.write_type. Please align the dimension name + documented values (either update the Javadoc/PR description to match what’s emitted, or rename/override the enum value mapping if the intended values are different).
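If the intended emitted values really are regular_put / partial_update, one option is to carry explicit dimension values on the enum instead of relying on name().toLowerCase(). This is a hypothetical variant, not the enum the PR ultimately keeps (the final code emits regular / write_compute):

```java
// Hypothetical variant of VeniceStoreWriteType with explicit dimension values.
public enum VeniceStoreWriteType {
  REGULAR_PUT("regular_put"), PARTIAL_UPDATE("partial_update");

  private final String dimensionValue;

  VeniceStoreWriteType(String dimensionValue) {
    this.dimensionValue = dimensionValue;
  }

  public String getDimensionValue() {
    return this.dimensionValue;
  }
}
```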
Add three new OTel dimensions to ingestion.replication.record.delay for nearline write SLO tier classification:
- venice.region.locality (LOCAL/REMOTE): computed by comparing the record's source region against the server's local region (see the sketch below)
- venice.write_compute.status (WRITE_COMPUTE_ENABLED/DISABLED): from store config, distinguishes full PUT vs write compute workloads
- venice.chunking.status (CHUNKED/UNCHUNKED): from version config, identifies stores that handle large values

All three are static base dimensions (set once per store/version), so they add zero cardinality overhead. These dimensions enable SLO queries like: "P99 leader record delay for local-colo, non-WC, non-chunked stores" without requiring external config JOINs at query time.
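A minimal sketch of the locality comparison described in the first bullet; the helper name and string values are illustrative, not necessarily the PR's exact identifiers:

```java
// Classify a record as local when its source region matches the server's configured region.
static String classifyRegionLocality(String localRegionName, String sourceRegionName) {
  return localRegionName != null && localRegionName.equals(sourceRegionName) ? "local" : "remote";
}
```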
Rename VENICE_WRITE_COMPUTE_STATUS → VENICE_PARTIAL_UPDATE_STATUS and VeniceWriteComputeStatus → VenicePartialUpdateStatus to use the user-facing terminology (partial update) rather than the internal implementation name (write compute).
Add testRecordLevelDelaySloDimensions to NearlineE2ELatencyTest that verifies the new OTel dimensions (region locality, partial update status, chunking status) are correctly emitted on the ingestion.replication.record.delay histogram after nearline ingestion.
Replace VenicePartialUpdateStatus (partial_update_enabled/disabled) with VeniceStoreWriteType (regular_put/partial_update) for clearer SLO tier naming. Dimension renamed from venice.partial_update.status to venice.store.write_type.
The VENICE_STORE_WRITE_TYPE enum entry was lost during rebase because main added VENICE_DRAINER_TYPE at the same position. Re-add the entry and add the corresponding switch cases to all three naming format tests.
The previous test used validateExponentialHistogramPointDataAtLeast which requires exact attribute match. Since the metric has additional dimensions (version_role, replica_type, replica_state) beyond the SLO dimensions we want to verify, use manual iteration to check that the SLO dimensions (region_locality, store_write_type, chunking_status) are present with the expected values.
Cleaner dimension values in metrics explorer: Venice.Store.WriteType = "regular" | "write_compute"
Add back ExponentialHistogramData and ExponentialHistogramPointData imports that were accidentally removed during the earlier refactor.
isChunkingEnabled is a version-level property — different versions of the same store can have different chunking settings. Look up the current version from the store and use its chunking config. Fall back to the store-level default if the version is not found. isWriteComputationEnabled remains store-level (not available on Version).
Revert to store-level isChunkingEnabled() since the stats object is shared across versions (keyed by store name). Add comment explaining the trade-off and how to get per-version accuracy if needed.
…vadocs
- Use OpenTelemetryMetricsSetup.builder().addCustomDimension() for write type and chunking status instead of manually adding to baseDimensionsMap (reviewer: m-nagarajan), as sketched below
- Fix stale javadocs referencing old dimension names (write_compute_status, partial_update_status -> store.write_type)
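A sketch of the reviewer-suggested wiring. Only addCustomDimension(...) is named by the commit message; the builder variable and everything else here is an assumption, not the verified OpenTelemetryMetricsSetup API:

```java
// Attach the SLO classification dimensions via the metrics-setup builder instead of
// mutating baseDimensionsMap directly (builder variable name is illustrative).
otelSetupBuilder
    .addCustomDimension(VeniceMetricsDimensions.VENICE_STORE_WRITE_TYPE, writeType.getDimensionValue())
    .addCustomDimension(VeniceMetricsDimensions.VENICE_CHUNKING_STATUS, chunkStatus.getDimensionValue());
```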
Force-pushed from 96b523b to 603612e.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
| @Test(timeOut = TEST_TIMEOUT)
| public void testRecordLevelDelaySloDimensions() {
|   String storeName = "test-slo-dims";
|   String parentControllerUrls = parentController.getControllerUrl();
|   try (ControllerClient parentControllerCli = new ControllerClient(CLUSTER_NAME, parentControllerUrls);
testRecordLevelDelaySloDimensions can exceed the @Test(timeOut = TEST_TIMEOUT) budget: it may wait up to 60s for push completion + 30s for read availability + 90s for the heartbeat reporter cycle (>120s total), causing deterministic timeouts/flakiness. Consider increasing the timeout for this test (or using a larger per-test timeout constant) and/or shortening the wait by configuring a shorter reporter interval for the test cluster.
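One possible mitigation, sketched with an illustrative constant; the real fix could equally shorten the reporter interval in the test cluster config:

```java
// Give this test its own, larger budget so the push wait (60s), read-availability wait (30s),
// and heartbeat reporter cycle (90s) fit within the timeout. Constant name is illustrative.
private static final long SLO_DIMENSIONS_TEST_TIMEOUT_MS = 4 * 60 * 1000L;

@Test(timeOut = SLO_DIMENSIONS_TEST_TIMEOUT_MS)
public void testRecordLevelDelaySloDimensions() {
  // ... existing test body ...
}
```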
Problem Statement
The ingestion.replication.record.delay OTel metric lacks dimensions needed for nearline write SLO tier classification. To measure SLO attainment per workload tier, operators must JOIN metric data with external store config — complex, fragile, and not supported natively in Grafana/PromQL.

Solution
Add three new OTel base dimensions to ingestion.replication.record.delay:

| Dimension | Values | Source |
|---|---|---|
| venice.region.locality | local, remote | serverConfig.getRegionName() |
| venice.partial_update.status | partial_update_enabled, partial_update_disabled | store.isWriteComputationEnabled() |
| venice.chunking.status | chunked, unchunked | version.isChunkingEnabled() |

All three are static base dimensions (set once per store/version, like store name), so they add zero cardinality overhead.
Region locality is computed once per region at metric state creation time. Partial update and chunking status are looked up from the store metadata repository when the RecordLevelDelayOtelStats is lazily created.

Code changes
- ingestion.replication.record.delay metric gated by server.record.level.timestamp.enabled and server.per.record.otel.metrics.enabled (both default: false). No new configs introduced.

Concurrency-Specific Checks
- VeniceConcurrentHashMap for per-region metric states (unchanged).

How was this PR tested?
Unit tests: VenicePartialUpdateStatusTest, RecordLevelDelayOtelStatsTest (3 new: region locality, partial update, chunking), HeartbeatVersionedStatsTest, RecordLevelDelayOtelMetricEntityTest

E2E test:
NearlineE2ELatencyTest.testRecordLevelDelaySloDimensions — verifies SLO dimensions through the full nearline ingestion pipeline

Does this PR introduce any user-facing or breaking changes?