[server] Add OTel metrics to RocksDBStats#2731
Conversation
There was a problem hiding this comment.
Pull request overview
Adds OpenTelemetry (OTel) metric coverage for RocksDB ticker-based stats to enable migration from Tehuti-only monitoring, including new dimension enums for per-component/per-level breakdowns.
Changes:
- Add 13 new OTel metric entities for RocksDB ticker stats (per-component block cache, per-level get-hit, joint counters, and read amplification).
- Wire the new metric entities into server metric aggregation and fix
AggRocksDBStatsto pass the correctMetricsRepository/cluster name. - Add new dimension enums + comprehensive unit tests for both Tehuti and OTel reporting paths.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| services/venice-server/src/main/java/com/linkedin/venice/stats/RocksDBStats.java | Registers Tehuti ticker gauges and adds OTel async metrics with component/level dimensions + read amplification behavior. |
| services/venice-server/src/main/java/com/linkedin/venice/stats/AggRocksDBStats.java | Fixes factory lambda to use the lambda repo parameter and passes cluster name into RocksDBStats. |
| services/venice-server/src/test/java/com/linkedin/venice/stats/RocksDBStatsTehutiTest.java | Adds Tehuti-baseline tests for ticker gauges, hit ratio edge case, read amplification sentinel behavior, and setter guards. |
| services/venice-server/src/test/java/com/linkedin/venice/stats/RocksDBStatsOtelTest.java | Adds OTel emission tests validating attributes/dimensions, NaN-drop behavior, and NPE safety when OTel is disabled/plain repo. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceRocksDBLevel.java | Introduces VENICE_ROCKSDB_LEVEL dimension enum. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceRocksDBBlockCacheComponent.java | Introduces VENICE_ROCKSDB_BLOCK_CACHE_COMPONENT dimension enum. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java | Adds the two new dimension keys to the shared dimension registry. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceRocksDBLevelTest.java | Validates VeniceRocksDBLevel dimension values via the dimension fixture. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceRocksDBBlockCacheComponentTest.java | Validates VeniceRocksDBBlockCacheComponent dimension values via the dimension fixture. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java | Updates naming-format assertions for the two new dimensions (snake/camel/pascal). |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/RocksDBStatsOtelMetricEntity.java | Defines the new RocksDBStats OTel MetricEntity set (13 entities). |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Registers RocksDBStatsOtelMetricEntity into the server metric entity aggregation list. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/RocksDBStatsOtelMetricEntityTest.java | Adds entity-definition validation for all new metric entities/dimensions. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates the expected aggregated entity count after adding the new enum. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
services/venice-server/src/main/java/com/linkedin/venice/stats/RocksDBStats.java:829 💡 [SUGGESTION] Double-lambda wrapper in The helper method creates a LongSupplier callback = () -> rocksDBStat != null ? rocksDBStat.getTickerCount(tickerType) : -1;
registerSensorIfAbsent(new AsyncGauge((ig, ig2) -> callback.getAsLong(), tehutiSensorName));
registerSensorIfAbsent(new AsyncGauge((ig, ig2) -> callback.getAsLong(), tehutiSensorName));…which is what the code already does, so no change needed. Actually the issue is that the callback could be used directly as an |
|
services/venice-server/src/main/java/com/linkedin/venice/stats/RocksDBStats.java:660 💡 [SUGGESTION] Overall block cache The three overall block cache counters ( Suggested fix: |
Add 13 OTel metrics alongside existing 27 Tehuti sensors for RocksDB ticker-based statistics (gated by rocksdb.statistics.enabled). Per-component block cache metrics (miss, hit, add, bytes_inserted) are consolidated into 4 OTel metrics with a BLOCK_CACHE_COMPONENT dimension (INDEX/FILTER/DATA/COMPRESSION_DICT). Get-hit-by-level uses a LEVEL dimension (LEVEL_0/LEVEL_1/LEVEL_2_AND_UP). Overall block cache miss/hit/add are Tehuti-only (OTel derives via sum by component). Block cache hit ratio is Tehuti-only (derivable from per-component OTel gauges). Read amplification uses ASYNC_DOUBLE_GAUGE with split callbacks (Tehuti returns -1, OTel returns NaN when unavailable). Also fixes: AggRocksDBStats lambda capturing outer metricsRepository instead of lambda parameter, and adds null guard to setRocksDBStat.
- Make rocksDBStat volatile for JMM correctness - Rename block cache byte metrics for consistency: bytes.read, bytes.written (matches bytes.inserted)
ed688f6 to
80deab0
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
# Conflicts: # clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java # clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java # internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java # internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
services/venice-server/src/main/java/com/linkedin/venice/stats/RocksDBStats.java:318
setRocksDBStatis still racy under concurrent callers: two threads can both observerocksDBStat == null, both skip the exception, and the later write silently overwrites the first one.volatilefixes visibility for readers but does not make this one-time initialization check atomic.
public void setRocksDBStat(Statistics stat) {
if (stat == null) {
throw new IllegalArgumentException("'stat' must not be null");
}
if (this.rocksDBStat != null) {
throw new VeniceException("'rocksDBStat' has already been initialized");
}
this.rocksDBStat = stat;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Problem Statement
RocksDBStats (gated by
rocksdb.statistics.enabled, default false) reports 27 Tehuti AsyncGauge sensors for RocksDB ticker counters but has no OTel metrics. This prevents migration to OTel-based monitoring for RocksDB block cache, memtable, bloom filter, compaction, and read amplification statistics.Solution
Add 13 OTel metrics alongside the existing 27 Tehuti sensors, consolidating per-component metrics using dimensions:
miss.count,hit.count,add.count,bytes.inserted) withVENICE_ROCKSDB_BLOCK_CACHE_COMPONENTdimension (INDEX/FILTER/DATA/COMPRESSION_DICT) — consolidates 16 Tehuti sensors into 4 OTel metricsget.hit.count) withVENICE_ROCKSDB_LEVELdimension (LEVEL_0/LEVEL_1/LEVEL_2_AND_UP) — consolidates 3 Tehuti sensors into 1 OTel metricASYNC_DOUBLE_GAUGEwith split callbacks (Tehuti returns -1 sentinel, OTel returns NaN when unavailable so SDK drops data point)Overall block cache miss/hit/add counters remain Tehuti-only — OTel totals are derivable via
sum by (component). Block cache hit ratio is also Tehuti-only (derivable from per-component OTel gauges).Tehuti → OTel metric mapping (click to expand)
rocksdb_block_cache_index_missrocksdb.block_cache.miss.countrocksdb_block_cache_filter_missrocksdb.block_cache.miss.countrocksdb_block_cache_data_missrocksdb.block_cache.miss.countrocksdb_block_cache_compression_dict_missrocksdb.block_cache.miss.countrocksdb_block_cache_index_hitrocksdb.block_cache.hit.countrocksdb_block_cache_filter_hitrocksdb.block_cache.hit.countrocksdb_block_cache_data_hitrocksdb.block_cache.hit.countrocksdb_block_cache_compression_dict_hitrocksdb.block_cache.hit.countrocksdb_block_cache_index_addrocksdb.block_cache.add.countrocksdb_block_cache_filter_addrocksdb.block_cache.add.countrocksdb_block_cache_data_addrocksdb.block_cache.add.countrocksdb_block_cache_compression_dict_addrocksdb.block_cache.add.countrocksdb_block_cache_index_bytes_insertrocksdb.block_cache.bytes.insertedrocksdb_block_cache_filter_bytes_insertrocksdb.block_cache.bytes.insertedrocksdb_block_cache_data_bytes_insertrocksdb.block_cache.bytes.insertedrocksdb_block_cache_compression_dict_bytes_insertrocksdb.block_cache.bytes.insertedrocksdb_block_cache_add_failuresrocksdb.block_cache.add.failure.countrocksdb_block_cache_bytes_readrocksdb.block_cache.bytes.readrocksdb_block_cache_bytes_writerocksdb.block_cache.bytes.writtenrocksdb_bloom_filter_usefulrocksdb.bloom_filter.useful_countrocksdb_memtable_hitrocksdb.memtable.hit.countrocksdb_memtable_missrocksdb.memtable.miss.countrocksdb_compaction_cancelledrocksdb.compaction.cancelled_countrocksdb_get_hit_l0rocksdb.get.hit.countrocksdb_get_hit_l1rocksdb.get.hit.countrocksdb_get_hit_l2_and_uprocksdb.get.hit.countrocksdb_read_amplification_factorrocksdb.read_amplification_factorrocksdb_block_cache_missrocksdb_block_cache_hitrocksdb_block_cache_addrocksdb_block_cache_hit_ratioDimension-to-TickerType mappings use static
EnumMaps. A missing mapping causes theliveStateResolverto returnnull(dormant — no data point emitted) rather than throwing; completeness is enforced by per-enum CI tests (testEveryVeniceRocksDBLevelEmitsForGetHitCount,testEveryVeniceRocksDBBlockCacheComponentEmitsForAllPerComponentMetrics) so adding a newVeniceRocksDBLevelorVeniceRocksDBBlockCacheComponentvalue without updating the ticker map fails CI rather than spamming failure metrics every collection cycle.Also fixes:
AggRocksDBStatslambda was capturing the outermetricsRepositoryconstructor parameter instead of the lambda parametermetricsRepo, and adds null guard tosetRocksDBStat.rocksDBStatfield madevolatilefor JMM correctness.RocksDBStatsdoes NOT propagateisTotalStats()toOpenTelemetryMetricsSetup—AggRocksDBStatsis fire-and-forget inVeniceServer.java, so thetotalStatsinstance is the onlyRocksDBStatsever created in production. PropagatingisTotalStats(true)would suppress every RocksDB OTel emission. This matches the pattern used byRocksDBMemoryStats(the closest single-instance analog).Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
New test files:
RocksDBStatsOtelTest(15 tests): per-component dimension validation for all 4 block cache metrics × 4 components, joint metrics, SST-level dimensions, read amplificationASYNC_DOUBLE_GAUGEwith NaN edge case, pre-init behavior (jointAsyncMetricEntityStateBasemetrics emit-1sentinel; per-component / per-levelAsyncMetricEntityStateOneEnummetrics honor the dormant contract and emit no data point), per-enum completeness guards (testEveryVeniceRocksDBLevelEmitsForGetHitCount,testEveryVeniceRocksDBBlockCacheComponentEmitsForAllPerComponentMetrics),isTotalStatsregression guard (testOtelEmitsForTotalStatsNameconstructsRocksDBStats(repo, "total", ...)and asserts emission so re-introducingisTotalStats()propagation fails CI), NPE prevention (OTel disabled + plainMetricsRepository). Tests use a dedicatedAsyncGauge.AsyncGaugeExecutorperVeniceMetricsRepositorysoclose()doesn't shut down the JVM-wide singleton executor and break later AsyncGauge tests.RocksDBStatsTehutiTest(5 tests): Tehuti sensor baseline (ticker values, hit ratio formula, read amplification -1 sentinel, NaN on zero denominator, setRocksDBStat guards)RocksDBStatsOtelMetricEntityTest: validates all 13 entity definitions via ModuleMetricEntityTestFixtureVeniceRocksDBBlockCacheComponentTest+VeniceRocksDBLevelTest: dimension enum validation via VeniceDimensionTestFixtureVeniceMetricsDimensionsTest: updated for both new dimensions across all 3 naming formatsServerMetricEntityTest: count updated from 159 to 172Does this PR introduce any user-facing or breaking changes?