[server] Add OTel metrics to ServerLoadStats#2724
Conversation
There was a problem hiding this comment.
Pull request overview
Adds OpenTelemetry (OTel) coverage for server-side load controller metrics and fixes an accounting gap where non-rejected-but-slow requests were previously not counted as “accepted” in ServerLoadStats.
Changes:
- Introduces OTel metric entities for server load control request counts (by outcome) and rejection ratio, wired through the joint Tehuti+OTel metric state API.
- Adds a new OTel dimension
VENICE_SERVER_LOAD_REQUEST_OUTCOMEwith valuesaccepted/rejected. - Fixes
ServerLoadControllerHandler.recordLatency()to record an accepted request for all non-overloaded responses (while still only feeding the admission algorithm for below-threshold latencies), with updated/new unit tests.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| services/venice-server/src/main/java/com/linkedin/venice/stats/ServerLoadStats.java | Implements joint Tehuti+OTel emission for request outcome counter + rejection ratio histogram; keeps Tehuti-only total sensor. |
| services/venice-server/src/main/java/com/linkedin/venice/listener/ServerLoadControllerHandler.java | Ensures all non-overloaded requests are counted as accepted in metrics; preserves admission algorithm behavior for fast requests. |
| services/venice-server/src/main/java/com/linkedin/venice/listener/HttpChannelInitializer.java | Updates ServerLoadStats construction to pass cluster name for OTel base attributes. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Registers the new ServerLoadOtelMetricEntity module so server metric entities include the new definitions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerLoadOtelMetricEntity.java | Defines OTel metric entities for load controller request count and rejection ratio. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceServerLoadRequestOutcome.java | Adds the enum backing the request outcome dimension values. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java | Adds the new VENICE_SERVER_LOAD_REQUEST_OUTCOME dimension key. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java | Extends naming-format tests to cover the new dimension. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceServerLoadRequestOutcomeTest.java | Verifies dimension interface behavior and expected values for the new outcome enum. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates expected server metric entity count due to the two new metric entities. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerLoadOtelMetricEntityTest.java | Validates metric entity definitions (name/type/unit/description/dimensions) for the new OTel metrics. |
| services/venice-server/src/test/java/com/linkedin/venice/stats/ServerLoadStatsTest.java | Adds comprehensive unit tests for OTel counter/histogram behavior and Tehuti sensor registration, plus NPE-prevention paths. |
| services/venice-server/src/test/java/com/linkedin/venice/stats/ServerLoadTehutiMetricNameTest.java | Validates Tehuti metric names for the new joint-API-managed sensors. |
| services/venice-server/src/test/java/com/linkedin/venice/listener/ServerLoadControllerHandlerTest.java | Updates expectations to reflect the corrected “accepted request” counting behavior (fast + slow requests). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
services/venice-server/src/main/java/com/linkedin/venice/stats/ServerLoadStats.java:346 💡 [SUGGESTION] Arrays.asList used for an immutable list in Java 11+ codebase The Suggested fix: |
c2566a5 to
a78e439
Compare
There was a problem hiding this comment.
Pull request overview
Adds OpenTelemetry equivalents for server load-control metrics so ServerLoadStats data is available in OTel dashboards, and fixes accepted-request counting to include all non-load-shed requests.
Changes:
- Introduce OTel metric entities for load controller request count (by ACCEPTED/REJECTED outcome) and rejection ratio, plus a new
VENICE_SERVER_LOAD_REQUEST_OUTCOMEdimension. - Update
ServerLoadStatsto emit joint Tehuti+OTel metrics (keepingtotal_requestTehuti-only) and plumb cluster name into the stats constructor. - Fix
ServerLoadControllerHandler.recordLatency()to record accepted requests for all non-rejected requests; update/add unit tests accordingly.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| services/venice-server/src/main/java/com/linkedin/venice/stats/ServerLoadStats.java | Adds joint Tehuti+OTel metric wiring for request outcome counter + rejection ratio histogram. |
| services/venice-server/src/main/java/com/linkedin/venice/listener/ServerLoadControllerHandler.java | Ensures all served (non-load-shed) requests are counted as accepted in metrics. |
| services/venice-server/src/main/java/com/linkedin/venice/listener/HttpChannelInitializer.java | Passes cluster name into updated ServerLoadStats constructor. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerLoadOtelMetricEntity.java | Defines new server-load OTel metric entities (count + rejection ratio). |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java | Registers the new ServerLoadOtelMetricEntity in the server metric entity aggregation. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java | Adds VENICE_SERVER_LOAD_REQUEST_OUTCOME dimension constant. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceServerLoadRequestOutcome.java | Adds enum for ACCEPTED/REJECTED dimension values. |
| services/venice-server/src/test/java/com/linkedin/venice/stats/ServerLoadStatsTest.java | New unit coverage for OTel counter/histogram behavior + Tehuti sensors + OTel-disabled safety. |
| services/venice-server/src/test/java/com/linkedin/venice/stats/ServerLoadTehutiMetricNameTest.java | Verifies Tehuti metric-name mapping for the new enum. |
| services/venice-server/src/test/java/com/linkedin/venice/listener/ServerLoadControllerHandlerTest.java | Updates assertions for accepted-request counting behavior change. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerLoadOtelMetricEntityTest.java | Validates metric entity definitions (name/type/unit/dimensions). |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java | Updates expected entity count after adding 2 new metric entities. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java | Extends dimension naming-format test coverage for the new dimension. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceServerLoadRequestOutcomeTest.java | Validates dimension interface/value mappings for the new enum. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
96c3f23 to
980a3a7
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Add 2 OTel metrics for server load control: - load_controller.request.count (COUNTER with ACCEPTED/REJECTED dimension) - load_controller.request.rejection_ratio (MIN_MAX_COUNT_SUM_AGGREGATIONS) Joint Tehuti+OTel API. Shared OTel instrument pattern for request count: two MetricEntityStateOneEnum instances (accepted + rejected) share the same OTel counter, each bound to its own Tehuti sensor. Tehuti total_request sensor remains Tehuti-only (OTel derives total at query time). Also fixes accepted request counting: all non-rejected requests are now recorded as ACCEPTED (previously only below-threshold requests were).
980a3a7 to
d7e8bcb
Compare
Problem Statement
ServerLoadStatshas 4 Tehuti sensors for server load control (total/rejected/accepted request rates + rejection ratio) but no OTel counterparts, making these metrics unavailable in OTel-based monitoring dashboards.Solution
Add 2 OTel metrics for server load control using the joint Tehuti+OTel API:
load_controller.request.count(COUNTER) — count of requests by load control outcome, withVENICE_SERVER_LOAD_REQUEST_OUTCOMEdimension (ACCEPTED/REJECTED). Uses the shared OTel instrument pattern: twoMetricEntityStateOneEnuminstances share the same OTel counter, each bound to its own Tehuti sensor.load_controller.request.rejection_ratio(MIN_MAX_COUNT_SUM_AGGREGATIONS) — server load rejection ratio as a fraction.The Tehuti
total_requestsensor remains Tehuti-only — OTel derives total at query time fromsum(ACCEPTED + REJECTED).Behavioral fix:
recordAcceptedRequest()is now called for ALL non-rejected requests inServerLoadControllerHandler.recordLatency(), not just those below the latency threshold. Previously, requests above the threshold but not rejected were uncounted.loadController.recordAccept()(the internal admission algorithm feed) still only fires for below-threshold requests.Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized,RWLock) are used where needed.ConcurrentHashMap,CopyOnWriteArrayList).How was this PR tested?
New tests:
ServerLoadStatsTest(9 tests): OTel counter accumulation for accepted/rejected, dimension isolation, rejection ratio histogram (min/max/count/sum), all Tehuti sensors, double-counting prevention (total doesn't produce OTel data), NPE prevention with OTel disabled and plain MetricsRepository.ServerLoadOtelMetricEntityTest: metric entity validation for both metrics.ServerLoadTehutiMetricNameTest: Tehuti name validation for 3 enum values.VeniceServerLoadRequestOutcomeTest: dimension value validation.ServerLoadControllerHandlerTest: updated to verify all 6 requests (3 fast + 3 slow) are recorded as accepted.Does this PR introduce any user-facing or breaking changes?