Skip to content

[server] Add OTel metrics to ServerLoadStats#2724

Merged
m-nagarajan merged 3 commits into
linkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForServerLoadStats
May 4, 2026
Merged

[server] Add OTel metrics to ServerLoadStats#2724
m-nagarajan merged 3 commits into
linkedin:mainfrom
m-nagarajan:mnagaraj/addOtelMetricsForServerLoadStats

Conversation

@m-nagarajan
Copy link
Copy Markdown
Contributor

Problem Statement

ServerLoadStats has 4 Tehuti sensors for server load control (total/rejected/accepted request rates + rejection ratio) but no OTel counterparts, making these metrics unavailable in OTel-based monitoring dashboards.

Solution

Add 2 OTel metrics for server load control using the joint Tehuti+OTel API:

  • load_controller.request.count (COUNTER) — count of requests by load control outcome, with VENICE_SERVER_LOAD_REQUEST_OUTCOME dimension (ACCEPTED/REJECTED). Uses the shared OTel instrument pattern: two MetricEntityStateOneEnum instances share the same OTel counter, each bound to its own Tehuti sensor.
  • load_controller.request.rejection_ratio (MIN_MAX_COUNT_SUM_AGGREGATIONS) — server load rejection ratio as a fraction.

The Tehuti total_request sensor remains Tehuti-only — OTel derives total at query time from sum(ACCEPTED + REJECTED).

Behavioral fix: recordAcceptedRequest() is now called for ALL non-rejected requests in ServerLoadControllerHandler.recordLatency(), not just those below the latency threshold. Previously, requests above the threshold but not rejected were uncounted. loadController.recordAccept() (the internal admission algorithm feed) still only fires for below-threshold requests.

Code changes

  • Added new code behind a config.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

New tests:

  • ServerLoadStatsTest (9 tests): OTel counter accumulation for accepted/rejected, dimension isolation, rejection ratio histogram (min/max/count/sum), all Tehuti sensors, double-counting prevention (total doesn't produce OTel data), NPE prevention with OTel disabled and plain MetricsRepository.
  • ServerLoadOtelMetricEntityTest: metric entity validation for both metrics.
  • ServerLoadTehutiMetricNameTest: Tehuti name validation for 3 enum values.
  • VeniceServerLoadRequestOutcomeTest: dimension value validation.
  • ServerLoadControllerHandlerTest: updated to verify all 6 requests (3 fast + 3 slow) are recorded as accepted.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Copilot AI review requested due to automatic review settings April 11, 2026 23:04
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds OpenTelemetry (OTel) coverage for server-side load controller metrics and fixes an accounting gap where non-rejected-but-slow requests were previously not counted as “accepted” in ServerLoadStats.

Changes:

  • Introduces OTel metric entities for server load control request counts (by outcome) and rejection ratio, wired through the joint Tehuti+OTel metric state API.
  • Adds a new OTel dimension VENICE_SERVER_LOAD_REQUEST_OUTCOME with values accepted/rejected.
  • Fixes ServerLoadControllerHandler.recordLatency() to record an accepted request for all non-overloaded responses (while still only feeding the admission algorithm for below-threshold latencies), with updated/new unit tests.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
services/venice-server/src/main/java/com/linkedin/venice/stats/ServerLoadStats.java Implements joint Tehuti+OTel emission for request outcome counter + rejection ratio histogram; keeps Tehuti-only total sensor.
services/venice-server/src/main/java/com/linkedin/venice/listener/ServerLoadControllerHandler.java Ensures all non-overloaded requests are counted as accepted in metrics; preserves admission algorithm behavior for fast requests.
services/venice-server/src/main/java/com/linkedin/venice/listener/HttpChannelInitializer.java Updates ServerLoadStats construction to pass cluster name for OTel base attributes.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java Registers the new ServerLoadOtelMetricEntity module so server metric entities include the new definitions.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerLoadOtelMetricEntity.java Defines OTel metric entities for load controller request count and rejection ratio.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceServerLoadRequestOutcome.java Adds the enum backing the request outcome dimension values.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java Adds the new VENICE_SERVER_LOAD_REQUEST_OUTCOME dimension key.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java Extends naming-format tests to cover the new dimension.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceServerLoadRequestOutcomeTest.java Verifies dimension interface behavior and expected values for the new outcome enum.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java Updates expected server metric entity count due to the two new metric entities.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerLoadOtelMetricEntityTest.java Validates metric entity definitions (name/type/unit/description/dimensions) for the new OTel metrics.
services/venice-server/src/test/java/com/linkedin/venice/stats/ServerLoadStatsTest.java Adds comprehensive unit tests for OTel counter/histogram behavior and Tehuti sensor registration, plus NPE-prevention paths.
services/venice-server/src/test/java/com/linkedin/venice/stats/ServerLoadTehutiMetricNameTest.java Validates Tehuti metric names for the new joint-API-managed sensors.
services/venice-server/src/test/java/com/linkedin/venice/listener/ServerLoadControllerHandlerTest.java Updates expectations to reflect the corrected “accepted request” counting behavior (fast + slow requests).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@m-nagarajan
Copy link
Copy Markdown
Contributor Author

services/venice-server/src/main/java/com/linkedin/venice/stats/ServerLoadStats.java:346

💡 [SUGGESTION] Arrays.asList used for an immutable list in Java 11+ codebase

The rejectionRatioMetric construction uses Arrays.asList(new Avg(), new Max()) for the Tehuti stats list. In a Java 11+ codebase, List.of(new Avg(), new Max()) is the idiomatic choice for an immutable list and communicates intent more clearly. The two MetricEntityStateOneEnum calls directly above already use Collections.singletonList, so consistency within this constructor is also off.

Suggested fix:
Replace Arrays.asList(new Avg(), new Max()) with List.of(new Avg(), new Max()) and remove the import java.util.Arrays; import.

@m-nagarajan m-nagarajan added the needs-reviewer Looking for a reviewer to pick this up label Apr 20, 2026
Copilot AI review requested due to automatic review settings April 30, 2026 22:27
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelMetricsForServerLoadStats branch from c2566a5 to a78e439 Compare April 30, 2026 22:27
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds OpenTelemetry equivalents for server load-control metrics so ServerLoadStats data is available in OTel dashboards, and fixes accepted-request counting to include all non-load-shed requests.

Changes:

  • Introduce OTel metric entities for load controller request count (by ACCEPTED/REJECTED outcome) and rejection ratio, plus a new VENICE_SERVER_LOAD_REQUEST_OUTCOME dimension.
  • Update ServerLoadStats to emit joint Tehuti+OTel metrics (keeping total_request Tehuti-only) and plumb cluster name into the stats constructor.
  • Fix ServerLoadControllerHandler.recordLatency() to record accepted requests for all non-rejected requests; update/add unit tests accordingly.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

Show a summary per file
File Description
services/venice-server/src/main/java/com/linkedin/venice/stats/ServerLoadStats.java Adds joint Tehuti+OTel metric wiring for request outcome counter + rejection ratio histogram.
services/venice-server/src/main/java/com/linkedin/venice/listener/ServerLoadControllerHandler.java Ensures all served (non-load-shed) requests are counted as accepted in metrics.
services/venice-server/src/main/java/com/linkedin/venice/listener/HttpChannelInitializer.java Passes cluster name into updated ServerLoadStats constructor.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerLoadOtelMetricEntity.java Defines new server-load OTel metric entities (count + rejection ratio).
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ServerMetricEntity.java Registers the new ServerLoadOtelMetricEntity in the server metric entity aggregation.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java Adds VENICE_SERVER_LOAD_REQUEST_OUTCOME dimension constant.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceServerLoadRequestOutcome.java Adds enum for ACCEPTED/REJECTED dimension values.
services/venice-server/src/test/java/com/linkedin/venice/stats/ServerLoadStatsTest.java New unit coverage for OTel counter/histogram behavior + Tehuti sensors + OTel-disabled safety.
services/venice-server/src/test/java/com/linkedin/venice/stats/ServerLoadTehutiMetricNameTest.java Verifies Tehuti metric-name mapping for the new enum.
services/venice-server/src/test/java/com/linkedin/venice/listener/ServerLoadControllerHandlerTest.java Updates assertions for accepted-request counting behavior change.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerLoadOtelMetricEntityTest.java Validates metric entity definitions (name/type/unit/dimensions).
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ServerMetricEntityTest.java Updates expected entity count after adding 2 new metric entities.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensionsTest.java Extends dimension naming-format test coverage for the new dimension.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VeniceServerLoadRequestOutcomeTest.java Validates dimension interface/value mappings for the new enum.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sushantmane
sushantmane previously approved these changes May 4, 2026
Copy link
Copy Markdown
Contributor

@sushantmane sushantmane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sushantmane sushantmane added approved PR has been approved and removed needs-reviewer Looking for a reviewer to pick this up labels May 4, 2026
Copilot AI review requested due to automatic review settings May 4, 2026 22:12
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelMetricsForServerLoadStats branch from 96c3f23 to 980a3a7 Compare May 4, 2026 22:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Add 2 OTel metrics for server load control:
- load_controller.request.count (COUNTER with ACCEPTED/REJECTED dimension)
- load_controller.request.rejection_ratio (MIN_MAX_COUNT_SUM_AGGREGATIONS)

Joint Tehuti+OTel API. Shared OTel instrument pattern for request count:
two MetricEntityStateOneEnum instances (accepted + rejected) share the same
OTel counter, each bound to its own Tehuti sensor. Tehuti total_request
sensor remains Tehuti-only (OTel derives total at query time).

Also fixes accepted request counting: all non-rejected requests are now
recorded as ACCEPTED (previously only below-threshold requests were).
@m-nagarajan m-nagarajan force-pushed the mnagaraj/addOtelMetricsForServerLoadStats branch from 980a3a7 to d7e8bcb Compare May 4, 2026 22:25
@m-nagarajan m-nagarajan merged commit b519953 into linkedin:main May 4, 2026
184 of 186 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved PR has been approved

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants