Skip to content

feat: add OpenTelemetry tracing for tool calls#21

Open
priyank766 wants to merge 2 commits into
kubeflow:mainfrom
priyank766:feat/otel-tracing
Open

feat: add OpenTelemetry tracing for tool calls#21
priyank766 wants to merge 2 commits into
kubeflow:mainfrom
priyank766:feat/otel-tracing

Conversation

@priyank766
Copy link
Copy Markdown

Description

Adds optional OpenTelemetry tracing for tool calls in Kubeflow MCP Server.

  • Adds core/telemetry.py with setup_tracing() and get_tracer() plus safe no-op fallback when OTel deps are unavailable.
  • Instruments core.server._audit_wrap to create one span per tool invocation, set tool/persona/duration/success attributes, attach correlation_id, and record
    exceptions.
  • Wires tracing config through CLI/env/config:
    • --otel-endpoint
    • KUBEFLOW_MCP_OTEL_ENDPOINT
    • observability.otel_endpoint
  • Adds unit tests for no-op path, provider setup/reuse, endpoint validation, and span behavior.
  • Updates README with an Observability section.

Important compatibility note: correlation_id semantics are preserved and exposed as a span attribute; it is not remapped to OTel trace ID.

Type of Change

  • feat: New feature

Checklist

  • Tests pass locally (make test-python)
  • Linting passes (make verify)
  • Documentation updated (if applicable)
  • Commit messages follow conventional format

Related Issues

Fixes #18

@google-oss-prow
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: priyank <priyank8445@gmail.com>
@priyank766 priyank766 force-pushed the feat/otel-tracing branch from ec68142 to bec06a0 Compare May 15, 2026 16:51
@abhijeet-dhumal
Copy link
Copy Markdown
Member

Hey @priyank766 , this looks great 🚀
Thanks for working on this !

Comment thread pyproject.toml Outdated
"sphinx-design>=0.5",
]
otel = [
"opentelemetry-exporter-otlp>=1.25.0",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"opentelemetry-exporter-otlp>=1.25.0",
"opentelemetry-exporter-otlp-proto-http>=1.25.0",

this avoids installation of unnecessary GRPC subpackage

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the Suggestion
I will change the package as well

Comment thread kubeflow_mcp/core/server.py Outdated
exc_info=True,
)
raise
with tracer.start_as_current_span("tool_call") as span:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried it just now and it seems span name "tool_call" makes it hard to filter by tool in tracing UIs, Maybe we can consider naming it after the tool: f"tool:{tool_name}" or even just tool_name. The attribute tool.name is still good to keep for structured querying, but the name gives context at a glance.
wdyt?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for Reviewing @abhijeet-dhumal
I will push both changes and then you can review it later whenever you get time
Yes I think this naming scheme is much better I will change it .. 👍🏻

Signed-off-by: priyank <priyank8445@gmail.com>
Copilot AI review requested due to automatic review settings May 18, 2026 15:53
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds optional OpenTelemetry tracing for Kubeflow MCP tool calls, wiring tracing through configuration, CLI, runtime instrumentation, tests, and docs.

Changes:

  • Adds telemetry setup/no-op helpers and an otel optional dependency extra.
  • Instruments _audit_wrap spans with tool/persona/correlation/success/duration attributes.
  • Wires --otel-endpoint, config/env loading, startup logging, tests, and README documentation.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
kubeflow_mcp/core/telemetry.py Adds OpenTelemetry setup, validation, provider reuse, and no-op fallback helpers.
kubeflow_mcp/core/server.py Adds span creation and attributes around audited tool calls.
kubeflow_mcp/cli.py Adds --otel-endpoint and invokes tracing setup during server startup.
kubeflow_mcp/core/config.py Adds observability.otel_endpoint config/env support.
kubeflow_mcp/core/logging.py Includes tracing_enabled in structured startup logs.
tests/unit/core/test_telemetry.py Adds telemetry helper and span attribute tests.
kubeflow_mcp/cli_test.py Adds CLI tracing setup wiring tests.
README.md Documents optional tracing setup and span attributes.
pyproject.toml Adds the otel optional dependency extra.
uv.lock Locks OpenTelemetry optional dependencies and related resolution changes.
Comments suppressed due to low confidence (1)

kubeflow_mcp/core/server.py:126

  • The circuit-open early-return tracing path is not covered by the new span behavior tests. Add coverage for a breaker with can_execute() == False so the tool.success=false and duration attributes remain verified for this failure mode.
            if not breaker.can_execute():
                duration_ms = int((time.monotonic() - start) * 1000)
                span.set_attribute("tool.success", False)
                span.set_attribute("tool.duration_ms", duration_ms)
                logger.warning("circuit_open", extra={"tool": tool_name})
                return {
                    "error": f"Circuit breaker open for '{tool_name}' — K8s API may be degraded. Retries automatically after recovery timeout.",
                    "error_code": ErrorCode.CIRCUIT_OPEN,
                }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +102 to +105
with tracer.start_as_current_span(f"tool:{tool_name}") as span:
span.set_attribute("tool.name", tool_name)
span.set_attribute("kubeflow.persona", persona)
span.set_attribute("correlation_id", cid)
breaker.record_failure()
span.set_attribute("tool.success", False)
span.set_attribute("tool.duration_ms", duration_ms)
span.record_exception(exc)
Comment on lines +242 to +248
observability_file = file_config.get("observability", {})
observability = ObservabilityConfig(
otel_endpoint=os.getenv(
"KUBEFLOW_MCP_OTEL_ENDPOINT",
observability_file.get("otel_endpoint"),
)
)
Comment on lines +107 to +115
if _rate_limiter is not None and not _rate_limiter.acquire():
duration_ms = int((time.monotonic() - start) * 1000)
span.set_attribute("tool.success", False)
span.set_attribute("tool.duration_ms", duration_ms)
logger.warning("rate_limited", extra={"tool": tool_name})
return {
"error": "Rate limit exceeded. Retry after a brief pause.",
"error_code": ErrorCode.RATE_LIMITED,
}
Comment thread README.md
Comment on lines +173 to +180
- Install optional dependencies: `pip install ".[otel]"`
- Enable tracing with CLI flag or env var:

```bash
kubeflow-mcp serve --otel-endpoint http://localhost:4318/v1/traces
# or
export KUBEFLOW_MCP_OTEL_ENDPOINT=http://localhost:4318/v1/traces
kubeflow-mcp serve
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add OpenTelemetry tracing to tool calls

3 participants