Fix broken dashboard panels (Active Sessions, Tool Success Rate, API Error Rate) + add usage panels #11

Open
404pilo wants to merge 1 commit into ColeMurray:main from ai-ready-future:fix/dashboard-broken-panels

404pilo commented May 29, 2026


Summary

Three panels in the bundled Grafana dashboard never resolve against a real stack. This PR fixes their root causes and adds six high-value panels. Every query was verified against a live OTel Collector + Prometheus + Loki stack receiving real Claude Code telemetry.

Bugs fixed

1. "Active Sessions" always shows 0

The panel uses increase(claude_code_session_count_total[1h]), but session.count is a one-shot counter — it ticks once at session start and never updates. The Collector's Prometheus exporter drops idle series after the default metric_expiration (~5m), so the series disappears and increase() has nothing to compute.

Fix: derive active sessions from the continuously-updated active_time metric:

count(count by (session_id) (increase(claude_code_active_time_seconds_total{job="otel-collector"}[1h]) > 0))

…and raise the exporter's metric_expiration to 2h so one-shot counters (session, lines_of_code, commit, pull_request, code_edit_decision) survive between updates. Without this, "Lines of Code", "Code Changes", and "Development Activity" also flatline once idle.
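
To make the logic of the replacement query concrete, here is a minimal Python sketch of what `count(count by (session_id) (increase(...) > 0))` computes: the number of distinct sessions whose active_time counter grew during the window. The sample values and session IDs are hypothetical, and the sketch uses a plain end-minus-start difference rather than Prometheus's extrapolating `increase()`.

```python
from typing import Dict, Tuple


def count_active_sessions(samples: Dict[str, Tuple[float, float]]) -> int:
    """Count sessions whose active_time counter grew inside the window.

    `samples` maps a session_id to (counter at window start, counter at
    window end). This mirrors the shape of the PromQL query, not its exact
    arithmetic: Prometheus extrapolates increase(), this sketch does not.
    """
    return sum(1 for start, end in samples.values() if end - start > 0)


# Hypothetical one-hour window of active_time samples, in seconds.
window = {
    "session-a": (120.0, 480.0),  # grew: counted as active
    "session-b": (300.0, 300.0),  # flat: idle for the whole window
    "session-c": (0.0, 15.5),     # grew: counted as active
}

active = count_active_sessions(window)
print(active)
```

Because active_time keeps incrementing while a session is in use, this count tracks live sessions, whereas a counter that ticks once at session start gives the window nothing to measure after its first sample.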

2. "Tool Success Rate" and "API Error Rate" show "No data"

Both pipe Loki events through | json. But Claude Code's OTLP log events put the event name in the log body (e.g. claude_code.tool_result) and every field (success, tool_name, status_code, …) in structured-metadata labels. | json tries to parse the body, throws JSONParserErr on every line, and silently zeroes the panels.

Fix: filter the labels directly — | success="true", sum by (status_code) (…) — no | json stage.
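
The failure mode can be reproduced outside Loki. The Python sketch below models each log event as a body plus a dictionary of labels, as described above; the event records, tool names, and status code are hypothetical, and `json.loads` stands in for Loki's `| json` stage only by analogy.

```python
import json
from typing import Dict, List

# Hypothetical events shaped as the PR describes: the body is the literal
# event name, and every field lives in structured-metadata labels.
events: List[Dict] = [
    {"body": "claude_code.tool_result", "labels": {"tool_name": "Bash", "success": "true"}},
    {"body": "claude_code.tool_result", "labels": {"tool_name": "Edit", "success": "true"}},
    {"body": "claude_code.tool_result", "labels": {"tool_name": "Bash", "success": "false"}},
    {"body": "claude_code.api_error", "labels": {"status_code": "500"}},
]


def count_json_parse_failures(evts: List[Dict]) -> int:
    """Count events whose body cannot be parsed as JSON."""
    failures = 0
    for evt in evts:
        try:
            json.loads(evt["body"])
        except json.JSONDecodeError:
            failures += 1
    return failures


def tool_success_rate(evts: List[Dict]) -> float:
    """Compute the success rate from labels alone, never touching the body."""
    results = [e for e in evts if e["body"] == "claude_code.tool_result"]
    succeeded = [e for e in results if e["labels"].get("success") == "true"]
    return len(succeeded) / len(results)


print(count_json_parse_failures(events))
print(tool_success_rate(events))
```

Every body fails to parse because a bare event name is not valid JSON, while the label-based filter yields a usable ratio from the same records.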

New panels

  • Cache Hit Ratio: cacheRead / (cacheRead + input); cost efficiency at a glance
  • Cost by query_source — main vs subagent vs auxiliary (surfaces multi-agent cost runaway)
  • Token Spend by Model
  • Active Time: User vs CLI
  • Tool Decisions: Accept vs Reject — uses the decision label on claude_code.tool_decision
  • API Latency P95 by Model
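
As a worked example of the Cache Hit Ratio formula in the list above, here is a small Python sketch. The function name and the token counts are illustrative assumptions, and the handling of an empty window (returning `None`) is a design choice of this sketch, not something the dashboard query is confirmed to do.

```python
from typing import Optional


def cache_hit_ratio(cache_read_tokens: float, input_tokens: float) -> Optional[float]:
    """Return cacheRead / (cacheRead + input), or None when no tokens were seen."""
    total = cache_read_tokens + input_tokens
    if total == 0:
        # Avoid dividing by zero when the window contains no token usage.
        return None
    return cache_read_tokens / total


# Hypothetical window: 900 tokens served from cache, 100 fresh input tokens.
ratio = cache_hit_ratio(900, 100)
print(ratio)
```

A ratio near 1 means most prompt tokens were served from cache; a ratio near 0 means nearly all were billed as fresh input.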

Test plan

  • claude-code-dashboard.json is valid JSON (29 panels)
  • Every PromQL/LogQL query returns data against a live Prometheus (:9090) and Loki (:3100)
  • Collector restarts cleanly with metric_expiration: 2h
  • Grafana provisions the dashboard with no errors

🤖 Generated with Claude Code

Three panels query data that never resolves against a real stack:

- "Active Sessions" uses increase(claude_code_session_count_total), but
  session.count is a one-shot counter; the collector's Prometheus
  exporter drops it after the default ~5m metric_expiration, so the
  panel reads 0. Derive active sessions from the continuously-updated
  active_time metric instead, and raise the exporter's metric_expiration
  to 2h so one-shot counters (session / lines_of_code / commit / PR /
  code_edit_decision) survive between updates.
- "Tool Success Rate" and "API Error Rate" pipe Loki events through
  `| json`, but Claude Code log bodies are the literal event name and
  every field is a structured-metadata label. `| json` throws
  JSONParserErr and silently zeroes both panels. Filter labels directly
  (`| success="true"`, `sum by (status_code)`).

Adds six panels: Cache Hit Ratio, Cost by query_source (main vs subagent
vs auxiliary), Token Spend by Model, Active Time (user vs cli), Tool
Decision accept/reject, and API Latency P95 by model.

All queries verified against a live OTel Collector + Prometheus + Loki
stack receiving real Claude Code telemetry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>