TKAI-2: add session tracing propagation by figitaki · Pull Request #1 · tkhq/valet

figitaki · 2026-05-06T23:41:26Z

Summary

add a dependency-free OTLP/HTTP JSON tracer shared by worker and runner
propagate W3C traceparent through sandbox env, DO→runner protocol messages, and runner→OpenCode HTTP calls
instrument worker lifecycle/dispatch spans plus runner bootstrap, repo setup, turn, workflow, tool, and LLM usage spans
document TKAI-2 implementation coverage in the session tracing spec

Test plan

Unit tests
Smoke test
Deploy

Need to set.

OTEL_EXPORTER_OTLP_ENDPOINT=<grafana otlp endpoint>
OTEL_EXPORTER_OTLP_HEADERS=<auth headers>

Verification

pnpm --filter @valet/shared typecheck
pnpm --filter @valet/runner typecheck
pnpm --filter @valet/worker typecheck
pnpm --filter @valet/runner test -- src/prompt.test.ts
pnpm --filter @valet/worker exec vitest run src/durable-objects/prompt-queue.test.ts src/durable-objects/runner-link.test.ts

Refs TKAI-2

collectAlarmDeadlines() returned past timestamps that scheduleAlarm() passed directly to setAlarm(), causing Cloudflare to fire the alarm immediately in a tight loop. The main culprit: any prompt processing longer than 5 minutes created a past watchdog deadline that the health monitor correctly skipped (runner still connected) but never cleared. Two-part fix: - Add a 30s floor in scheduleAlarm() for past deadlines as a universal safety net against this class of bug - Skip unactionable deadlines in collectAlarmDeadlines(): watchdog when past + runner connected, idle-queue when runner disconnected

figitaki · 2026-05-07T03:01:51Z

+      this.tracer = new SimpleTracer({
+        serviceName: 'valet-worker',
+        endpoint: this.env.OTEL_EXPORTER_OTLP_ENDPOINT,
+        headers: this.env.OTEL_EXPORTER_OTLP_HEADERS,


I'm not sure why we need these?

These are the auth headers / endpoints that receive the telemetry since there's no intermediate grafana agent until this is running in k8s

…solver getChannelForMessage previously returned null for both "row missing" and "row exists but lacks channel context", making all drops log as no_prompt_row. This made it hard to tell legitimate drops (system prompts on orchestrator sessions) from unexpected ones. Now returns a discriminated union with a specific reason, so each call site passes the accurate drop reason to dropEmission.

Deployment previously used a single .env.deploy file with no environment separation. This made it impossible to deploy to dev and prod from the same checkout without manually swapping config files. Now all deploy, bootstrap, and destroy targets require ENVIRONMENT to be set explicitly (e.g. ENVIRONMENT=prod make deploy). Each environment gets its own config file (.env.deploy.dev, .env.deploy.prod) and its own isolated set of Cloudflare and Modal resources. - deploy.sh sources .env.deploy.${ENVIRONMENT} instead of .env.deploy - Makefile adds _require-env guard on deploy/bootstrap/destroy targets - Modal app name is parameterized via MODAL_APP_NAME env var (fixes mismatch between hardcoded "valet-backend" in app.py and Makefile's ${PROJECT_NAME}-backend derivation) - Local dev targets (dev-worker, dev-client, etc.) unchanged

…ploy feat: require ENVIRONMENT=dev|prod for all deploy targets

…collisions Modal webhook subdomains are globally unique per workspace, so two apps with the same labels (e.g. create-session) conflict. Each environment now prefixes labels with MODAL_LABEL_PREFIX (defaults to ${ENVIRONMENT}-) and the URL template includes the prefix so the worker's {label} replacement still produces correct URLs.

f3nry

Flush batching — see inline comment.

f3nry · 2026-05-07T21:50:03Z

+
+  private startPromptDispatchSpan(attrs: SpanAttributes): SimpleSpan {
+    const queuedAt = this.promptQueue.promptReceivedAt;
+    const waitMs = queuedAt > 0 ? Date.now() - queuedAt : 0;


Every call site does span.end() then this.flushTracing(), which fires a separate ctx.waitUntil(tracer.flush()) — and each flush() POSTs all accumulated spans to the OTLP endpoint. During a session lifecycle (spawn → dispatch → dispatch → hibernate → restore → dispatch…) this generates a lot of small HTTP requests to Grafana Cloud.

Consider batching: either flush on a timer/threshold inside SimpleTracer (e.g. every 5s or 50 spans), or consolidate flush calls to session-level boundaries rather than per-span. The Runner side already naturally batches because tool/LLM spans accumulate and only flush at the turn finally block — the Worker should do the same.

…-flush msToUnixNano multiplied a JS number by 1_000_000, exceeding MAX_SAFE_INTEGER for present-day epoch milliseconds and silently rounding span timestamps. Adds maxQueuedSpans + scheduleFlush options so a hot tracer can auto-flush in batches once a buffer threshold is hit, with a host hook for ctx.waitUntil. Addresses review feedback on #1 and yourbuddyconner#45.

Previously every span ended in a session-agent dispatch path called flushTracing() inline, fanning out to a separate ctx.waitUntil(flush()) per span and POSTing the buffer to OTLP each time. Across spawn → dispatch → dispatch → hibernate → restore lifecycles this generated many small HTTP requests against Grafana Cloud. Configures SimpleTracer with maxQueuedSpans=50 and a scheduleFlush hook that registers auto-flush promises with ctx.waitUntil. Drops the per-span flushTracing() calls at dispatch sites; explicit flushes remain at session-level boundaries (hibernate, terminate, wake guard, child session error finally blocks) where the DO may go idle before the threshold is hit. Addresses f3nry review on #1.

The plugin-sync middleware awaited syncPluginsOnce on every request, which on cold start performed ~47 serial D1 writes before the route handler even ran. Public latency-critical endpoints like /auth/providers got stuck behind it, causing the login page to show its loading skeleton (the dark gray pulse-loaders) for tens of seconds — long enough that users reported "no sign-in buttons." Move the sync into ctx.waitUntil so it runs after the response is returned. The registry sync is already idempotent and best-effort, so the first cold-start request seeing slightly-stale content is fine.

Even with the sync running in the background, kicking it off on unauthenticated traffic (/auth/providers, /auth/<provider> redirects, /health) wastes D1 capacity on the cold-start path that's most visible to end users. Skip the sync entirely for those prefixes — authenticated traffic on /api/* still drives it.

Previously, if /auth/providers errored out (default retry=1), the loading skeleton would disappear, providers would be undefined, and the card would render with just the title and zero buttons — visually identical to a "no login providers configured" state. Users had no indication anything was wrong and no way to recover short of a full page reload. Render an explicit error message with a Retry button when the query fails, and bump retry to 3 for this specific query so transient worker cold-starts don't trip it.

Add minimal Progressive Web App support so the Valet dashboard can be installed as a desktop app from Chrome (and other browsers that support PWA). This eliminates the need to hunt through browser tabs to find it. Changes: - Add web app manifest (manifest.json) with app name, icons, and standalone display mode - Add app icons (192x192, 512x512, 512x512 maskable, apple-touch-icon) matching the existing dark theme with cyan 'V' branding - Add minimal service worker (sw.js) with fetch handler to satisfy Chrome's PWA installability criteria — no offline caching - Register service worker from main.tsx on page load - Add manifest link, apple-touch-icon, and apple-mobile-web-app meta tags to index.html This is intentionally minimal — no offline support, no push notifications, no caching strategy. Just enough for the install prompt. Includes a TODO for future ACP (Agent Client Protocol) integration to bridge cloud-hosted agent sessions with local dev tools.

Add initial thoughts for how incident scribing works.

The interactive handler returned 200 to Slack right away (to meet the 3-second ack deadline) and only updated the message via chat.update after the DO finished executing the approved action. For slow tool calls that gap was many seconds, during which Slack dropped its spinner but the buttons stayed enabled — letting users re-click Approve/Deny while processing was already underway. Now we POST to the payload's response_url before scheduling the DO call to replace the actions block with a "⏳ Processing…" context. The DO's existing final-state chat.update still runs at the end and overwrites with "✅ Approved by …" / "❌ Denied by …".

fix: unblock /auth/providers on cold start + show error UI on login

Move slack-app-manifest.json from packages/worker/ to packages/plugin-slack/ where it belongs. Add assistant_view feature block required by the assistant:write scope. Update "AI coding agent" framing to "AI coworker" across manifest, README, CLAUDE.md, and slack-setup docs.

Linear's MCP OAuth server only supports the 'write' scope (which implies read access). Requesting 'read' caused an invalid_scope error in prod where the OAuth client was freshly registered.

assembleRepoEnv used a raw DB lookup that bypassed the credential service's token refresh logic, causing 403 push errors when the user's 8-hour GitHub App OAuth token expired. All credential resolution paths now use getCredential() which checks expiry and auto-refreshes via the App's OAuth refresh endpoint. Additional fixes in this commit: - Add push-permission validation (check permissions.push, not just HTTP status) and 401 force-refresh retry for legacy credentials - Migrate repos.ts and repo-providers.ts from old app_install credential model to github_installations + on-demand token minting - Remove dead repo-providers OAuth link/callback flow and stale manifest callback URL - Add comments throughout clarifying the unified GitHub App auth model (no classic OAuth App exists)

fix: strip Slack approval buttons immediately on click

…hread Child session idle/completion events were delivered to the orchestrator's main thread instead of the thread that spawned the child. The root cause was that parentThreadId was only stored in the child DO's transient SQL state, which is lost if the DO is re-initialized. - Persist parent_thread_id in D1 sessions table as durable fallback - Fall back to D1 value in notifyParentEvent when DO state is empty - Add hydration to handleSystemMessage's direct dispatch path so thread channels resolve the correct OpenCode session (was missing, unlike handlePrompt and sendNextQueuedPrompt which already hydrate) - Extract ensureThreadOcSessionHydrated helper to deduplicate the hydration pattern across dispatch paths - Add diagnostic logging at spawn, notify, and receive points

Drive files.list calls default corpora to 'user', hiding org-shared docs. Add a driveCorpora org setting (user/domain/allDrives) that flows through guardConfig to the Google Workspace plugin, letting admins control file discovery scope from the admin UI.

…i-49-drive-corpora-setting feat: configurable Drive corpora org setting (TKAI-49)

The GET /integrations/google_workspace/labels endpoint used stale tokens without retry logic, unlike action execution which detects 401s and force-refreshes. Extract the fetch into a helper, detect 401 on the first attempt, force-refresh the credential, and retry once.

… stop Channel followups were never resolved when a prompt completed — only when the agent explicitly called channel_reply with follow_up=true. This left pending followup rows that fired stale reminders via the alarm handler, causing the agent to re-attempt already-completed work and post contradictory messages to Slack. Resolve followups in handlePromptComplete (including the thread-origin fallback for web UI messages targeting Slack-originated threads), bulk- resolve on hibernation, and delete on session stop.

Slack forwarded messages store their content in event.attachments with is_msg_unfurl: true, but parseInbound only read event.text (the link URL). Extract unfurl text/fallback and include it in the parsed message. Also fix pre-existing test that used 11MB for a non-image file size limit check when the actual limit is 25MB.

Cloudflare's minutely cron occasionally skips invocations. When it misses the exact minute a trigger is scheduled for, that trigger silently fails until the next matching time (24h later for daily triggers). Add a two-pass dispatch: Pass 1 fires triggers matching the current minute (existing behavior). Pass 2 scans backward up to 4 hours from each trigger's last_run_at to find and dispatch missed ticks. The existing schedule_ticks dedup table prevents double-dispatch. Also fixes: - Midnight bug: use hourCycle:'h23' instead of hour12:false (V8 returns "24" for midnight with hour12:false) - Orchestrator tick ordering: claim tick before dispatch, release on retriable failure (prevents both burned ticks and duplicate prompts) - Workflow enqueue ordering: update last_run_at after enqueue succeeds - Tick table cleanup: nightly batched delete of rows older than 7 days - Type safety: remove as-any from getActiveScheduleTriggers, add config.cron guard, add Invalid Date guard on last_run_at - Extract cron functions to lib/cron.ts with 40 unit tests

The empty-state copy said "No tools enabled. This persona will only have built-in coding tools." but the backend treats an empty whitelist as unrestricted (all tools available). Updated to accurately describe the restriction semantics.

…agents When a session spawns a task (subagent), the parent session's sync prompt timeout (5 min) would fire because the parent receives no SSE events while the subagent works. This caused a spurious model failover mid-task. Two fixes: - Use OpenCode's server.heartbeat (every ~10s) to reset the sync timeout when tools are in running/pending state - When the timeout fires, check for running tools before aborting — if tools are still active, extend the timeout instead

…d rule

63MB of docker/opencode/node_modules was being uploaded every deploy, only to be overwritten by bun install during image build.

…iginating thread The pending followup banner, thinking indicator, and promote/edit/cancel actions were displaying globally across all threads instead of only in the thread that owns the queued work. Include threadId in agentStatus broadcasts from the server and filter all affected UI elements by the active thread. Closes TKAI-94

When clicking "new thread," focus the text input so the user can start typing immediately without an extra click. Closes TKAI-99

…proval-overrides feat: add per-user approval policy overrides

… refresh Session-scoped "allow for this session" overrides were effectively permanent on orchestrator sessions because they never reach a terminal status. Now they expire when the sandbox is recycled — on hibernation (idle 15min) and explicit refresh — matching the original intent that ephemeral approvals last only as long as the OpenCode instance. Also compact the approval prompt buttons from a vertical listbox to a horizontal button row to reduce vertical space.

The soft-expire approach (setting expiresAt on expired rows) created ghost rows that every query path had to filter out, leading to a cascade of bugs: settings page showing stale overrides, upserts reviving expired rows, partial unique index collisions on re-grants. Switch to hard DELETE — the table only contains active rows, so queries and indexes work without expiry filtering. Per-invocation audit trail is preserved on action_invocations (user_override_id is ON DELETE SET NULL, and policy_source/lifetime/scope are independent columns). Also adds the action-policy-overrides explainer spec and compacts approval prompt buttons to a horizontal layout.

…-flush msToUnixNano multiplied a JS number by 1_000_000, exceeding MAX_SAFE_INTEGER for present-day epoch milliseconds and silently rounding span timestamps. Adds maxQueuedSpans + scheduleFlush options so a hot tracer can auto-flush in batches once a buffer threshold is hit, with a host hook for ctx.waitUntil. Addresses review feedback on #1 and yourbuddyconner#45.

The prompt handler signature had grown to 14 positional args (messageId, content, model, author, modelPreferences, attachments, channelType, channelId, opencodeSessionId, continuationContext, threadId, replyChannelType, replyChannelId, traceparent), making call sites and the wire shape unreviewable. Introduces PromptDispatch / PromptHandlerFn so onPrompt and handlePrompt take a single typed object. Updates the agent-client dispatcher, PromptHandler.handlePrompt, the bin.ts callback, and the prompt unit tests to use the new shape. Addresses figitaki review on yourbuddyconner#45.

Previously every span ended in a session-agent dispatch path called flushTracing() inline, fanning out to a separate ctx.waitUntil(flush()) per span and POSTing the buffer to OTLP each time. Across spawn → dispatch → dispatch → hibernate → restore lifecycles this generated many small HTTP requests against Grafana Cloud. Configures SimpleTracer with maxQueuedSpans=50 and a scheduleFlush hook that registers auto-flush promises with ctx.waitUntil. Drops the per-span flushTracing() calls at dispatch sites; explicit flushes remain at session-level boundaries (hibernate, terminate, wake guard, child session error finally blocks) where the DO may go idle before the threshold is hit. Addresses f3nry review on #1.

figitaki requested review from benturnkey and f3nry May 7, 2026 02:55

figitaki commented May 7, 2026

View reviewed changes

yourbuddyconner added 4 commits May 7, 2026 10:22

Merge pull request #47 from yourbuddyconner/feat/multi-environment-de…

087bf4e

…ploy feat: require ENVIRONMENT=dev|prod for all deploy targets

f3nry reviewed May 7, 2026

View reviewed changes

figitaki and others added 11 commits May 8, 2026 13:53

specs: Add incident scribe spec

56bbd32

Add initial thoughts for how incident scribing works.

Merge pull request #48 from tkhq/fix/auth-providers-loading

f2204e4

fix: unblock /auth/providers on cold start + show error UI on login

fix: remove unsupported 'read' scope from Linear MCP OAuth

f497dba

Linear's MCP OAuth server only supports the 'write' scope (which implies read access). Requesting 'read' caused an invalid_scope error in prod where the OAuth client was freshly registered.

Merge pull request #51 from tkhq/fix/slack-approval-button-lag

e9c85af

fix: strip Slack approval buttons immediately on click

figitaki requested a review from f3nry May 9, 2026 00:40

yourbuddyconner added 7 commits May 11, 2026 10:37

fix: add driveCorpora to useUpdateOrgSettings mutation type

dfa3be9

Merge pull request yourbuddyconner#53 from yourbuddyconner/conner/tka…

bb83032

…i-49-drive-corpora-setting feat: configurable Drive corpora org setting (TKAI-49)

yourbuddyconner and others added 29 commits May 19, 2026 11:28

docs: address approval overrides spec review

a906fe1

docs: add approval override implementation plan

6eb7d96

feat(worker): add action policy override schema

648af0b

feat(worker): resolve user action policy overrides

b2d28b6

feat(worker): apply effective action policy resolution

ac408c6

feat(worker): add user action policy override routes

ce1b31d

feat(worker): expire session action policy overrides

ad78ac9

feat(worker): add approval override prompt actions

87fbb7e

feat(channels): support approval override choices

41c7ba9

feat(client): add approval override chooser

8f5ca68

feat(client): manage user action policy overrides

f949c14

fix(worker): harden approval override resolution

b5378f3

feat: add per-user approval policy overrides

71feeab

fix: address approval override review feedback

c3f13a1

chore: bump sandbox image for sync timeout fix, document image rebuil…

c0bfe48

…d rule

fix(backend): exclude node_modules from Modal upload

56c6e9b

63MB of docker/opencode/node_modules was being uploaded every deploy, only to be overwritten by bun install during image build.

fix(client): auto-focus chat input on new thread creation

8125f06

When clicking "new thread," focus the text input so the user can start typing immediately without an extra click. Closes TKAI-99

Merge pull request yourbuddyconner#65 from yourbuddyconner/tkai-64-ap…

97dc5e4

…proval-overrides feat: add per-user approval policy overrides

feat: add session tracing propagation

c5dd851

feat: align genai tracing with openlit semantics

a980317

figitaki force-pushed the carey/tkai-2-session-tracing branch from d1faab0 to 434efb8 Compare May 21, 2026 10:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TKAI-2: add session tracing propagation#1

TKAI-2: add session tracing propagation#1
figitaki wants to merge 138 commits into
mainfrom
carey/tkai-2-session-tracing

figitaki commented May 6, 2026

Uh oh!

figitaki May 7, 2026

Uh oh!

f3nry May 7, 2026

Uh oh!

f3nry left a comment

Uh oh!

f3nry May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

figitaki commented May 6, 2026

Summary

Test plan

Verification

Uh oh!

figitaki May 7, 2026

Choose a reason for hiding this comment

Uh oh!

f3nry May 7, 2026

Choose a reason for hiding this comment

Uh oh!

f3nry left a comment

Choose a reason for hiding this comment

Uh oh!

f3nry May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants