Skip to content

TKAI-2: add session tracing propagation#1

Open
figitaki wants to merge 138 commits into
mainfrom
carey/tkai-2-session-tracing
Open

TKAI-2: add session tracing propagation#1
figitaki wants to merge 138 commits into
mainfrom
carey/tkai-2-session-tracing

Conversation

@figitaki

@figitaki figitaki commented May 6, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • add a dependency-free OTLP/HTTP JSON tracer shared by worker and runner
  • propagate W3C traceparent through sandbox env, DO→runner protocol messages, and runner→OpenCode HTTP calls
  • instrument worker lifecycle/dispatch spans plus runner bootstrap, repo setup, turn, workflow, tool, and LLM usage spans
  • document TKAI-2 implementation coverage in the session tracing spec

Test plan

  • Unit tests
  • Smoke test
  • Deploy

Need to set.

OTEL_EXPORTER_OTLP_ENDPOINT=<grafana otlp endpoint>
OTEL_EXPORTER_OTLP_HEADERS=<auth headers>

Verification

  • pnpm --filter @valet/shared typecheck
  • pnpm --filter @valet/runner typecheck
  • pnpm --filter @valet/worker typecheck
  • pnpm --filter @valet/runner test -- src/prompt.test.ts
  • pnpm --filter @valet/worker exec vitest run src/durable-objects/prompt-queue.test.ts src/durable-objects/runner-link.test.ts

Refs TKAI-2

collectAlarmDeadlines() returned past timestamps that scheduleAlarm()
passed directly to setAlarm(), causing Cloudflare to fire the alarm
immediately in a tight loop. The main culprit: any prompt processing
longer than 5 minutes created a past watchdog deadline that the health
monitor correctly skipped (runner still connected) but never cleared.

Two-part fix:
- Add a 30s floor in scheduleAlarm() for past deadlines as a universal
  safety net against this class of bug
- Skip unactionable deadlines in collectAlarmDeadlines(): watchdog when
  past + runner connected, idle-queue when runner disconnected
@figitaki figitaki requested review from benturnkey and f3nry May 7, 2026 02:55
this.tracer = new SimpleTracer({
serviceName: 'valet-worker',
endpoint: this.env.OTEL_EXPORTER_OTLP_ENDPOINT,
headers: this.env.OTEL_EXPORTER_OTLP_HEADERS,

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why we need these?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the auth headers / endpoints that receive the telemetry since there's no intermediate grafana agent until this is running in k8s

…solver

getChannelForMessage previously returned null for both "row missing" and
"row exists but lacks channel context", making all drops log as
no_prompt_row. This made it hard to tell legitimate drops (system prompts
on orchestrator sessions) from unexpected ones.

Now returns a discriminated union with a specific reason, so each call
site passes the accurate drop reason to dropEmission.
Deployment previously used a single .env.deploy file with no environment
separation. This made it impossible to deploy to dev and prod from the
same checkout without manually swapping config files.

Now all deploy, bootstrap, and destroy targets require ENVIRONMENT to be
set explicitly (e.g. ENVIRONMENT=prod make deploy). Each environment gets
its own config file (.env.deploy.dev, .env.deploy.prod) and its own
isolated set of Cloudflare and Modal resources.

- deploy.sh sources .env.deploy.${ENVIRONMENT} instead of .env.deploy
- Makefile adds _require-env guard on deploy/bootstrap/destroy targets
- Modal app name is parameterized via MODAL_APP_NAME env var (fixes
  mismatch between hardcoded "valet-backend" in app.py and Makefile's
  ${PROJECT_NAME}-backend derivation)
- Local dev targets (dev-worker, dev-client, etc.) unchanged
…ploy

feat: require ENVIRONMENT=dev|prod for all deploy targets
…collisions

Modal webhook subdomains are globally unique per workspace, so two apps
with the same labels (e.g. create-session) conflict. Each environment
now prefixes labels with MODAL_LABEL_PREFIX (defaults to ${ENVIRONMENT}-)
and the URL template includes the prefix so the worker's {label}
replacement still produces correct URLs.

@f3nry f3nry left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flush batching — see inline comment.


private startPromptDispatchSpan(attrs: SpanAttributes): SimpleSpan {
const queuedAt = this.promptQueue.promptReceivedAt;
const waitMs = queuedAt > 0 ? Date.now() - queuedAt : 0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every call site does span.end() then this.flushTracing(), which fires a separate ctx.waitUntil(tracer.flush()) — and each flush() POSTs all accumulated spans to the OTLP endpoint. During a session lifecycle (spawn → dispatch → dispatch → hibernate → restore → dispatch…) this generates a lot of small HTTP requests to Grafana Cloud.

Consider batching: either flush on a timer/threshold inside SimpleTracer (e.g. every 5s or 50 spans), or consolidate flush calls to session-level boundaries rather than per-span. The Runner side already naturally batches because tool/LLM spans accumulate and only flush at the turn finally block — the Worker should do the same.

figitaki added a commit that referenced this pull request May 8, 2026
…-flush

msToUnixNano multiplied a JS number by 1_000_000, exceeding MAX_SAFE_INTEGER
for present-day epoch milliseconds and silently rounding span timestamps.

Adds maxQueuedSpans + scheduleFlush options so a hot tracer can auto-flush
in batches once a buffer threshold is hit, with a host hook for ctx.waitUntil.

Addresses review feedback on #1 and yourbuddyconner#45.
figitaki added a commit that referenced this pull request May 8, 2026
Previously every span ended in a session-agent dispatch path called
flushTracing() inline, fanning out to a separate ctx.waitUntil(flush())
per span and POSTing the buffer to OTLP each time. Across spawn →
dispatch → dispatch → hibernate → restore lifecycles this generated
many small HTTP requests against Grafana Cloud.

Configures SimpleTracer with maxQueuedSpans=50 and a scheduleFlush hook
that registers auto-flush promises with ctx.waitUntil. Drops the
per-span flushTracing() calls at dispatch sites; explicit flushes remain
at session-level boundaries (hibernate, terminate, wake guard, child
session error finally blocks) where the DO may go idle before the
threshold is hit.

Addresses f3nry review on #1.
figitaki and others added 11 commits May 8, 2026 13:53
The plugin-sync middleware awaited syncPluginsOnce on every request,
which on cold start performed ~47 serial D1 writes before the route
handler even ran. Public latency-critical endpoints like
/auth/providers got stuck behind it, causing the login page to show
its loading skeleton (the dark gray pulse-loaders) for tens of
seconds — long enough that users reported "no sign-in buttons."

Move the sync into ctx.waitUntil so it runs after the response is
returned. The registry sync is already idempotent and best-effort, so
the first cold-start request seeing slightly-stale content is fine.
Even with the sync running in the background, kicking it off on
unauthenticated traffic (/auth/providers, /auth/<provider> redirects,
/health) wastes D1 capacity on the cold-start path that's most
visible to end users. Skip the sync entirely for those prefixes —
authenticated traffic on /api/* still drives it.
Previously, if /auth/providers errored out (default retry=1), the
loading skeleton would disappear, providers would be undefined, and
the card would render with just the title and zero buttons — visually
identical to a "no login providers configured" state. Users had no
indication anything was wrong and no way to recover short of a full
page reload.

Render an explicit error message with a Retry button when the query
fails, and bump retry to 3 for this specific query so transient
worker cold-starts don't trip it.
Add minimal Progressive Web App support so the Valet dashboard can be
installed as a desktop app from Chrome (and other browsers that support
PWA). This eliminates the need to hunt through browser tabs to find it.

Changes:
- Add web app manifest (manifest.json) with app name, icons, and
  standalone display mode
- Add app icons (192x192, 512x512, 512x512 maskable, apple-touch-icon)
  matching the existing dark theme with cyan 'V' branding
- Add minimal service worker (sw.js) with fetch handler to satisfy
  Chrome's PWA installability criteria — no offline caching
- Register service worker from main.tsx on page load
- Add manifest link, apple-touch-icon, and apple-mobile-web-app meta
  tags to index.html

This is intentionally minimal — no offline support, no push
notifications, no caching strategy. Just enough for the install prompt.

Includes a TODO for future ACP (Agent Client Protocol) integration to
bridge cloud-hosted agent sessions with local dev tools.
Add initial thoughts for how incident scribing works.
The interactive handler returned 200 to Slack right away (to meet the
3-second ack deadline) and only updated the message via chat.update
after the DO finished executing the approved action. For slow tool
calls that gap was many seconds, during which Slack dropped its
spinner but the buttons stayed enabled — letting users re-click
Approve/Deny while processing was already underway.

Now we POST to the payload's response_url before scheduling the DO
call to replace the actions block with a "⏳ Processing…" context.
The DO's existing final-state chat.update still runs at the end and
overwrites with "✅ Approved by …" / "❌ Denied by …".
fix: unblock /auth/providers on cold start + show error UI on login
Move slack-app-manifest.json from packages/worker/ to
packages/plugin-slack/ where it belongs. Add assistant_view feature
block required by the assistant:write scope. Update "AI coding agent"
framing to "AI coworker" across manifest, README, CLAUDE.md, and
slack-setup docs.
Linear's MCP OAuth server only supports the 'write' scope (which
implies read access). Requesting 'read' caused an invalid_scope error
in prod where the OAuth client was freshly registered.
assembleRepoEnv used a raw DB lookup that bypassed the credential
service's token refresh logic, causing 403 push errors when the
user's 8-hour GitHub App OAuth token expired. All credential
resolution paths now use getCredential() which checks expiry and
auto-refreshes via the App's OAuth refresh endpoint.

Additional fixes in this commit:
- Add push-permission validation (check permissions.push, not just
  HTTP status) and 401 force-refresh retry for legacy credentials
- Migrate repos.ts and repo-providers.ts from old app_install
  credential model to github_installations + on-demand token minting
- Remove dead repo-providers OAuth link/callback flow and stale
  manifest callback URL
- Add comments throughout clarifying the unified GitHub App auth
  model (no classic OAuth App exists)
fix: strip Slack approval buttons immediately on click
@figitaki figitaki requested a review from f3nry May 9, 2026 00:40
…hread

Child session idle/completion events were delivered to the orchestrator's
main thread instead of the thread that spawned the child. The root cause
was that parentThreadId was only stored in the child DO's transient SQL
state, which is lost if the DO is re-initialized.

- Persist parent_thread_id in D1 sessions table as durable fallback
- Fall back to D1 value in notifyParentEvent when DO state is empty
- Add hydration to handleSystemMessage's direct dispatch path so thread
  channels resolve the correct OpenCode session (was missing, unlike
  handlePrompt and sendNextQueuedPrompt which already hydrate)
- Extract ensureThreadOcSessionHydrated helper to deduplicate the
  hydration pattern across dispatch paths
- Add diagnostic logging at spawn, notify, and receive points
Drive files.list calls default corpora to 'user', hiding org-shared docs.
Add a driveCorpora org setting (user/domain/allDrives) that flows through
guardConfig to the Google Workspace plugin, letting admins control file
discovery scope from the admin UI.
…i-49-drive-corpora-setting

feat: configurable Drive corpora org setting (TKAI-49)
The GET /integrations/google_workspace/labels endpoint used stale tokens
without retry logic, unlike action execution which detects 401s and
force-refreshes. Extract the fetch into a helper, detect 401 on the
first attempt, force-refresh the credential, and retry once.
… stop

Channel followups were never resolved when a prompt completed — only when
the agent explicitly called channel_reply with follow_up=true. This left
pending followup rows that fired stale reminders via the alarm handler,
causing the agent to re-attempt already-completed work and post
contradictory messages to Slack.

Resolve followups in handlePromptComplete (including the thread-origin
fallback for web UI messages targeting Slack-originated threads), bulk-
resolve on hibernation, and delete on session stop.
Slack forwarded messages store their content in event.attachments with
is_msg_unfurl: true, but parseInbound only read event.text (the link URL).
Extract unfurl text/fallback and include it in the parsed message.

Also fix pre-existing test that used 11MB for a non-image file size limit
check when the actual limit is 25MB.
yourbuddyconner and others added 29 commits May 19, 2026 11:28
Cloudflare's minutely cron occasionally skips invocations. When it
misses the exact minute a trigger is scheduled for, that trigger
silently fails until the next matching time (24h later for daily
triggers).

Add a two-pass dispatch: Pass 1 fires triggers matching the current
minute (existing behavior). Pass 2 scans backward up to 4 hours from
each trigger's last_run_at to find and dispatch missed ticks. The
existing schedule_ticks dedup table prevents double-dispatch.

Also fixes:
- Midnight bug: use hourCycle:'h23' instead of hour12:false (V8 returns
  "24" for midnight with hour12:false)
- Orchestrator tick ordering: claim tick before dispatch, release on
  retriable failure (prevents both burned ticks and duplicate prompts)
- Workflow enqueue ordering: update last_run_at after enqueue succeeds
- Tick table cleanup: nightly batched delete of rows older than 7 days
- Type safety: remove as-any from getActiveScheduleTriggers, add
  config.cron guard, add Invalid Date guard on last_run_at
- Extract cron functions to lib/cron.ts with 40 unit tests
The empty-state copy said "No tools enabled. This persona will only
have built-in coding tools." but the backend treats an empty whitelist
as unrestricted (all tools available). Updated to accurately describe
the restriction semantics.
…agents

When a session spawns a task (subagent), the parent session's sync prompt
timeout (5 min) would fire because the parent receives no SSE events while
the subagent works. This caused a spurious model failover mid-task.

Two fixes:
- Use OpenCode's server.heartbeat (every ~10s) to reset the sync timeout
  when tools are in running/pending state
- When the timeout fires, check for running tools before aborting — if
  tools are still active, extend the timeout instead
63MB of docker/opencode/node_modules was being uploaded every deploy,
only to be overwritten by bun install during image build.
…iginating thread

The pending followup banner, thinking indicator, and promote/edit/cancel
actions were displaying globally across all threads instead of only in
the thread that owns the queued work. Include threadId in agentStatus
broadcasts from the server and filter all affected UI elements by the
active thread.

Closes TKAI-94
When clicking "new thread," focus the text input so the user can start
typing immediately without an extra click.

Closes TKAI-99
…proval-overrides

feat: add per-user approval policy overrides
… refresh

Session-scoped "allow for this session" overrides were effectively
permanent on orchestrator sessions because they never reach a terminal
status. Now they expire when the sandbox is recycled — on hibernation
(idle 15min) and explicit refresh — matching the original intent that
ephemeral approvals last only as long as the OpenCode instance.

Also compact the approval prompt buttons from a vertical listbox to a
horizontal button row to reduce vertical space.
The soft-expire approach (setting expiresAt on expired rows) created
ghost rows that every query path had to filter out, leading to a
cascade of bugs: settings page showing stale overrides, upserts
reviving expired rows, partial unique index collisions on re-grants.

Switch to hard DELETE — the table only contains active rows, so
queries and indexes work without expiry filtering. Per-invocation
audit trail is preserved on action_invocations (user_override_id is
ON DELETE SET NULL, and policy_source/lifetime/scope are independent
columns).

Also adds the action-policy-overrides explainer spec and compacts
approval prompt buttons to a horizontal layout.
…-flush

msToUnixNano multiplied a JS number by 1_000_000, exceeding MAX_SAFE_INTEGER
for present-day epoch milliseconds and silently rounding span timestamps.

Adds maxQueuedSpans + scheduleFlush options so a hot tracer can auto-flush
in batches once a buffer threshold is hit, with a host hook for ctx.waitUntil.

Addresses review feedback on #1 and yourbuddyconner#45.
The prompt handler signature had grown to 14 positional args (messageId,
content, model, author, modelPreferences, attachments, channelType,
channelId, opencodeSessionId, continuationContext, threadId,
replyChannelType, replyChannelId, traceparent), making call sites and
the wire shape unreviewable.

Introduces PromptDispatch / PromptHandlerFn so onPrompt and handlePrompt
take a single typed object. Updates the agent-client dispatcher,
PromptHandler.handlePrompt, the bin.ts callback, and the prompt unit
tests to use the new shape.

Addresses figitaki review on yourbuddyconner#45.
Previously every span ended in a session-agent dispatch path called
flushTracing() inline, fanning out to a separate ctx.waitUntil(flush())
per span and POSTing the buffer to OTLP each time. Across spawn →
dispatch → dispatch → hibernate → restore lifecycles this generated
many small HTTP requests against Grafana Cloud.

Configures SimpleTracer with maxQueuedSpans=50 and a scheduleFlush hook
that registers auto-flush promises with ctx.waitUntil. Drops the
per-span flushTracing() calls at dispatch sites; explicit flushes remain
at session-level boundaries (hibernate, terminate, wake guard, child
session error finally blocks) where the DO may go idle before the
threshold is hit.

Addresses f3nry review on #1.
@figitaki figitaki force-pushed the carey/tkai-2-session-tracing branch from d1faab0 to 434efb8 Compare May 21, 2026 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants