TKAI-2: add session tracing propagation#1
Conversation
collectAlarmDeadlines() returned past timestamps that scheduleAlarm() passed directly to setAlarm(), causing Cloudflare to fire the alarm immediately in a tight loop. The main culprit: any prompt processing longer than 5 minutes created a past watchdog deadline that the health monitor correctly skipped (runner still connected) but never cleared. Two-part fix: - Add a 30s floor in scheduleAlarm() for past deadlines as a universal safety net against this class of bug - Skip unactionable deadlines in collectAlarmDeadlines(): watchdog when past + runner connected, idle-queue when runner disconnected
| this.tracer = new SimpleTracer({ | ||
| serviceName: 'valet-worker', | ||
| endpoint: this.env.OTEL_EXPORTER_OTLP_ENDPOINT, | ||
| headers: this.env.OTEL_EXPORTER_OTLP_HEADERS, |
There was a problem hiding this comment.
I'm not sure why we need these?
There was a problem hiding this comment.
These are the auth headers / endpoints that receive the telemetry since there's no intermediate grafana agent until this is running in k8s
…solver getChannelForMessage previously returned null for both "row missing" and "row exists but lacks channel context", making all drops log as no_prompt_row. This made it hard to tell legitimate drops (system prompts on orchestrator sessions) from unexpected ones. Now returns a discriminated union with a specific reason, so each call site passes the accurate drop reason to dropEmission.
Deployment previously used a single .env.deploy file with no environment
separation. This made it impossible to deploy to dev and prod from the
same checkout without manually swapping config files.
Now all deploy, bootstrap, and destroy targets require ENVIRONMENT to be
set explicitly (e.g. ENVIRONMENT=prod make deploy). Each environment gets
its own config file (.env.deploy.dev, .env.deploy.prod) and its own
isolated set of Cloudflare and Modal resources.
- deploy.sh sources .env.deploy.${ENVIRONMENT} instead of .env.deploy
- Makefile adds _require-env guard on deploy/bootstrap/destroy targets
- Modal app name is parameterized via MODAL_APP_NAME env var (fixes
mismatch between hardcoded "valet-backend" in app.py and Makefile's
${PROJECT_NAME}-backend derivation)
- Local dev targets (dev-worker, dev-client, etc.) unchanged
…ploy feat: require ENVIRONMENT=dev|prod for all deploy targets
…collisions
Modal webhook subdomains are globally unique per workspace, so two apps
with the same labels (e.g. create-session) conflict. Each environment
now prefixes labels with MODAL_LABEL_PREFIX (defaults to ${ENVIRONMENT}-)
and the URL template includes the prefix so the worker's {label}
replacement still produces correct URLs.
|
|
||
| private startPromptDispatchSpan(attrs: SpanAttributes): SimpleSpan { | ||
| const queuedAt = this.promptQueue.promptReceivedAt; | ||
| const waitMs = queuedAt > 0 ? Date.now() - queuedAt : 0; |
There was a problem hiding this comment.
Every call site does span.end() then this.flushTracing(), which fires a separate ctx.waitUntil(tracer.flush()) — and each flush() POSTs all accumulated spans to the OTLP endpoint. During a session lifecycle (spawn → dispatch → dispatch → hibernate → restore → dispatch…) this generates a lot of small HTTP requests to Grafana Cloud.
Consider batching: either flush on a timer/threshold inside SimpleTracer (e.g. every 5s or 50 spans), or consolidate flush calls to session-level boundaries rather than per-span. The Runner side already naturally batches because tool/LLM spans accumulate and only flush at the turn finally block — the Worker should do the same.
…-flush msToUnixNano multiplied a JS number by 1_000_000, exceeding MAX_SAFE_INTEGER for present-day epoch milliseconds and silently rounding span timestamps. Adds maxQueuedSpans + scheduleFlush options so a hot tracer can auto-flush in batches once a buffer threshold is hit, with a host hook for ctx.waitUntil. Addresses review feedback on #1 and yourbuddyconner#45.
Previously every span ended in a session-agent dispatch path called flushTracing() inline, fanning out to a separate ctx.waitUntil(flush()) per span and POSTing the buffer to OTLP each time. Across spawn → dispatch → dispatch → hibernate → restore lifecycles this generated many small HTTP requests against Grafana Cloud. Configures SimpleTracer with maxQueuedSpans=50 and a scheduleFlush hook that registers auto-flush promises with ctx.waitUntil. Drops the per-span flushTracing() calls at dispatch sites; explicit flushes remain at session-level boundaries (hibernate, terminate, wake guard, child session error finally blocks) where the DO may go idle before the threshold is hit. Addresses f3nry review on #1.
The plugin-sync middleware awaited syncPluginsOnce on every request, which on cold start performed ~47 serial D1 writes before the route handler even ran. Public latency-critical endpoints like /auth/providers got stuck behind it, causing the login page to show its loading skeleton (the dark gray pulse-loaders) for tens of seconds — long enough that users reported "no sign-in buttons." Move the sync into ctx.waitUntil so it runs after the response is returned. The registry sync is already idempotent and best-effort, so the first cold-start request seeing slightly-stale content is fine.
Even with the sync running in the background, kicking it off on unauthenticated traffic (/auth/providers, /auth/<provider> redirects, /health) wastes D1 capacity on the cold-start path that's most visible to end users. Skip the sync entirely for those prefixes — authenticated traffic on /api/* still drives it.
Previously, if /auth/providers errored out (default retry=1), the loading skeleton would disappear, providers would be undefined, and the card would render with just the title and zero buttons — visually identical to a "no login providers configured" state. Users had no indication anything was wrong and no way to recover short of a full page reload. Render an explicit error message with a Retry button when the query fails, and bump retry to 3 for this specific query so transient worker cold-starts don't trip it.
Add minimal Progressive Web App support so the Valet dashboard can be installed as a desktop app from Chrome (and other browsers that support PWA). This eliminates the need to hunt through browser tabs to find it. Changes: - Add web app manifest (manifest.json) with app name, icons, and standalone display mode - Add app icons (192x192, 512x512, 512x512 maskable, apple-touch-icon) matching the existing dark theme with cyan 'V' branding - Add minimal service worker (sw.js) with fetch handler to satisfy Chrome's PWA installability criteria — no offline caching - Register service worker from main.tsx on page load - Add manifest link, apple-touch-icon, and apple-mobile-web-app meta tags to index.html This is intentionally minimal — no offline support, no push notifications, no caching strategy. Just enough for the install prompt. Includes a TODO for future ACP (Agent Client Protocol) integration to bridge cloud-hosted agent sessions with local dev tools.
Add initial thoughts for how incident scribing works.
The interactive handler returned 200 to Slack right away (to meet the 3-second ack deadline) and only updated the message via chat.update after the DO finished executing the approved action. For slow tool calls that gap was many seconds, during which Slack dropped its spinner but the buttons stayed enabled — letting users re-click Approve/Deny while processing was already underway. Now we POST to the payload's response_url before scheduling the DO call to replace the actions block with a "⏳ Processing…" context. The DO's existing final-state chat.update still runs at the end and overwrites with "✅ Approved by …" / "❌ Denied by …".
fix: unblock /auth/providers on cold start + show error UI on login
Move slack-app-manifest.json from packages/worker/ to packages/plugin-slack/ where it belongs. Add assistant_view feature block required by the assistant:write scope. Update "AI coding agent" framing to "AI coworker" across manifest, README, CLAUDE.md, and slack-setup docs.
Linear's MCP OAuth server only supports the 'write' scope (which implies read access). Requesting 'read' caused an invalid_scope error in prod where the OAuth client was freshly registered.
assembleRepoEnv used a raw DB lookup that bypassed the credential service's token refresh logic, causing 403 push errors when the user's 8-hour GitHub App OAuth token expired. All credential resolution paths now use getCredential() which checks expiry and auto-refreshes via the App's OAuth refresh endpoint. Additional fixes in this commit: - Add push-permission validation (check permissions.push, not just HTTP status) and 401 force-refresh retry for legacy credentials - Migrate repos.ts and repo-providers.ts from old app_install credential model to github_installations + on-demand token minting - Remove dead repo-providers OAuth link/callback flow and stale manifest callback URL - Add comments throughout clarifying the unified GitHub App auth model (no classic OAuth App exists)
fix: strip Slack approval buttons immediately on click
…hread Child session idle/completion events were delivered to the orchestrator's main thread instead of the thread that spawned the child. The root cause was that parentThreadId was only stored in the child DO's transient SQL state, which is lost if the DO is re-initialized. - Persist parent_thread_id in D1 sessions table as durable fallback - Fall back to D1 value in notifyParentEvent when DO state is empty - Add hydration to handleSystemMessage's direct dispatch path so thread channels resolve the correct OpenCode session (was missing, unlike handlePrompt and sendNextQueuedPrompt which already hydrate) - Extract ensureThreadOcSessionHydrated helper to deduplicate the hydration pattern across dispatch paths - Add diagnostic logging at spawn, notify, and receive points
Drive files.list calls default corpora to 'user', hiding org-shared docs. Add a driveCorpora org setting (user/domain/allDrives) that flows through guardConfig to the Google Workspace plugin, letting admins control file discovery scope from the admin UI.
…i-49-drive-corpora-setting feat: configurable Drive corpora org setting (TKAI-49)
The GET /integrations/google_workspace/labels endpoint used stale tokens without retry logic, unlike action execution which detects 401s and force-refreshes. Extract the fetch into a helper, detect 401 on the first attempt, force-refresh the credential, and retry once.
… stop Channel followups were never resolved when a prompt completed — only when the agent explicitly called channel_reply with follow_up=true. This left pending followup rows that fired stale reminders via the alarm handler, causing the agent to re-attempt already-completed work and post contradictory messages to Slack. Resolve followups in handlePromptComplete (including the thread-origin fallback for web UI messages targeting Slack-originated threads), bulk- resolve on hibernation, and delete on session stop.
Slack forwarded messages store their content in event.attachments with is_msg_unfurl: true, but parseInbound only read event.text (the link URL). Extract unfurl text/fallback and include it in the parsed message. Also fix pre-existing test that used 11MB for a non-image file size limit check when the actual limit is 25MB.
Cloudflare's minutely cron occasionally skips invocations. When it misses the exact minute a trigger is scheduled for, that trigger silently fails until the next matching time (24h later for daily triggers). Add a two-pass dispatch: Pass 1 fires triggers matching the current minute (existing behavior). Pass 2 scans backward up to 4 hours from each trigger's last_run_at to find and dispatch missed ticks. The existing schedule_ticks dedup table prevents double-dispatch. Also fixes: - Midnight bug: use hourCycle:'h23' instead of hour12:false (V8 returns "24" for midnight with hour12:false) - Orchestrator tick ordering: claim tick before dispatch, release on retriable failure (prevents both burned ticks and duplicate prompts) - Workflow enqueue ordering: update last_run_at after enqueue succeeds - Tick table cleanup: nightly batched delete of rows older than 7 days - Type safety: remove as-any from getActiveScheduleTriggers, add config.cron guard, add Invalid Date guard on last_run_at - Extract cron functions to lib/cron.ts with 40 unit tests
The empty-state copy said "No tools enabled. This persona will only have built-in coding tools." but the backend treats an empty whitelist as unrestricted (all tools available). Updated to accurately describe the restriction semantics.
…agents When a session spawns a task (subagent), the parent session's sync prompt timeout (5 min) would fire because the parent receives no SSE events while the subagent works. This caused a spurious model failover mid-task. Two fixes: - Use OpenCode's server.heartbeat (every ~10s) to reset the sync timeout when tools are in running/pending state - When the timeout fires, check for running tools before aborting — if tools are still active, extend the timeout instead
63MB of docker/opencode/node_modules was being uploaded every deploy, only to be overwritten by bun install during image build.
…iginating thread The pending followup banner, thinking indicator, and promote/edit/cancel actions were displaying globally across all threads instead of only in the thread that owns the queued work. Include threadId in agentStatus broadcasts from the server and filter all affected UI elements by the active thread. Closes TKAI-94
When clicking "new thread," focus the text input so the user can start typing immediately without an extra click. Closes TKAI-99
…proval-overrides feat: add per-user approval policy overrides
… refresh Session-scoped "allow for this session" overrides were effectively permanent on orchestrator sessions because they never reach a terminal status. Now they expire when the sandbox is recycled — on hibernation (idle 15min) and explicit refresh — matching the original intent that ephemeral approvals last only as long as the OpenCode instance. Also compact the approval prompt buttons from a vertical listbox to a horizontal button row to reduce vertical space.
The soft-expire approach (setting expiresAt on expired rows) created ghost rows that every query path had to filter out, leading to a cascade of bugs: settings page showing stale overrides, upserts reviving expired rows, partial unique index collisions on re-grants. Switch to hard DELETE — the table only contains active rows, so queries and indexes work without expiry filtering. Per-invocation audit trail is preserved on action_invocations (user_override_id is ON DELETE SET NULL, and policy_source/lifetime/scope are independent columns). Also adds the action-policy-overrides explainer spec and compacts approval prompt buttons to a horizontal layout.
…-flush msToUnixNano multiplied a JS number by 1_000_000, exceeding MAX_SAFE_INTEGER for present-day epoch milliseconds and silently rounding span timestamps. Adds maxQueuedSpans + scheduleFlush options so a hot tracer can auto-flush in batches once a buffer threshold is hit, with a host hook for ctx.waitUntil. Addresses review feedback on #1 and yourbuddyconner#45.
The prompt handler signature had grown to 14 positional args (messageId, content, model, author, modelPreferences, attachments, channelType, channelId, opencodeSessionId, continuationContext, threadId, replyChannelType, replyChannelId, traceparent), making call sites and the wire shape unreviewable. Introduces PromptDispatch / PromptHandlerFn so onPrompt and handlePrompt take a single typed object. Updates the agent-client dispatcher, PromptHandler.handlePrompt, the bin.ts callback, and the prompt unit tests to use the new shape. Addresses figitaki review on yourbuddyconner#45.
Previously every span ended in a session-agent dispatch path called flushTracing() inline, fanning out to a separate ctx.waitUntil(flush()) per span and POSTing the buffer to OTLP each time. Across spawn → dispatch → dispatch → hibernate → restore lifecycles this generated many small HTTP requests against Grafana Cloud. Configures SimpleTracer with maxQueuedSpans=50 and a scheduleFlush hook that registers auto-flush promises with ctx.waitUntil. Drops the per-span flushTracing() calls at dispatch sites; explicit flushes remain at session-level boundaries (hibernate, terminate, wake guard, child session error finally blocks) where the DO may go idle before the threshold is hit. Addresses f3nry review on #1.
d1faab0 to
434efb8
Compare
Summary
Test plan
Need to set.
Verification
pnpm --filter @valet/shared typecheckpnpm --filter @valet/runner typecheckpnpm --filter @valet/worker typecheckpnpm --filter @valet/runner test -- src/prompt.test.tspnpm --filter @valet/worker exec vitest run src/durable-objects/prompt-queue.test.ts src/durable-objects/runner-link.test.tsRefs TKAI-2