Skip to content

agent-api: harden startup and shutdown on Cloud Run#3114

Open
skord wants to merge 1 commit into
masterfrom
mdanko/agent-api-startup-hardening
Open

agent-api: harden startup and shutdown on Cloud Run#3114
skord wants to merge 1 commit into
masterfrom
mdanko/agent-api-startup-hardening

Conversation

@skord

@skord skord commented Jul 2, 2026

Copy link
Copy Markdown
Member

Follow-up to #3109, addressing the confirmed findings from its post-merge review.

Description:

#3109 fixed the direct-VPC-egress cold-start crash-loop, but its 120-second tolerance covered only a probe pool.acquire(). During the first rollout, four instances passed the probe and then exited at Error: querying for agent user id — the next startup query still ran under the bare 5-second acquire_timeout. Review also confirmed three adjacent defects: startup could hang silently (port unbound, zero agent-side errors) if the authorization snapshot fetch failed persistently, since its refresh loop retries internally forever; permanent misconfigurations (bad password, TLS) were retried for the full 120 seconds even though sqlx surfaces them promptly and distinguishably; and graceful shutdown never ran on Cloud Run because only SIGINT was handled — every scale-down hard-killed in-flight requests when the grace period expired.

Changes:

  • Retry get_user_id_for_email itself against the 120s deadline, replacing the acquire-probe. The retry now covers the real first query (and drops the probe's throwaway connection).
  • Retry only transient errors (PoolTimedOut, Io); non-transient failures exit immediately with the underlying error so misconfiguration surfaces fast.
  • Bound the initial snapshot fetch to 60 seconds with a clear error. Combined worst-case startup stays inside Cloud Run's 240-second default startup-probe window.
  • Handle SIGTERM (Cloud Run / Kubernetes) in addition to SIGINT. Fixing this exposed a deeper pre-existing defect: graceful shutdown could never complete because the final try_join! awaited the logs sink, which only ends once every logs_tx sender drops — and several senders are owned by locals of async_main. The join is restructured to serve until shutdown (an early sink failure remains fatal), then give the sink a bounded 5-second drain before exiting.

Local testing (real binary against local Supabase):

  • Delayed-DB cold start (socat forward brought up at t=25s): port stays unbound, initial database query failed; retrying (PoolTimedOut) warnings every ~6s, binds and serves within 1s of the DB becoming reachable — same behavior as agent-api: survive direct VPC egress cold-start on instance startup #3109.
  • Wrong PGPASSWORD: exits in under a second with password authentication failed, versus 120 seconds of retries before this change.
  • SIGTERM after startup: graceful shutdown completes, main function completed ... Ok(Ok(())), process exits in ~5.1s — previously it hung until SIGKILL.

Rollout:

Merge, then dispatch deploy-agent-api with the new tag. Watch for: no exit(1) at "querying for agent user id" during the rollout window, and (on later scale-downs) no hard-killed in-flight requests.

Follow-ups to the direct VPC egress cold-start fix (#3109), from review:

- Retry the first startup query (get_user_id_for_email) against the
  120-second deadline instead of a separate pool.acquire() probe. The
  probe only proved the network worked once: instances were observed
  passing it and then exiting at "querying for agent user id" when the
  warming interface flapped, since that query ran under the bare
  5-second acquire_timeout. Retrying the query itself also avoids the
  probe's throwaway connection.

- Retry only transient errors (PoolTimedOut, Io). Non-transient
  failures such as bad credentials previously retried for the full
  two minutes; sqlx surfaces them promptly and distinguishably, so
  exit immediately and let misconfiguration surface fast.

- Bound the initial authorization snapshot fetch to 60 seconds. Its
  refresh loop retries errors internally forever, so a persistent
  failure (broken query, sops / KMS breakage) hung startup silently
  with the port unbound and nothing logged: Cloud Run's startup probe
  would kill the instance with no agent-side error at all. Now startup
  fails visibly, and the combined budget stays inside the probe's
  240-second window.

- Trigger graceful shutdown on SIGTERM, which Cloud Run and Kubernetes
  send; previously only SIGINT was handled, so every scale-down or
  deploy replacement hard-killed in-flight requests when the grace
  period expired. Fixing this exposed that graceful shutdown could
  never complete anyway: the final try_join awaited the logs sink,
  which only ends once every logs_tx sender drops, and several senders
  are owned by locals of async_main. Serve until shutdown instead,
  then give the sink a bounded drain window. Verified: SIGTERM now
  exits Ok in ~5s where it previously hung until SIGKILL.
@skord skord requested a review from a team July 2, 2026 23:11
@skord skord self-assigned this Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant