agent-api: harden startup and shutdown on Cloud Run by skord · Pull Request #3114 · estuary/flow

skord · 2026-07-02T22:46:50Z

Follow-up to #3109, addressing the confirmed findings from its post-merge review.

Description:

#3109 fixed the direct-VPC-egress cold-start crash-loop, but its 120-second tolerance covered only a probe pool.acquire(). During the first rollout, four instances passed the probe and then exited at "Error: querying for agent user id", because the next startup query still ran under the bare 5-second acquire_timeout. Review also confirmed three adjacent defects:

- Startup could hang silently (port unbound, zero agent-side errors) if the authorization snapshot fetch failed persistently, since its refresh loop retries internally forever.
- Permanent misconfigurations (bad password, TLS) were retried for the full 120 seconds even though sqlx surfaces them promptly and distinguishably.
- Graceful shutdown never ran on Cloud Run because only SIGINT was handled, so every scale-down hard-killed in-flight requests when the grace period expired.

Changes:

- Retry get_user_id_for_email itself against the 120s deadline, replacing the acquire-probe. The retry now covers the real first query (and drops the probe's throwaway connection).
- Retry only transient errors (PoolTimedOut, Io); non-transient failures exit immediately with the underlying error so misconfiguration surfaces fast.
- Bound the initial snapshot fetch to 60 seconds with a clear error. Combined worst-case startup stays inside Cloud Run's 240-second default startup-probe window.
- Handle SIGTERM (Cloud Run / Kubernetes) in addition to SIGINT. Fixing this exposed a deeper pre-existing defect: graceful shutdown could never complete because the final try_join! awaited the logs sink, which only ends once every logs_tx sender drops, and several senders are owned by locals of async_main. The join is restructured to serve until shutdown (an early sink failure remains fatal), then give the sink a bounded 5-second drain before exiting.
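The first two changes can be illustrated with a minimal, std-only sketch. It is not the code in this PR: `DbError` is a hypothetical stand-in for the database error type (in sqlx the transient cases named above correspond to `sqlx::Error::PoolTimedOut` and `sqlx::Error::Io`), `retry_startup_query` is an invented name, and the real service is async rather than blocking.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the database error type (assumption, not sqlx itself).
#[derive(Debug, Clone, PartialEq)]
pub enum DbError {
    PoolTimedOut,
    Io(String),
    /// Example of a permanent misconfiguration, such as a bad password.
    Auth(String),
}

impl DbError {
    /// Only errors that a warming network interface could plausibly cause are retried.
    pub fn is_transient(&self) -> bool {
        matches!(self, DbError::PoolTimedOut | DbError::Io(_))
    }
}

/// Runs `query` until it succeeds, fails with a non-transient error,
/// or fails with a transient error after `budget` has elapsed.
/// Returns the final result and the number of attempts made.
pub fn retry_startup_query<T>(
    budget: Duration,
    pause: Duration,
    mut query: impl FnMut() -> Result<T, DbError>,
) -> (Result<T, DbError>, u32) {
    let deadline = Instant::now() + budget;
    let mut attempts = 0;
    loop {
        attempts += 1;
        match query() {
            Ok(value) => return (Ok(value), attempts),
            // Transient and still inside the budget: log and try the real query again.
            Err(e) if e.is_transient() && Instant::now() < deadline => {
                eprintln!("initial database query failed; retrying ({:?})", e);
                sleep(pause);
            }
            // Non-transient, or out of budget: surface the underlying error.
            Err(e) => return (Err(e), attempts),
        }
    }
}
```

The point of the design is that the retried operation is the real first query rather than a separate probe, so success means the query the service actually needs has run, and a permanent error returns after a single attempt.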

Local testing (real binary against local Supabase):

- Delayed-DB cold start (socat forward brought up at t=25s): port stays unbound, "initial database query failed; retrying (PoolTimedOut)" warnings appear every ~6s, and the process binds and serves within 1s of the DB becoming reachable. This is the same behavior as #3109 (agent-api: survive direct VPC egress cold-start on instance startup).
- Wrong PGPASSWORD: exits in under a second with "password authentication failed", versus 120 seconds of retries before this change.
- SIGTERM after startup: graceful shutdown completes ("main function completed ... Ok(Ok(()))") and the process exits in ~5.1s. Previously it hung until SIGKILL.
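The SIGTERM result follows from the bounded drain: a sink that ends only when every sender drops can never finish while some sender is still held, so shutdown waits a fixed window and then moves on. The sketch below models that with std channels and threads. It is an illustration under assumptions, not the PR's async code: `spawn_sink`, `drain_sink`, and `DrainOutcome` are hypothetical names, and signal handling is omitted.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;
use std::time::Duration;

/// Hypothetical result of waiting for the logs sink during shutdown.
#[derive(Debug, PartialEq)]
pub enum DrainOutcome {
    /// The sink finished and reports how many log lines it consumed.
    Drained(usize),
    /// The sink did not finish inside the window (or died); shutdown proceeds anyway.
    GaveUp,
}

/// Spawns a sink thread that consumes log lines until EVERY sender is dropped,
/// then reports the number of lines on the returned channel.
pub fn spawn_sink(logs_rx: Receiver<String>) -> Receiver<usize> {
    let (done_tx, done_rx) = channel();
    thread::spawn(move || {
        // `iter()` blocks until all senders are gone, mirroring the sink described above.
        let count = logs_rx.iter().count();
        let _ = done_tx.send(count);
    });
    done_rx
}

/// Waits at most `window` for the sink to finish instead of joining it unconditionally.
pub fn drain_sink(done_rx: &Receiver<usize>, window: Duration) -> DrainOutcome {
    match done_rx.recv_timeout(window) {
        Ok(count) => DrainOutcome::Drained(count),
        Err(_) => DrainOutcome::GaveUp,
    }
}
```

With a sender still alive somewhere, `drain_sink` gives up after the window rather than blocking forever, which matches the observed exit a little over 5 seconds after SIGTERM when the window is 5 seconds.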

Rollout:

Merge, then dispatch deploy-agent-api with the new tag. Watch for: no exit(1) at "querying for agent user id" during the rollout window, and (on later scale-downs) no hard-killed in-flight requests.

Follow-ups to the direct VPC egress cold-start fix (#3109), from review:

- Retry the first startup query (get_user_id_for_email) against the 120-second deadline instead of a separate pool.acquire() probe. The probe only proved the network worked once: instances were observed passing it and then exiting at "querying for agent user id" when the warming interface flapped, since that query ran under the bare 5-second acquire_timeout. Retrying the query itself also avoids the probe's throwaway connection.
- Retry only transient errors (PoolTimedOut, Io). Non-transient failures such as bad credentials previously retried for the full two minutes; sqlx surfaces them promptly and distinguishably, so exit immediately and let misconfiguration surface fast.
- Bound the initial authorization snapshot fetch to 60 seconds. Its refresh loop retries errors internally forever, so a persistent failure (broken query, sops / KMS breakage) hung startup silently with the port unbound and nothing logged: Cloud Run's startup probe would kill the instance with no agent-side error at all. Now startup fails visibly, and the combined budget stays inside the probe's 240-second window.
- Trigger graceful shutdown on SIGTERM, which Cloud Run and Kubernetes send; previously only SIGINT was handled, so every scale-down or deploy replacement hard-killed in-flight requests when the grace period expired. Fixing this exposed that graceful shutdown could never complete anyway: the final try_join awaited the logs sink, which only ends once every logs_tx sender drops, and several senders are owned by locals of async_main. Serve until shutdown instead, then give the sink a bounded drain window.

Verified: SIGTERM now exits Ok in ~5s where it previously hung until SIGKILL.
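The snapshot bound and the combined budget can be sketched as follows. This is a std-only illustration under assumptions: the constant names, `fetch_with_budget`, and the error text are hypothetical, and the actual service presumably uses an async timeout rather than a helper thread. The three durations are the ones stated above.

```rust
use std::sync::mpsc::channel;
use std::thread;
use std::time::Duration;

// Budgets taken from the description above; the constant names are illustrative.
pub const DB_RETRY_BUDGET: Duration = Duration::from_secs(120);
pub const SNAPSHOT_FETCH_BUDGET: Duration = Duration::from_secs(60);
pub const STARTUP_PROBE_WINDOW: Duration = Duration::from_secs(240);

/// Sum of the two startup budgets: 120s + 60s = 180s.
pub fn worst_case_startup() -> Duration {
    DB_RETRY_BUDGET + SNAPSHOT_FETCH_BUDGET
}

/// Runs `fetch` on a helper thread and waits at most `budget` for its result,
/// so a fetch that retries internally forever becomes a visible startup error.
pub fn fetch_with_budget<T: Send + 'static>(
    budget: Duration,
    fetch: impl FnOnce() -> T + Send + 'static,
) -> Result<T, String> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        // The receiver may already be gone if the budget expired; ignore that.
        let _ = tx.send(fetch());
    });
    rx.recv_timeout(budget).map_err(|_| {
        // Hypothetical wording; the PR only says the error is "clear".
        format!(
            "initial authorization snapshot fetch did not complete within {:?}",
            budget
        )
    })
}
```

Under these numbers the two budgets sum to 180 seconds, which leaves 60 seconds of headroom inside the 240-second startup-probe window for everything else that happens before the port binds.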

skord requested a review from a team July 2, 2026 23:11

skord self-assigned this Jul 2, 2026

skord wants to merge 1 commit into master from mdanko/agent-api-startup-hardening
