agent-api: harden startup and shutdown on Cloud Run#3114
Open
skord wants to merge 1 commit into
Open
Conversation
Follow-ups to the direct VPC egress cold-start fix (#3109), from review: - Retry the first startup query (get_user_id_for_email) against the 120-second deadline instead of a separate pool.acquire() probe. The probe only proved the network worked once: instances were observed passing it and then exiting at "querying for agent user id" when the warming interface flapped, since that query ran under the bare 5-second acquire_timeout. Retrying the query itself also avoids the probe's throwaway connection. - Retry only transient errors (PoolTimedOut, Io). Non-transient failures such as bad credentials previously retried for the full two minutes; sqlx surfaces them promptly and distinguishably, so exit immediately and let misconfiguration surface fast. - Bound the initial authorization snapshot fetch to 60 seconds. Its refresh loop retries errors internally forever, so a persistent failure (broken query, sops / KMS breakage) hung startup silently with the port unbound and nothing logged: Cloud Run's startup probe would kill the instance with no agent-side error at all. Now startup fails visibly, and the combined budget stays inside the probe's 240-second window. - Trigger graceful shutdown on SIGTERM, which Cloud Run and Kubernetes send; previously only SIGINT was handled, so every scale-down or deploy replacement hard-killed in-flight requests when the grace period expired. Fixing this exposed that graceful shutdown could never complete anyway: the final try_join awaited the logs sink, which only ends once every logs_tx sender drops, and several senders are owned by locals of async_main. Serve until shutdown instead, then give the sink a bounded drain window. Verified: SIGTERM now exits Ok in ~5s where it previously hung until SIGKILL.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to #3109, addressing the confirmed findings from its post-merge review.
Description:
#3109 fixed the direct-VPC-egress cold-start crash-loop, but its 120-second tolerance covered only a probe
pool.acquire(). During the first rollout, four instances passed the probe and then exited atError: querying for agent user id— the next startup query still ran under the bare 5-secondacquire_timeout. Review also confirmed three adjacent defects: startup could hang silently (port unbound, zero agent-side errors) if the authorization snapshot fetch failed persistently, since its refresh loop retries internally forever; permanent misconfigurations (bad password, TLS) were retried for the full 120 seconds even though sqlx surfaces them promptly and distinguishably; and graceful shutdown never ran on Cloud Run because only SIGINT was handled — every scale-down hard-killed in-flight requests when the grace period expired.Changes:
get_user_id_for_emailitself against the 120s deadline, replacing the acquire-probe. The retry now covers the real first query (and drops the probe's throwaway connection).PoolTimedOut,Io); non-transient failures exit immediately with the underlying error so misconfiguration surfaces fast.try_join!awaited the logs sink, which only ends once everylogs_txsender drops — and several senders are owned by locals ofasync_main. The join is restructured to serve until shutdown (an early sink failure remains fatal), then give the sink a bounded 5-second drain before exiting.Local testing (real binary against local Supabase):
initial database query failed; retrying (PoolTimedOut)warnings every ~6s, binds and serves within 1s of the DB becoming reachable — same behavior as agent-api: survive direct VPC egress cold-start on instance startup #3109.PGPASSWORD: exits in under a second withpassword authentication failed, versus 120 seconds of retries before this change.main function completed ... Ok(Ok(())), process exits in ~5.1s — previously it hung until SIGKILL.Rollout:
Merge, then dispatch
deploy-agent-apiwith the new tag. Watch for: noexit(1)at "querying for agent user id" during the rollout window, and (on later scale-downs) no hard-killed in-flight requests.