feat(auth): Add OAuth quota group suspension#2885
feat(auth): Add OAuth quota group suspension#288538lunin021189-creator wants to merge 8 commits intorouter-for-me:devfrom
Conversation
Add configurable OAuth quota groups with persisted auto and manual suspension state for OAuth-backed providers. Route 429 quota exhaustion through group-level cooldowns, keep transient 503 handling model-scoped, expose management endpoints for inspecting and updating group state, and cover the Antigravity Claude-versus-Gemini behavior with tests. Co-Authored-By: Codex <[email protected]>
Apply three fixes called out in the gemini-code-assist review of PR
router-for-me/CLIProxyAPIPlus#524 before resubmitting the feature against
the mainline repository:
1. Thread-safety (shared config mutation):
- Introduce cloneRuntimeConfigForQuotaGroups() that returns a shallow
copy of the currently published *internalconfig.Config with an
independently owned OAuthAccountQuotaGroupState slice.
- Replace the three callsites in oauth_quota_groups_manager.go that
previously loaded the shared pointer and mutated it in place. The
mutation now happens on a private copy before SetConfig publishes
the new pointer via atomic.Value, so readers that still hold the
previous pointer never observe a torn slice header.
2. Atomic persistence of config.yaml:
- Change SaveConfigPreserveComments to render the merged YAML fully
in memory first, then write it to disk via a new writeFileAtomic
helper (CreateTemp sibling + Write + Sync + Chmod + Rename). A
crash, concurrent writer, or failed Sync can no longer leave a
truncated or partially-written live config.
- Serialize quota-group driven writes with a package-level mutex
(oauthQuotaGroupPersistMu) so we never rebuild/merge the same file
from multiple goroutines in parallel even though the final rename
is atomic.
3. Async cleanup of expired auto suspensions:
- Remove the three synchronous m.ClearExpiredOAuthQuotaGroupAutoStates
calls from Execute / ExecuteCount / ExecuteStream. The per-request
lock-free oauthQuotaGroupBlock read already ignores entries whose
AutoSuspendedUntil is in the past, so the hot path no longer needs
to acquire the global Manager lock or write config.yaml.
- Add StartOAuthQuotaGroupCleanup / StopOAuthQuotaGroupCleanup that
run the reaper in a single background goroutine with a 30s ticker
and a one-second primer, cancellable via the supplied context. The
Manager.quotaGroupCleanupCancel field tracks the active loop and
guarantees only one is alive at a time.
- Wire the new loop into the existing StartAutoRefresh /
StopAutoRefresh lifecycle so hosts that already manage background
loops get expired-state cleanup for free.
Verified with `go build ./...` and `go test -race ./sdk/cliproxy/auth/...
./internal/config/...` on go1.26-alpine; no data races reported.
|
This pull request targeted The base branch has been automatically changed to |
There was a problem hiding this comment.
Code Review
This pull request implements OAuth quota groups to manage shared quota families for Google-backed models, including persistence for manual and automatic suspension states. It introduces management API endpoints for state control, background cleanup for expired auto-suspensions, and an atomic configuration saving mechanism. The review feedback identifies several critical concurrency and consistency issues, specifically regarding thread-unsafe mutations of shared configuration objects in the management handlers and the use of stale state snapshots when calculating account availability. Additionally, it is recommended to improve error handling during file permission updates in the atomic write process.
| } | ||
| entries = wrapper.Items | ||
| } | ||
| h.cfg.OAuthQuotaGroups = config.NormalizeOAuthQuotaGroups(entries) |
There was a problem hiding this comment.
Data race: Mutating h.cfg.OAuthQuotaGroups in place is not thread-safe. This configuration object is shared with the AuthManager and other components. Concurrent reads (e.g., during request routing or in GetAuthFileQuotaGroups) could observe a torn slice header or inconsistent data. You should clone the configuration object, mutate the clone, and then update the reference atomically, or ensure all access to h.cfg is protected by a lock that is also respected by readers.
| h.authManager.ClearExpiredOAuthQuotaGroupAutoStates(time.Now()) | ||
| definitions := config.NormalizeOAuthQuotaGroups(h.cfg.OAuthQuotaGroups) | ||
| stateEntries := config.NormalizeOAuthAccountQuotaGroupState(h.cfg.OAuthAccountQuotaGroupState) |
There was a problem hiding this comment.
Consistency issue: h.cfg in the management handler is not automatically updated when h.authManager.ClearExpiredOAuthQuotaGroupAutoStates modifies the configuration. As a result, stateEntries is populated from a stale version of the configuration, causing the API to return expired suspension states that were just cleared. More importantly, any subsequent manual update via PatchAuthFileQuotaGroupsManual will use the stale h.cfg as a base, potentially re-introducing expired auto-suspension states into the config file and AuthManager.
| h.cfg.OAuthAccountQuotaGroupState = config.RemoveOAuthAccountQuotaGroupState(h.cfg.OAuthAccountQuotaGroupState, authID, groupID) | ||
| } else { | ||
| h.cfg.OAuthAccountQuotaGroupState = config.UpsertOAuthAccountQuotaGroupState(h.cfg.OAuthAccountQuotaGroupState, current) |
There was a problem hiding this comment.
Data race and consistency issue: Mutating h.cfg.OAuthAccountQuotaGroupState outside of a lock that protects the entire read-modify-write cycle is unsafe. Multiple concurrent management requests could result in lost updates. Additionally, since h.cfg is shared, this mutation is visible to other goroutines without proper synchronization. Furthermore, h.cfg may be stale if the AuthManager has performed background cleanups, leading to the re-introduction of expired states.
| if err = tmp.Chmod(perm); err != nil { | ||
| // Non-fatal on Windows where Chmod has limited effect; ignore. | ||
| _ = err | ||
| } |
There was a problem hiding this comment.
Ignoring the error from tmp.Chmod might lead to incorrect file permissions on non-Windows systems if the operation fails for reasons other than platform incompatibility (e.g., restricted filesystem features). It's safer to check the error and only ignore it if it's a known non-fatal condition or if the platform is Windows.
| } | ||
| state := ensureModelState(auth, result.Model) | ||
| resetModelState(state, now) | ||
| updateAggregatedAvailability(auth, now) |
There was a problem hiding this comment.
Stale snapshot usage: updateAggregatedAvailability relies on the global oauthQuotaRuntime snapshot to determine the status of quota groups. However, this snapshot is only updated when m.SetConfig is called (line 2189). Calling updateAggregatedAvailability here uses the state from before the current result was processed, leading to an incorrect auth.Unavailable status until the next time availability is recalculated.
| if auth == nil { | ||
| continue | ||
| } | ||
| updateAggregatedAvailability(auth, now) |
There was a problem hiding this comment.
Stale snapshot usage: updateAggregatedAvailability is called here before the updated configuration has been published via m.SetConfig (line 232). This means the availability calculation uses the old snapshot which still contains the expired suspension states. Consequently, the Auth objects in m.auths and the scheduler will retain a stale Unavailable status until a subsequent request or cleanup tick triggers another update.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e573763673
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // Write atomically: temp file in the same directory, fsync, then rename. | ||
| // This prevents a partial write from corrupting the live config on crash | ||
| // or concurrent writer races. | ||
| return writeFileAtomic(configFile, data, 0o644) |
There was a problem hiding this comment.
Preserve restrictive permissions when saving config
Passing a hard-coded 0o644 here makes every persisted config.yaml world-readable regardless of the process umask or the file’s prior mode, which can expose embedded secrets (API keys/tokens) on multi-user hosts. Before this change, os.Create respected umask, so deployments using umask 077 would keep config private; now each write can silently relax permissions.
Useful? React with 👍 / 👎.
| return false | ||
| } | ||
| if h.authManager != nil { | ||
| h.authManager.SetConfig(h.cfg) |
There was a problem hiding this comment.
Avoid resetting manager state from stale handler config
This pushes h.cfg back into authManager after every management write, but quota-group auto state is now mutated inside Manager.MarkResult and published via its own runtime config snapshot. If a 429 updates quota-group state and a management request persists before file-watcher reload refreshes h.cfg, this call overwrites the newer runtime snapshot with stale data, dropping active cooldown/manual state and causing inconsistent routing behavior.
Useful? React with 👍 / 👎.
| } | ||
| m.runtimeConfig.Store(cfg) | ||
| m.applyBuiltInSelectorForConfig(cfg) | ||
| SetOAuthQuotaRuntimeConfig(cfg) |
There was a problem hiding this comment.
Scope quota runtime config to each Manager instance
SetConfig writes quota-group routing state into a process-global snapshot, so multiple auth.Manager instances in the same process overwrite each other’s quota groups/aliases/state. That breaks isolation for embedded SDK use (or parallel services/tests) because routing decisions in one manager can start using another manager’s config unexpectedly.
Useful? React with 👍 / 👎.
Apply the nine inline review comments left by gemini-code-assist and chatgpt-codex-connector on PR router-for-me#2885. Management handler thread-safety (addresses gemini HIGH #1/router-for-me#2/router-for-me#3 and codex P1 on handler.go): - internal/api/handlers/management/quota_groups.go no longer mutates h.cfg.OAuthQuotaGroups or h.cfg.OAuthAccountQuotaGroupState directly. Reads go through the new authManager.OAuthQuotaGroupStateSnapshot(), so every call sees the freshly published runtime snapshot even after a background ClearExpiredOAuthQuotaGroupAutoStates tick. - Manual suspend / lift and explicit auto-clear use two new Manager methods (SetOAuthQuotaGroupManualState and ClearOAuthQuotaGroupAutoState) that take m.mu.Lock, copy the current runtime *Config, mutate the copy and return it to the caller. The handler then calls h.persist() which takes the global persist mutex introduced in round 1. - internal/api/handlers/management/handler.go stops pushing the stale h.cfg back into authManager via SetConfig in both SetAuthManager and persist(). Only the config-file path is propagated, preserving the newer runtime snapshot produced by MarkResult / cleanup / management writes. Manager-scoped quota-group runtime (addresses codex P2 router-for-me#9): - sdk/cliproxy/auth/oauth_quota_groups.go drops the package-global var oauthQuotaRuntime atomic.Value / init() / SetOAuthQuotaRuntimeConfig entry points. The snapshot is now looked up via the new helper oauthQuotaRuntimeSnapshotForAuth(auth) which reads auth.quotaRuntime. - sdk/cliproxy/auth/types.go adds a private quotaRuntime pointer to Auth and Auth.Clone copies it forward. - sdk/cliproxy/auth/conductor.go adds a Manager.oauthQuotaRuntime atomic.Value, initialises it in NewManager, refreshes it inside SetConfig (while holding m.mu briefly to fan-out the new pointer to every cached Auth), and makes Register / Update / Load attach the current snapshot to newly registered Auths. Multiple Manager instances in the same process no longer overwrite each other's routing state. Atomic config.yaml write preserves existing permissions (addresses codex P1 security router-for-me#7 and gemini MEDIUM router-for-me#4): - internal/config/config.go replaces the round-1 writeFileAtomic(.., 0o644) helper with writeConfigFileAtomically(configFile, data). The new helper stat()s the existing file and reuses its permission bits; a 0o600 default is used when the file does not yet exist. Chmod errors are propagated, not swallowed, so a failed permission preservation cannot silently produce a world-readable secret file. Both SaveConfigPreserveComments and SaveConfigPreserveCommentsUpdateNestedScalar now use the atomic writer, so the nested-scalar path no longer leaves a truncated file on crash. Ordering of updateAggregatedAvailability (addresses gemini MEDIUM router-for-me#5/router-for-me#6): - sdk/cliproxy/auth/conductor.go: updateAggregatedAvailability now reads quota-group state via oauthQuotaGroupState(auth, groupID), so every Auth sees its own manager's freshly published snapshot rather than a process-global stale one. Tests: - sdk/cliproxy/auth/oauth_quota_groups_test.go switches from the old SetOAuthQuotaRuntimeConfig / global-cleanup pattern to a local applyOAuthQuotaRuntime(cfg, auths...) helper that wires the snapshot into each Auth directly, matching the new per-Auth design. Verified with: - docker run golang:1.26-alpine go build ./... - docker run golang:1.26-alpine apk add gcc musl-dev; go test -race -count=1 -timeout 180s ./sdk/cliproxy/auth/... ./internal/config/... Both packages pass with the race detector; no data races remain.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1edeee7b86
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if next := h.authManager.SetOAuthQuotaGroupManualState(authID, groupID, *body.ManualSuspended, body.Reason, body.UpdatedBy, now); next != nil { | ||
| h.cfg = next | ||
| } |
There was a problem hiding this comment.
Publish quota snapshot after manual quota-group changes
This handler stores the updated config in h.cfg and persists it, but it never calls authManager.SetConfig(next) after SetOAuthQuotaGroupManualState (same pattern in PatchAuthFileQuotaGroupsAutoClear). Selection enforces quota-group suspension from auth.quotaRuntime snapshots (oauthQuotaGroupBlock), and those snapshots are only rebuilt in Manager.SetConfig, so manual suspend/clear can appear applied in management responses while live routing still uses stale quota-group state until an external config reload happens.
Useful? React with 👍 / 👎.
| m.mu.Unlock() | ||
| if runtimeCfgToPublish != nil { | ||
| m.SetConfig(runtimeCfgToPublish) | ||
| if persistRuntimeConfig { |
There was a problem hiding this comment.
Keep quota snapshot publish inside MarkResult critical section
MarkResult computes runtimeCfgToPublish while holding m.mu, but releases the lock before calling SetConfig. Under concurrent 429 handling, two goroutines can both clone from the same old runtime config and then publish in sequence, causing the later publish to overwrite and drop the earlier auth/group cooldown entry. That creates nondeterministic missing suspensions under production concurrency.
Useful? React with 👍 / 👎.
Keep OAuth quota-group snapshot publication inside the manager critical section so concurrent cooldown updates do not overwrite each other, and manual management changes take effect immediately in live routing.\n\nAdd regression tests covering manual suspension publication, auto-clear publication, and aggregated availability during quota-group cooldowns.\n\nCo-Authored-By: Codex <[email protected]>
|
Addressed the latest quota-group review items in commit e035c2a. What changed:
Verification:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e035c2ae75
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| prev, _ = m.runtimeConfig.Load().(*internalconfig.Config) | ||
| } |
There was a problem hiding this comment.
Persist quota-state changes from fresh disk config
cloneRuntimeConfigForQuotaGroups builds updates from m.runtimeConfig, and later persistence writes that whole snapshot back to config.yaml. If management/external writes have already changed the file but watcher reload has not yet refreshed m.runtimeConfig, a 429-triggered quota-state write will serialize stale fields and silently roll back unrelated config edits. This creates deterministic lost updates during the reload gap.
Useful? React with 👍 / 👎.
| wd, err := os.Getwd() | ||
| if err != nil || strings.TrimSpace(wd) == "" { | ||
| return "" | ||
| } | ||
| return filepath.Join(wd, "config.yaml") |
There was a problem hiding this comment.
Require explicit config path for quota-state persistence
When SetConfigFilePath is never called, configFilePathValue falls back to <cwd>/config.yaml. In embedded SDK usage this makes quota-group runtime events (e.g., 429 cooldown updates) attempt writes to an arbitrary working-directory file, which can modify the wrong config or emit repeated persistence warnings under load. Persistence should be opt-in when no explicit path is configured.
Useful? React with 👍 / 👎.
| _ = os.Remove(configFile) | ||
| if retryErr := os.Rename(tmpName, configFile); retryErr != nil { |
There was a problem hiding this comment.
Keep existing config until replacement rename succeeds
On rename failure, the fallback immediately removes configFile before retrying os.Rename. If the retry fails too (for transient permission/filesystem reasons), the function returns with the original config already deleted, turning a write error into data loss. The old file should remain intact unless a replacement has been successfully installed.
Useful? React with 👍 / 👎.
Port the upstream-compatible account and quota-group suspension changes from the retired fork-side working tree into the mainline branch. Keep the conductor, selector, filestore, vertex, and config persistence updates on the branch we plan to carry alongside upstream. Add the atomic config write fallback test so the Windows-safe config path stays covered while the fork-only provider tail remains archived separately. Co-Authored-By: Codex <[email protected]>
Non-Anthropic upstreams (for example third-party proxies) may not support `/v1/messages/count_tokens` and can return `404`. Treating that as a model failure suspends the auth for hours and causes follow-on `503` responses. Skip `MarkResult` for `404` failures in `executeCountMixedOnce` so `count_tokens` misses do not trigger auth suspension. Other statuses such as `401`, `429`, and `500` continue to be recorded normally. Co-Authored-By: Claude Opus 4.7 <[email protected]> (cherry picked from commit 92c2f75) Co-Authored-By: Codex <[email protected]>
Allow operators to clear every auto cooldown for a single OAuth account when the upstream quota resets earlier than the stored cooldown window. Preserve manual suspensions during the bulk clear so the escape hatch only removes stale auto freezes. Co-Authored-By: Codex <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 22593c982d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| timer := time.AfterFunc(upstreamAttemptTimeout, func() { | ||
| timedOut.Store(true) | ||
| cancel() |
There was a problem hiding this comment.
Remove hard-coded 25s upstream execution timeout
This introduces a fixed 25-second cancellation for every upstream attempt via newUpstreamAttemptContext, so any legitimate completion/count/bootstrap that takes longer is forcibly turned into an upstream_timeout/504 even when the caller context is still valid. Because this timeout is global and not configurable, slower providers or large prompts will now deterministically fail/retry where they previously could succeed, which is a production behavior regression.
Useful? React with 👍 / 👎.
| } | ||
|
|
||
| now := time.Now().UTC() | ||
| if next := h.authManager.SetOAuthQuotaGroupManualState(authID, groupID, *body.ManualSuspended, body.Reason, body.UpdatedBy, now); next != nil { |
There was a problem hiding this comment.
Check auth manager before applying quota-group patches
Both quota-group PATCH handlers dereference h.authManager without a nil guard, unlike other management handlers. Since this package explicitly supports constructing a handler with a nil manager (e.g., NewHandlerWithoutConfigFilePath(..., nil)), calling these endpoints before manager wiring (or in nil-manager contexts) will panic instead of returning a controlled HTTP error.
Useful? React with 👍 / 👎.
luispater
left a comment
There was a problem hiding this comment.
Summary
This PR adds configurable OAuth quota-group suspension for OAuth-backed providers (antigravity, gemini, gemini-cli), including:
- Config-driven quota groups (
oauth-quota-groups) with default bootstrap groups + normalization. - Persisted per-auth + per-group manual/auto suspension state (
oauth-account-quota-group-state) inconfig.yaml. - Selector-side enforcement to block suspended groups, and group-level 429 cooldown routing (instead of per-model freezing).
- Management API endpoints to view/update quota groups and suspension state.
- Background cleanup loop to reap expired auto suspensions (moved off the hot path).
- Atomic config writes to avoid truncation/corruption and preserve restrictive perms.
- Focused tests (including the Antigravity Claude-vs-Gemini behavior).
Overall the quota-group architecture and the “publish via cloned runtime config + atomic persistence” direction look solid and well-tested.
Blocking: upstream attempt timeout
sdk/cliproxy/auth/conductor.go introduces upstreamAttemptTimeout = 25s and uses it to cancel each upstream attempt for:
executor.Execute(...)(non-stream)executor.CountTokens(...)executor.ExecuteStream(...)(bootstrap)
This is a major behavioral change for non-stream requests: legitimate responses that take >25s will now be canceled and surfaced as a synthetic upstream_timeout (HTTP 504), and it appears to conflict with this repo’s timeout policy (timeouts only allowed in limited, explicitly-listed places).
Request: please remove this timeout, or gate it behind explicit config (default disabled), and/or scope it narrowly (e.g., stream bootstrap and possibly CountTokens, but not full non-stream Execute).
Non-blocking suggestions
- Add
h.authManager == nilguards to the new quota-group PATCH handlers for consistency/safety. - Make
ClearOAuthQuotaGroupAutoState/ClearAllOAuthQuotaGroupAutoStatesreturnnil(or a “changed” flag) when no state changes, so management calls don’t rewriteconfig.yamlunnecessarily. - Consider tightening
writeConfigFileAtomically’s “delete target then retry rename” branch to only the error cases it actually intends to handle.
Suggested test plan (not run locally)
go test -race -count=1 ./sdk/cliproxy/auth/... ./internal/config/...go test ./...
Summary
Add configurable OAuth quota-group suspension for OAuth-backed providers (
antigravity,gemini,gemini-cli).This introduces a generic quota-group layer that can persist manual and automatic suspension state, route model-scoped 429 quota exhaustion into group-level cooldowns, and keep transient capacity failures model-scoped. The main motivation is to make provider behavior match real OAuth quota semantics for accounts that share quota across model families. In the current implementation a 429 usually freezes only the exact model that tripped it, which means a burned Claude family can still be retried through adjacent Claude models until they fail individually.
What this change adds
oauth-quota-groups) with provider + wildcard pattern matching.config.yamlunderoauth-account-quota-group-state, preserving comments.Additional follow-ups on this branch
Two follow-up fixes now ride on top of the reviewed quota-group core:
count_tokensreturns404, which can happen on non-Anthropic compat upstreams that do not implement that endpoint.History
This was originally submitted against the downstream Plus repository as router-for-me/CLIProxyAPIPlus#524 and closed by the maintainer with the note that the Plus project only accepts third-party-provider PRs; core
AuthManagerchanges belong in this mainline repo. The main dashboard side for quota-group controls already shipped as itsmylife44/cliproxyapi-dashboard#192, and the new bulk-clear UI follow-up is in itsmylife44/cliproxyapi-dashboard#204.Code review feedback from the previous submission
The automated review on the closed PR flagged three concerns. This resubmission addresses all of them in a follow-up commit (
fix(quota-groups): address PR #524 code review feedback):Thread-safety when mutating shared config. The previous version loaded the shared
*internalconfig.Configpointer fromatomic.Valueand then mutated theOAuthAccountQuotaGroupStateslice header in place. Any reader that still held the previous pointer could observe a torn slice header.cloneRuntimeConfigForQuotaGroups()returns a shallow copy of the current snapshot with an independently owned state slice. All three mutation sites (setOAuthQuotaGroupAutoStateLocked,clearOAuthQuotaGroupAutoStateLocked,ClearExpiredOAuthQuotaGroupAutoStates) now mutate the copy and publish it viaSetConfig. No in-place mutation of the shared snapshot remains.File corruption from concurrent writes of
config.yaml. The previous version usedos.Create+Write, which truncates the live file before the new bytes are written and leaves the file in an inconsistent state on crash or concurrent writer.SaveConfigPreserveCommentsnow renders the merged YAML fully in memory first, then calls the newwriteFileAtomichelper:CreateTempsibling ->Write->Sync->Chmod->Rename. Rename is atomic on POSIX and best-effort atomic on Windows.oauthQuotaGroupPersistMuserializes quota-group-initiated writes so the merge/render step is not performed concurrently on the same file even though the final rename is atomic.Synchronous expired-state cleanup on the hot path. The previous version called
ClearExpiredOAuthQuotaGroupAutoStatesat the top ofExecute,ExecuteCount, andExecuteStream. That call took the globalManagermutex and, on any expiry, synchronously rewroteconfig.yaml— inside the request path.oauthQuotaGroupBlockread already ignores entries whoseAutoSuspendedUntilis in the past, so the request path stays correct without cleanup.StartOAuthQuotaGroupCleanup/StopOAuthQuotaGroupCleanuppair runs the reaper in a single background goroutine (30s ticker + 1s primer), tracked byManager.quotaGroupCleanupCancelso only one loop is alive at a time. It is auto-wired into the existingStartAutoRefresh/StopAutoRefreshlifecycle so hosts get it for free.Verification
go build ./...— OK ongolang:1.26-alpineduring the earlier verification pass.go test -race -count=1 ./sdk/cliproxy/auth/... ./internal/config/...— OK during the earlier verification pass; no data races reported.conductor_oauth_alias_suspension_test.go,antigravity_quota_groups_behavior_test.go,oauth_quota_groups_test.go,oauth_quota_groups_manager(viaauthpackage tests), andinternal/configtests passed in that run.count_tokens404 handling and bulk auto-clear) were not re-run on this Windows workstation because the Go toolchain is currently unavailable here.Commits
feat(auth): Add OAuth quota group suspension— the feature itself (ported from the closed PR).fix(quota-groups): address PR #524 code review feedback— the three review fixes above.fix(quota-groups): publish runtime state atomically— tighten publication ordering and cover runtime snapshot refresh.fix(conductor): skip auth suspension for count_tokens 404s— avoid false cooldowns on compat upstreams withoutcount_tokens.fix(quota): Add bulk OAuth cooldown clear— add the operator escape hatch for early quota resets.Related