Skip to content

feat(auth): Add OAuth quota group suspension#2885

Open
38lunin021189-creator wants to merge 8 commits intorouter-for-me:devfrom
38lunin021189-creator:feat/quota-group-suspension
Open

feat(auth): Add OAuth quota group suspension#2885
38lunin021189-creator wants to merge 8 commits intorouter-for-me:devfrom
38lunin021189-creator:feat/quota-group-suspension

Conversation

@38lunin021189-creator
Copy link
Copy Markdown

@38lunin021189-creator 38lunin021189-creator commented Apr 18, 2026

Summary

Add configurable OAuth quota-group suspension for OAuth-backed providers (antigravity, gemini, gemini-cli).

This introduces a generic quota-group layer that can persist manual and automatic suspension state, route model-scoped 429 quota exhaustion into group-level cooldowns, and keep transient capacity failures model-scoped. The main motivation is to make provider behavior match real OAuth quota semantics for accounts that share quota across model families. In the current implementation a 429 usually freezes only the exact model that tripped it, which means a burned Claude family can still be retried through adjacent Claude models until they fail individually.

What this change adds

  • Config-driven quota groups (oauth-quota-groups) with provider + wildcard pattern matching.
  • Selector enforcement of group-level suspension (both manual and automatic cooldown).
  • Persistence of suspension state in config.yaml under oauth-account-quota-group-state, preserving comments.
  • Management endpoints for admin-controlled suspend / lift / clear-cooldown.
  • Focused tests including the Antigravity case where Claude can be suspended while Gemini Flash remains available.

Additional follow-ups on this branch

Two follow-up fixes now ride on top of the reviewed quota-group core:

  1. Skip auth suspension when count_tokens returns 404, which can happen on non-Anthropic compat upstreams that do not implement that endpoint.
  2. Add a bulk auto-cooldown clear path for one OAuth account while preserving manual suspensions, so operators can recover cleanly after an early upstream quota reset.

History

This was originally submitted against the downstream Plus repository as router-for-me/CLIProxyAPIPlus#524 and closed by the maintainer with the note that the Plus project only accepts third-party-provider PRs; core AuthManager changes belong in this mainline repo. The main dashboard side for quota-group controls already shipped as itsmylife44/cliproxyapi-dashboard#192, and the new bulk-clear UI follow-up is in itsmylife44/cliproxyapi-dashboard#204.

Code review feedback from the previous submission

The automated review on the closed PR flagged three concerns. This resubmission addresses all of them in a follow-up commit (fix(quota-groups): address PR #524 code review feedback):

  1. Thread-safety when mutating shared config. The previous version loaded the shared *internalconfig.Config pointer from atomic.Value and then mutated the OAuthAccountQuotaGroupState slice header in place. Any reader that still held the previous pointer could observe a torn slice header.

    • Now cloneRuntimeConfigForQuotaGroups() returns a shallow copy of the current snapshot with an independently owned state slice. All three mutation sites (setOAuthQuotaGroupAutoStateLocked, clearOAuthQuotaGroupAutoStateLocked, ClearExpiredOAuthQuotaGroupAutoStates) now mutate the copy and publish it via SetConfig. No in-place mutation of the shared snapshot remains.
  2. File corruption from concurrent writes of config.yaml. The previous version used os.Create + Write, which truncates the live file before the new bytes are written and leaves the file in an inconsistent state on crash or concurrent writer.

    • SaveConfigPreserveComments now renders the merged YAML fully in memory first, then calls the new writeFileAtomic helper: CreateTemp sibling -> Write -> Sync -> Chmod -> Rename. Rename is atomic on POSIX and best-effort atomic on Windows.
    • A package-level oauthQuotaGroupPersistMu serializes quota-group-initiated writes so the merge/render step is not performed concurrently on the same file even though the final rename is atomic.
  3. Synchronous expired-state cleanup on the hot path. The previous version called ClearExpiredOAuthQuotaGroupAutoStates at the top of Execute, ExecuteCount, and ExecuteStream. That call took the global Manager mutex and, on any expiry, synchronously rewrote config.yaml — inside the request path.

    • All three hot-path calls are removed. The lock-free oauthQuotaGroupBlock read already ignores entries whose AutoSuspendedUntil is in the past, so the request path stays correct without cleanup.
    • A new StartOAuthQuotaGroupCleanup / StopOAuthQuotaGroupCleanup pair runs the reaper in a single background goroutine (30s ticker + 1s primer), tracked by Manager.quotaGroupCleanupCancel so only one loop is alive at a time. It is auto-wired into the existing StartAutoRefresh / StopAutoRefresh lifecycle so hosts get it for free.

Verification

  • go build ./... — OK on golang:1.26-alpine during the earlier verification pass.
  • go test -race -count=1 ./sdk/cliproxy/auth/... ./internal/config/... — OK during the earlier verification pass; no data races reported.
  • Existing conductor_oauth_alias_suspension_test.go, antigravity_quota_groups_behavior_test.go, oauth_quota_groups_test.go, oauth_quota_groups_manager (via auth package tests), and internal/config tests passed in that run.
  • The two latest follow-up commits (count_tokens 404 handling and bulk auto-clear) were not re-run on this Windows workstation because the Go toolchain is currently unavailable here.

Commits

  1. feat(auth): Add OAuth quota group suspension — the feature itself (ported from the closed PR).
  2. fix(quota-groups): address PR #524 code review feedback — the three review fixes above.
  3. fix(quota-groups): publish runtime state atomically — tighten publication ordering and cover runtime snapshot refresh.
  4. fix(conductor): skip auth suspension for count_tokens 404s — avoid false cooldowns on compat upstreams without count_tokens.
  5. fix(quota): Add bulk OAuth cooldown clear — add the operator escape hatch for early quota resets.

Related

38lunin021189-creator and others added 2 commits April 18, 2026 07:50
Add configurable OAuth quota groups with persisted auto and manual suspension state for OAuth-backed providers.

Route 429 quota exhaustion through group-level cooldowns, keep transient 503 handling model-scoped, expose management endpoints for inspecting and updating group state, and cover the Antigravity Claude-versus-Gemini behavior with tests.

Co-Authored-By: Codex <[email protected]>
Apply three fixes called out in the gemini-code-assist review of PR
router-for-me/CLIProxyAPIPlus#524 before resubmitting the feature against
the mainline repository:

1. Thread-safety (shared config mutation):
   - Introduce cloneRuntimeConfigForQuotaGroups() that returns a shallow
     copy of the currently published *internalconfig.Config with an
     independently owned OAuthAccountQuotaGroupState slice.
   - Replace the three callsites in oauth_quota_groups_manager.go that
     previously loaded the shared pointer and mutated it in place. The
     mutation now happens on a private copy before SetConfig publishes
     the new pointer via atomic.Value, so readers that still hold the
     previous pointer never observe a torn slice header.

2. Atomic persistence of config.yaml:
   - Change SaveConfigPreserveComments to render the merged YAML fully
     in memory first, then write it to disk via a new writeFileAtomic
     helper (CreateTemp sibling + Write + Sync + Chmod + Rename). A
     crash, concurrent writer, or failed Sync can no longer leave a
     truncated or partially-written live config.
   - Serialize quota-group driven writes with a package-level mutex
     (oauthQuotaGroupPersistMu) so we never rebuild/merge the same file
     from multiple goroutines in parallel even though the final rename
     is atomic.

3. Async cleanup of expired auto suspensions:
   - Remove the three synchronous m.ClearExpiredOAuthQuotaGroupAutoStates
     calls from Execute / ExecuteCount / ExecuteStream. The per-request
     lock-free oauthQuotaGroupBlock read already ignores entries whose
     AutoSuspendedUntil is in the past, so the hot path no longer needs
     to acquire the global Manager lock or write config.yaml.
   - Add StartOAuthQuotaGroupCleanup / StopOAuthQuotaGroupCleanup that
     run the reaper in a single background goroutine with a 30s ticker
     and a one-second primer, cancellable via the supplied context. The
     Manager.quotaGroupCleanupCancel field tracks the active loop and
     guarantees only one is alive at a time.
   - Wire the new loop into the existing StartAutoRefresh /
     StopAutoRefresh lifecycle so hosts that already manage background
     loops get expired-state cleanup for free.

Verified with `go build ./...` and `go test -race ./sdk/cliproxy/auth/...
./internal/config/...` on go1.26-alpine; no data races reported.
@github-actions
Copy link
Copy Markdown

This pull request targeted main.

The base branch has been automatically changed to dev.

@github-actions github-actions Bot changed the base branch from main to dev April 18, 2026 03:02
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements OAuth quota groups to manage shared quota families for Google-backed models, including persistence for manual and automatic suspension states. It introduces management API endpoints for state control, background cleanup for expired auto-suspensions, and an atomic configuration saving mechanism. The review feedback identifies several critical concurrency and consistency issues, specifically regarding thread-unsafe mutations of shared configuration objects in the management handlers and the use of stale state snapshots when calculating account availability. Additionally, it is recommended to improve error handling during file permission updates in the atomic write process.

}
entries = wrapper.Items
}
h.cfg.OAuthQuotaGroups = config.NormalizeOAuthQuotaGroups(entries)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Data race: Mutating h.cfg.OAuthQuotaGroups in place is not thread-safe. This configuration object is shared with the AuthManager and other components. Concurrent reads (e.g., during request routing or in GetAuthFileQuotaGroups) could observe a torn slice header or inconsistent data. You should clone the configuration object, mutate the clone, and then update the reference atomically, or ensure all access to h.cfg is protected by a lock that is also respected by readers.

Comment on lines +92 to +94
h.authManager.ClearExpiredOAuthQuotaGroupAutoStates(time.Now())
definitions := config.NormalizeOAuthQuotaGroups(h.cfg.OAuthQuotaGroups)
stateEntries := config.NormalizeOAuthAccountQuotaGroupState(h.cfg.OAuthAccountQuotaGroupState)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Consistency issue: h.cfg in the management handler is not automatically updated when h.authManager.ClearExpiredOAuthQuotaGroupAutoStates modifies the configuration. As a result, stateEntries is populated from a stale version of the configuration, causing the API to return expired suspension states that were just cleared. More importantly, any subsequent manual update via PatchAuthFileQuotaGroupsManual will use the stale h.cfg as a base, potentially re-introducing expired auto-suspension states into the config file and AuthManager.

Comment on lines +192 to +194
h.cfg.OAuthAccountQuotaGroupState = config.RemoveOAuthAccountQuotaGroupState(h.cfg.OAuthAccountQuotaGroupState, authID, groupID)
} else {
h.cfg.OAuthAccountQuotaGroupState = config.UpsertOAuthAccountQuotaGroupState(h.cfg.OAuthAccountQuotaGroupState, current)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Data race and consistency issue: Mutating h.cfg.OAuthAccountQuotaGroupState outside of a lock that protects the entire read-modify-write cycle is unsafe. Multiple concurrent management requests could result in lost updates. Additionally, since h.cfg is shared, this mutation is visible to other goroutines without proper synchronization. Furthermore, h.cfg may be stale if the AuthManager has performed background cleanups, leading to the re-introduction of expired states.

Comment thread internal/config/config.go Outdated
Comment on lines +1129 to +1132
if err = tmp.Chmod(perm); err != nil {
// Non-fatal on Windows where Chmod has limited effect; ignore.
_ = err
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Ignoring the error from tmp.Chmod might lead to incorrect file permissions on non-Windows systems if the operation fails for reasons other than platform incompatibility (e.g., restricted filesystem features). It's safer to check the error and only ignore it if it's a known non-fatal condition or if the platform is Windows.

}
state := ensureModelState(auth, result.Model)
resetModelState(state, now)
updateAggregatedAvailability(auth, now)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Stale snapshot usage: updateAggregatedAvailability relies on the global oauthQuotaRuntime snapshot to determine the status of quota groups. However, this snapshot is only updated when m.SetConfig is called (line 2189). Calling updateAggregatedAvailability here uses the state from before the current result was processed, leading to an incorrect auth.Unavailable status until the next time availability is recalculated.

if auth == nil {
continue
}
updateAggregatedAvailability(auth, now)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Stale snapshot usage: updateAggregatedAvailability is called here before the updated configuration has been published via m.SetConfig (line 232). This means the availability calculation uses the old snapshot which still contains the expired suspension states. Consequently, the Auth objects in m.auths and the scheduler will retain a stale Unavailable status until a subsequent request or cleanup tick triggers another update.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e573763673

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread internal/config/config.go Outdated
// Write atomically: temp file in the same directory, fsync, then rename.
// This prevents a partial write from corrupting the live config on crash
// or concurrent writer races.
return writeFileAtomic(configFile, data, 0o644)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve restrictive permissions when saving config

Passing a hard-coded 0o644 here makes every persisted config.yaml world-readable regardless of the process umask or the file’s prior mode, which can expose embedded secrets (API keys/tokens) on multi-user hosts. Before this change, os.Create respected umask, so deployments using umask 077 would keep config private; now each write can silently relax permissions.

Useful? React with 👍 / 👎.

return false
}
if h.authManager != nil {
h.authManager.SetConfig(h.cfg)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid resetting manager state from stale handler config

This pushes h.cfg back into authManager after every management write, but quota-group auto state is now mutated inside Manager.MarkResult and published via its own runtime config snapshot. If a 429 updates quota-group state and a management request persists before file-watcher reload refreshes h.cfg, this call overwrites the newer runtime snapshot with stale data, dropping active cooldown/manual state and causing inconsistent routing behavior.

Useful? React with 👍 / 👎.

Comment thread sdk/cliproxy/auth/conductor.go Outdated
}
m.runtimeConfig.Store(cfg)
m.applyBuiltInSelectorForConfig(cfg)
SetOAuthQuotaRuntimeConfig(cfg)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope quota runtime config to each Manager instance

SetConfig writes quota-group routing state into a process-global snapshot, so multiple auth.Manager instances in the same process overwrite each other’s quota groups/aliases/state. That breaks isolation for embedded SDK use (or parallel services/tests) because routing decisions in one manager can start using another manager’s config unexpectedly.

Useful? React with 👍 / 👎.

Apply the nine inline review comments left by gemini-code-assist and
chatgpt-codex-connector on PR router-for-me#2885.

Management handler thread-safety (addresses gemini HIGH #1/router-for-me#2/router-for-me#3 and
codex P1 on handler.go):

  - internal/api/handlers/management/quota_groups.go no longer mutates
    h.cfg.OAuthQuotaGroups or h.cfg.OAuthAccountQuotaGroupState directly.
    Reads go through the new authManager.OAuthQuotaGroupStateSnapshot(),
    so every call sees the freshly published runtime snapshot even after
    a background ClearExpiredOAuthQuotaGroupAutoStates tick.
  - Manual suspend / lift and explicit auto-clear use two new Manager
    methods (SetOAuthQuotaGroupManualState and ClearOAuthQuotaGroupAutoState)
    that take m.mu.Lock, copy the current runtime *Config, mutate the
    copy and return it to the caller. The handler then calls h.persist()
    which takes the global persist mutex introduced in round 1.
  - internal/api/handlers/management/handler.go stops pushing the stale
    h.cfg back into authManager via SetConfig in both SetAuthManager and
    persist(). Only the config-file path is propagated, preserving the
    newer runtime snapshot produced by MarkResult / cleanup / management
    writes.

Manager-scoped quota-group runtime (addresses codex P2 router-for-me#9):

  - sdk/cliproxy/auth/oauth_quota_groups.go drops the package-global
    var oauthQuotaRuntime atomic.Value / init() / SetOAuthQuotaRuntimeConfig
    entry points. The snapshot is now looked up via the new helper
    oauthQuotaRuntimeSnapshotForAuth(auth) which reads auth.quotaRuntime.
  - sdk/cliproxy/auth/types.go adds a private quotaRuntime pointer to
    Auth and Auth.Clone copies it forward.
  - sdk/cliproxy/auth/conductor.go adds a Manager.oauthQuotaRuntime
    atomic.Value, initialises it in NewManager, refreshes it inside
    SetConfig (while holding m.mu briefly to fan-out the new pointer to
    every cached Auth), and makes Register / Update / Load attach the
    current snapshot to newly registered Auths. Multiple Manager
    instances in the same process no longer overwrite each other's
    routing state.

Atomic config.yaml write preserves existing permissions (addresses
codex P1 security router-for-me#7 and gemini MEDIUM router-for-me#4):

  - internal/config/config.go replaces the round-1 writeFileAtomic(..,
    0o644) helper with writeConfigFileAtomically(configFile, data).
    The new helper stat()s the existing file and reuses its permission
    bits; a 0o600 default is used when the file does not yet exist.
    Chmod errors are propagated, not swallowed, so a failed permission
    preservation cannot silently produce a world-readable secret file.
    Both SaveConfigPreserveComments and
    SaveConfigPreserveCommentsUpdateNestedScalar now use the atomic
    writer, so the nested-scalar path no longer leaves a truncated file
    on crash.

Ordering of updateAggregatedAvailability (addresses gemini MEDIUM router-for-me#5/router-for-me#6):

  - sdk/cliproxy/auth/conductor.go: updateAggregatedAvailability now
    reads quota-group state via oauthQuotaGroupState(auth, groupID), so
    every Auth sees its own manager's freshly published snapshot rather
    than a process-global stale one.

Tests:

  - sdk/cliproxy/auth/oauth_quota_groups_test.go switches from the old
    SetOAuthQuotaRuntimeConfig / global-cleanup pattern to a local
    applyOAuthQuotaRuntime(cfg, auths...) helper that wires the
    snapshot into each Auth directly, matching the new per-Auth design.

Verified with:

  - docker run golang:1.26-alpine go build ./...
  - docker run golang:1.26-alpine apk add gcc musl-dev; go test -race
    -count=1 -timeout 180s ./sdk/cliproxy/auth/... ./internal/config/...

Both packages pass with the race detector; no data races remain.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1edeee7b86

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +177 to +179
if next := h.authManager.SetOAuthQuotaGroupManualState(authID, groupID, *body.ManualSuspended, body.Reason, body.UpdatedBy, now); next != nil {
h.cfg = next
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Publish quota snapshot after manual quota-group changes

This handler stores the updated config in h.cfg and persists it, but it never calls authManager.SetConfig(next) after SetOAuthQuotaGroupManualState (same pattern in PatchAuthFileQuotaGroupsAutoClear). Selection enforces quota-group suspension from auth.quotaRuntime snapshots (oauthQuotaGroupBlock), and those snapshots are only rebuilt in Manager.SetConfig, so manual suspend/clear can appear applied in management responses while live routing still uses stale quota-group state until an external config reload happens.

Useful? React with 👍 / 👎.

Comment thread sdk/cliproxy/auth/conductor.go Outdated
Comment on lines +2209 to +2212
m.mu.Unlock()
if runtimeCfgToPublish != nil {
m.SetConfig(runtimeCfgToPublish)
if persistRuntimeConfig {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep quota snapshot publish inside MarkResult critical section

MarkResult computes runtimeCfgToPublish while holding m.mu, but releases the lock before calling SetConfig. Under concurrent 429 handling, two goroutines can both clone from the same old runtime config and then publish in sequence, causing the later publish to overwrite and drop the earlier auth/group cooldown entry. That creates nondeterministic missing suspensions under production concurrency.

Useful? React with 👍 / 👎.

Keep OAuth quota-group snapshot publication inside the manager critical section so concurrent cooldown updates do not overwrite each other, and manual management changes take effect immediately in live routing.\n\nAdd regression tests covering manual suspension publication, auto-clear publication, and aggregated availability during quota-group cooldowns.\n\nCo-Authored-By: Codex <[email protected]>
@38lunin021189-creator
Copy link
Copy Markdown
Author

Addressed the latest quota-group review items in commit e035c2a.

What changed:

  • publish runtime OAuth quota snapshots while still holding the manager lock, so concurrent cooldown updates do not drop each other
  • apply manual suspend and auto-clear changes to live routing immediately instead of waiting for an external config reload
  • update aggregated auth availability against the freshly published quota snapshot
  • add regression coverage for manual publish, auto-clear publish, and quota-group cooldown lifecycle

Verification:

  • docker run --rm -v "${PWD}:/workspace" -w /workspace golang:1.26 go test ./sdk/cliproxy/auth
  • docker run --rm -v "${PWD}:/workspace" -w /workspace golang:1.26 go build -o /tmp/test-output ./cmd/server

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e035c2ae75

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +66 to +67
prev, _ = m.runtimeConfig.Load().(*internalconfig.Config)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Persist quota-state changes from fresh disk config

cloneRuntimeConfigForQuotaGroups builds updates from m.runtimeConfig, and later persistence writes that whole snapshot back to config.yaml. If management/external writes have already changed the file but watcher reload has not yet refreshed m.runtimeConfig, a 429-triggered quota-state write will serialize stale fields and silently roll back unrelated config edits. This creates deterministic lost updates during the reload gap.

Useful? React with 👍 / 👎.

Comment on lines +37 to +41
wd, err := os.Getwd()
if err != nil || strings.TrimSpace(wd) == "" {
return ""
}
return filepath.Join(wd, "config.yaml")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require explicit config path for quota-state persistence

When SetConfigFilePath is never called, configFilePathValue falls back to <cwd>/config.yaml. In embedded SDK usage this makes quota-group runtime events (e.g., 429 cooldown updates) attempt writes to an arbitrary working-directory file, which can modify the wrong config or emit repeated persistence warnings under load. Persistence should be opt-in when no explicit path is configured.

Useful? React with 👍 / 👎.

Comment thread internal/config/config.go Outdated
Comment on lines +1145 to +1146
_ = os.Remove(configFile)
if retryErr := os.Rename(tmpName, configFile); retryErr != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep existing config until replacement rename succeeds

On rename failure, the fallback immediately removes configFile before retrying os.Rename. If the retry fails too (for transient permission/filesystem reasons), the function returns with the original config already deleted, turning a write error into data loss. The old file should remain intact unless a replacement has been successfully installed.

Useful? React with 👍 / 👎.

38lunin021189-creator and others added 3 commits April 22, 2026 01:01
Port the upstream-compatible account and quota-group suspension changes from the retired fork-side working tree into the mainline branch.

Keep the conductor, selector, filestore, vertex, and config persistence updates on the branch we plan to carry alongside upstream. Add the atomic config write fallback test so the Windows-safe config path stays covered while the fork-only provider tail remains archived separately.

Co-Authored-By: Codex <[email protected]>
Non-Anthropic upstreams (for example third-party proxies) may not support
`/v1/messages/count_tokens` and can return `404`. Treating that as a model
failure suspends the auth for hours and causes follow-on `503` responses.

Skip `MarkResult` for `404` failures in `executeCountMixedOnce` so
`count_tokens` misses do not trigger auth suspension. Other statuses such as
`401`, `429`, and `500` continue to be recorded normally.

Co-Authored-By: Claude Opus 4.7 <[email protected]>
(cherry picked from commit 92c2f75)
Co-Authored-By: Codex <[email protected]>
Allow operators to clear every auto cooldown for a single OAuth
account when the upstream quota resets earlier than the stored
cooldown window.

Preserve manual suspensions during the bulk clear so the escape
hatch only removes stale auto freezes.

Co-Authored-By: Codex <[email protected]>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22593c982d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +2174 to +2176
timer := time.AfterFunc(upstreamAttemptTimeout, func() {
timedOut.Store(true)
cancel()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Remove hard-coded 25s upstream execution timeout

This introduces a fixed 25-second cancellation for every upstream attempt via newUpstreamAttemptContext, so any legitimate completion/count/bootstrap that takes longer is forcibly turned into an upstream_timeout/504 even when the caller context is still valid. Because this timeout is global and not configurable, slower providers or large prompts will now deterministically fail/retry where they previously could succeed, which is a production behavior regression.

Useful? React with 👍 / 👎.

}

now := time.Now().UTC()
if next := h.authManager.SetOAuthQuotaGroupManualState(authID, groupID, *body.ManualSuspended, body.Reason, body.UpdatedBy, now); next != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check auth manager before applying quota-group patches

Both quota-group PATCH handlers dereference h.authManager without a nil guard, unlike other management handlers. Since this package explicitly supports constructing a handler with a nil manager (e.g., NewHandlerWithoutConfigFilePath(..., nil)), calling these endpoints before manager wiring (or in nil-manager contexts) will panic instead of returning a controlled HTTP error.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator

@luispater luispater left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

This PR adds configurable OAuth quota-group suspension for OAuth-backed providers (antigravity, gemini, gemini-cli), including:

  • Config-driven quota groups (oauth-quota-groups) with default bootstrap groups + normalization.
  • Persisted per-auth + per-group manual/auto suspension state (oauth-account-quota-group-state) in config.yaml.
  • Selector-side enforcement to block suspended groups, and group-level 429 cooldown routing (instead of per-model freezing).
  • Management API endpoints to view/update quota groups and suspension state.
  • Background cleanup loop to reap expired auto suspensions (moved off the hot path).
  • Atomic config writes to avoid truncation/corruption and preserve restrictive perms.
  • Focused tests (including the Antigravity Claude-vs-Gemini behavior).

Overall the quota-group architecture and the “publish via cloned runtime config + atomic persistence” direction look solid and well-tested.

Blocking: upstream attempt timeout

sdk/cliproxy/auth/conductor.go introduces upstreamAttemptTimeout = 25s and uses it to cancel each upstream attempt for:

  • executor.Execute(...) (non-stream)
  • executor.CountTokens(...)
  • executor.ExecuteStream(...) (bootstrap)

This is a major behavioral change for non-stream requests: legitimate responses that take >25s will now be canceled and surfaced as a synthetic upstream_timeout (HTTP 504), and it appears to conflict with this repo’s timeout policy (timeouts only allowed in limited, explicitly-listed places).

Request: please remove this timeout, or gate it behind explicit config (default disabled), and/or scope it narrowly (e.g., stream bootstrap and possibly CountTokens, but not full non-stream Execute).

Non-blocking suggestions

  • Add h.authManager == nil guards to the new quota-group PATCH handlers for consistency/safety.
  • Make ClearOAuthQuotaGroupAutoState / ClearAllOAuthQuotaGroupAutoStates return nil (or a “changed” flag) when no state changes, so management calls don’t rewrite config.yaml unnecessarily.
  • Consider tightening writeConfigFileAtomically’s “delete target then retry rename” branch to only the error cases it actually intends to handle.

Suggested test plan (not run locally)

  • go test -race -count=1 ./sdk/cliproxy/auth/... ./internal/config/...
  • go test ./...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants