Skip to content

fix(auth): retire rate_limit_until, unify on cachedQuota single truth#492

Merged
icebear0828 merged 2 commits into
devfrom
fix/account-quota-single-truth
May 11, 2026
Merged

fix(auth): retire rate_limit_until, unify on cachedQuota single truth#492
icebear0828 merged 2 commits into
devfrom
fix/account-quota-single-truth

Conversation

@icebear0828
Copy link
Copy Markdown
Owner

Summary

Eliminate the dual-source-of-truth between local rate_limit_until + status === "rate_limited" and cachedQuota.<bucket>.limit_reached that has been causing stale locks. The two signals drift apart whenever markStatus(_, "active") is called from refresh-scheduler / account-mutation (which never clear rate_limit_until), leaving accounts locally locked long after the upstream window has actually reset. Visible to users as "已限速 badge + 0% 已使用" contradiction; for free-plan accounts (7-day primary window), the orphan lock could persist for up to a week past the real reset.

cachedQuota is now the only signal. All "is this account exhausted" decisions flow through hasReachedCachedQuota (which already covers all 3 buckets — primary / secondary / code_review).

While auditing this we also pinned down the long-running "premature stream close" complaint: native rust transport now emits the full reqwest cause chain, which revealed upstream chatgpt.com hard-cuts HTTP/1.1 chunked encoding mid-chunk after ~120s of reasoning without output_text. Surface this distinctly so cross-account empty-response retry doesn't burn 3 accounts × 120s on a deterministic upstream cap.

Changes

Backend (src/auth/):

  • Delete markRateLimited / clearRateLimit / markQuotaExhausted; add applyRateLimit429() that writes into cachedQuota.rate_limit.{limit_reached=true, reset_at} and never shrinks an existing reset_at (src/auth/account-registry.ts, src/auth/account-pool.ts).
  • Drop "rate_limited" from AccountStatus enum; status is now pure rotation marker (src/auth/types.ts).
  • refreshStatus loses its rate-limit clearing branch — resetExpiredQuotaWindow already auto-clears limit_reached when reset_at passes (src/auth/account-registry.ts).
  • isAuthenticated + getPoolSummary consult hasReachedCachedQuota so an all-exhausted pool reports correctly.
  • Export isQuotaExhausted(quota) helper (src/auth/quota-skip.ts).
  • One-shot accounts.json migration in loadPersisted via migrateLegacyRateLimit — converts legacy status="rate_limited" + rate_limit_until to status="active" + synthesized cachedQuota.rate_limit only when the local lock is fresher than cachedQuota (src/auth/account-persistence.ts).

Routes / proxy plumbing:

  • 429 handler routes through new method, no longer combines retry-after with cachedQuota inside the handler (logic moved to registry layer) (src/routes/shared/proxy-error-handler.ts).
  • Success-path proactive marking uses the same helper (src/routes/shared/proxy-handler.ts).

Premature-close diagnostics (bonus, surfaced during the audit):

  • UpstreamPrematureCloseError (src/translation/codex-event-extractor.ts); collectCodexResponse throws it when stream ends without response.completed/failed/error terminal event. proxy-handler returns 504 fail-fast instead of triggering empty-response cross-account retry (src/translation/codex-to-openai.ts).
  • Native rust transport upgraded eprintln!("Stream error: {e}") to error_chain(&e) so hyper cause chains are visible in stderr (native/src/lib.rs).

Dashboard:

  • New derivedStatus(account) + isQuotaExhausted(quota) helper (web/src/lib/accountStatus.ts).
  • AccountCard, AccountList filter dropdown + counts, AccountTable, AccountManagement filter ladder all wired to render "rate_limited" badge based on cachedQuota — fixes the "已活跃 + 7d 已达上限" contradiction.

Tests:

  • New tests/unit/auth/account-pool-rate-limit-429.test.ts (9 cases) covering primary-bucket write, never-shrink rule, hasReachedCachedQuota exclusion, status not mutated, request counting, auto-clear via resetExpiredQuotaWindow.
  • New tests/unit/auth/account-persistence-migration.test.ts (7 cases) covering legacy-future-lock synthesis, fresh-cachedQuota-trust, stale-cachedQuota-overwrite, orphan-past-lock cleanup.
  • Rewrote ~10 existing tests (account-pool-quota, account-pool-config-di, account-pool, account-pool-has-available, account-pool-sticky, account-routing, proxy-handler integration, quota-refresh e2e, secondary-quota stress, proxy-error-handler unit) to assert via applyRateLimit429 + isQuotaExhausted(account.quota) instead of the retired fields.

CHANGELOG [Unreleased] → Fixed entries cover both the dual-truth fix (with free-plan severity callout) and the premature-close 504 fail-fast.

Test Plan

  • npm test — 1927 passed, 1 skipped, 0 failed (1 pre-existing flaky usage-stats history test confirmed unrelated via git stash baseline check)
  • npx tsc --noEmit — clean
  • npm run build — vite bundle + tsc both clean
  • Migration smoke: backed up data/accounts.json to .bak-pre-refactor, restarted prod proxy, confirmed via curl /auth/accounts | jq that previously-locked miles.halloway had orphan rate_limit_until cleared and a1432350557 correctly reports secondary_rate_limit.limit_reached=true from cachedQuota
  • Manual verification: derivedStatus returns "rate_limited" for the a1432350557-style account (status=active + secondary.limit_reached=true) — fixes Bug B
  • npm run test:real not run this session (network-bound upstream test)

Notes

  • accounts.json migration is one-way: the legacy field is dropped on next persist. Operators should keep a backup before rolling out. Rolling restart back to old build will see rate_limit_until: null and status="active" on previously-locked accounts (brief lock amnesia → handful of extra 429s, acceptable for a one-time migration).
  • For free-plan users the bug was meaningfully worse than for plus (week-long orphan locks vs ~5h). Worth highlighting in release notes when this lands on stable.
  • The 504 fail-fast for UpstreamPrematureCloseError is a new client-visible status code for a previously-502 condition. Codex CLI and Claude Code both treat 504 as retryable, so end-user behavior should improve (no longer burns 3 accounts before reporting).

The proxy used to maintain two independent signals for "this account is
locked": entry.usage.rate_limit_until (local lock from 429 retry-after)
and entry.cachedQuota.<bucket>.limit_reached (passive collection from
upstream rate_limits headers). The two drift because four callers of
markStatus(_, "active") in refresh-scheduler / account-mutation flip the
status away from "rate_limited" without clearing rate_limit_until. The
orphaned field then outlives the upstream window: dashboard shows
"已限速" while cachedQuota reads 0% used, and free-plan users with a 7-day
primary bucket can stay locked for a week past the actual reset.

Single-source-of-truth fix:

- Delete markRateLimited / clearRateLimit / markQuotaExhausted; add
  applyRateLimit429() that writes the 429 retry-after into
  cachedQuota.rate_limit.{limit_reached=true, reset_at}. Never shrinks
  an existing reset_at — next passive header collection corrects the
  bucket if the 429 was actually secondary.
- Drop "rate_limited" from AccountStatus enum. status is now a pure
  rotation marker; pool exclusion flows through hasReachedCachedQuota,
  which already checks all 3 buckets (primary / secondary / code_review).
- refreshStatus loses its rate-limit clearing branch (resetExpiredQuotaWindow
  already auto-clears limit_reached when reset_at passes).
- isAuthenticated and getPoolSummary consult cachedQuota too, so an
  all-exhausted pool no longer falsely reports authenticated.
- accounts.json one-shot migration: migrateLegacyRateLimit in
  loadPersisted converts legacy status="rate_limited" + rate_limit_until
  to status="active" + synthesized cachedQuota.rate_limit (only when
  the local lock is fresher than cachedQuota). Next persist drops the
  legacy field.
- Dashboard derivedStatus(account) helper renders "rate_limited" badge
  from cachedQuota.*.limit_reached, fixing the "已活跃 + 7d 已达上限"
  contradiction for accounts with secondary exhaustion.

Diagnostic findings folded into the changelog:
- Native rust transport upgraded "Stream error: {e}" to error_chain(&e)
  so hyper cause chains are visible. Confirmed the long-running
  "premature stream close" reports as
  "error decoding response body: unexpected EOF during chunk size line"
  — upstream cuts HTTP/1.1 chunked mid-chunk after ~120s of reasoning
  without output_text.
- UpstreamPrematureCloseError introduced (codex-event-extractor.ts) and
  collectCodexResponse throws it instead of EmptyResponseError when the
  stream never emits a terminal event. proxy-handler returns 504
  fail-fast rather than burning 3 accounts on cross-account retry.

Tests: full suite 1927 green; npx tsc --noEmit clean. Migration smoke
on live data confirmed miles.halloway's orphan lock cleared and
a1432350557's weekly-exhausted status now surfaces correctly in
dashboard.
Address review feedback on #492:

- web/src/components/AccountTable.tsx: drop the unused `derivedStatus`
  import. AssignmentAccount has no `quota` field so the helper would
  always pass through to `acct.status` — the import was dead. Replace
  with a comment explaining why the table renders raw status while
  AccountList/AccountCard receive full Account and do derive.

- Revert native/src/lib.rs + native/codex-tls.darwin-arm64.node to
  origin/dev. The rust error_chain upgrade is diagnostic-only (helps
  inspect hyper cause chains in stderr) and was rebuilt only for
  darwin-arm64. Shipping it without rebuilt win32/linux binaries
  would leave non-Mac operators on the old short error format —
  inconsistent across platforms. Ship the rust change separately
  with full per-platform binary rebuilds.

- CHANGELOG: trim the rust-specific callout from the
  UpstreamPrematureCloseError entry to match the slimmer scope of
  this PR (the user-visible 504 fail-fast and proxy-handler logic
  stay; the diagnostic upgrade is gone).

Tests + tsc green.
@icebear0828 icebear0828 merged commit 9891de6 into dev May 11, 2026
1 check passed
@icebear0828 icebear0828 deleted the fix/account-quota-single-truth branch May 11, 2026 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant