Skip to content

fix(auth): quarantine corrupt accounts.json instead of overwriting it#504

Merged
icebear0828 merged 2 commits into
devfrom
fix/account-load-failsafe
May 12, 2026
Merged

fix(auth): quarantine corrupt accounts.json instead of overwriting it#504
icebear0828 merged 2 commits into
devfrom
fix/account-load-failsafe

Conversation

@icebear0828
Copy link
Copy Markdown
Owner

Why

Reported in conversation: users updating from 2.0.71/2.0.72 → 2.0.73 on Windows had their accounts.json silently emptied (file content became { "accounts": [] }). Reading the code:

  1. loadPersisted (src/auth/account-persistence.ts) caught readFileSync / JSON.parse / non-array failures with a console.warn and returned { entries: [], needsPersist: false }.
  2. AccountPool built the registry from those empty entries.
  3. The next mutation (refresh-scheduler tick, dashboard action, release()) hit schedulePersist, and 1 s later persistence.save([]) overwrote the on-disk file with the empty map.

A single transient read failure — Windows AV scan locking the file, NSIS upgrade killing the process mid-renameSync, a half-written file from a prior crash — was enough to delete the entire account list. No error surfaced in the UI; users only noticed when the dashboard showed zero accounts.

What

Three coordinated changes prevent this from happening again:

  1. Quarantine the corrupt file. quarantineCorruptFile renames the unparseable accounts.json to accounts.json.corrupt-<ISO timestamp>.bak, preserving the original bytes for forensics. appendErrorLog({ source: "server", name: "AccountsFileLoadFailed", context: { reason, accountsFile, quarantined, backupPath, rawByteLength } }) so the Errors tab + header badge surface the failure on next dashboard load. Best-effort: if the rename fails, the file stays where it is and the next launch will hit the same code path — still better than silent overwrite. Missing file is not a failure (first-run keeps loadFailed=false).
  2. Circuit-breaker on persistence. AccountPersistence.load() returns a new optional loadFailed: boolean. AccountPool threads it into new AccountRegistry(..., { persistDisabled }). schedulePersist and persistNow short-circuit when disabled, so the empty in-memory map can never overwrite anything. In-memory CRUD still works for the running session — users can keep operating until they restore the file and restart.
  3. Dashboard banner. GET /auth/accounts response gains persistence_health: { ok, reason, message }. useAccounts reads it; AccountManagement renders an amber "Auto-save paused" banner pointing users at data/accounts.json.corrupt-*.bak. i18n keys persistDisabledTitle / persistDisabledBody for en + zh.

Recovery path for affected users

  1. Open the dashboard — the new banner shows up if accounts.json was corrupt at startup.
  2. Inspect data/accounts.json.corrupt-<timestamp>.bak (on Windows: %APPDATA%\Codex Proxy\data\). If it's intact JSON with accounts, rename it back to accounts.json. If it's a half-written file, restore from any other backup (Codex Desktop's ~/.codex/auth.json, an earlier RT-only export, etc.).
  3. Restart the app. Healthy load → banner clears, auto-save resumes.

Test plan

  • tests/unit/auth/account-persistence-load-failsafe.test.ts — 6 cases: malformed JSON, 0-byte file, wrong shape (accounts not an array), missing file, healthy file, repeat-failure timestamp uniqueness.
  • tests/unit/auth/account-registry-persist-disabled.test.ts — 6 cases: schedulePersist no-op, persistNow no-op, addAccount memory-only when disabled, pool.isPersistDisabled() reflects loadFailed, background mutations don't reach disk, healthy path regression.
  • tests/unit/routes/accounts-persistence-health.test.ts — 2 cases: GET /auth/accounts returns persistence_health: { ok: true } on healthy load, { ok: false, reason: "load_failed_quarantined", message: ... } on failed load.
  • Full suite: 1940 passing / 1 skipped / 0 failed.
  • tsc --noEmit clean.
  • Manual verification on the affected Windows machine (out of band — requires the user's environment).

Out of scope

  • Auto-recovery from .bak file (user must rename manually + restart). Keeping this explicit so we don't silently restore a stale or wrong-version backup.
  • Periodic backup snapshots of accounts.json. Reasonable future addition, but separate change.

… of overwriting

`loadPersisted` previously caught readFileSync/JSON.parse/Array.isArray failures
with a `console.warn` and returned `{ entries: [], needsPersist: false }`. The
empty result became the in-memory registry, and the next mutation (refresh
scheduler tick, dashboard CRUD, release()) fired `schedulePersist` which a
second later overwrote the on-disk file with `{ "accounts": [] }`. A single
transient read failure (Windows AV scan locking the file, NSIS upgrade killing
the process during the writeSync → renameSync window, an interrupted write
from a prior crash) was enough to silently wipe the account list — recently
hit by users updating from 2.0.71/2.0.72 to 2.0.73 on Windows.

Three coordinated changes prevent recurrence:

- `quarantineCorruptFile` renames the unparseable `accounts.json` to
  `accounts.json.corrupt-<ISO timestamp>.bak` so the original bytes survive
  for debugging, and `appendErrorLog` writes an `AccountsFileLoadFailed`
  entry (source=server) so the Errors tab + header badge surface the
  failure on next dashboard load. Best-effort rename — if it fails the
  file stays in place and the next launch hits the same code path, still
  better than silent data loss. Missing file is NOT a failure: first-run
  keeps `loadFailed=false`.

- `AccountPersistence.load()` returns a new `loadFailed?: boolean`.
  `AccountPool` threads it into `new AccountRegistry(..., { persistDisabled
  })`. `schedulePersist` and `persistNow` short-circuit when disabled, so
  the empty in-memory map can never overwrite the quarantined original
  (or anything else). In-memory CRUD still works for the running session,
  so users can keep operating until they restore the backup and restart.
  `pool.isPersistDisabled()` exposes the state.

- `GET /auth/accounts` response gains `persistence_health: { ok, reason,
  message }`. `useAccounts` reads it; `AccountManagement` renders an
  amber "Auto-save paused" banner pointing users at
  `data/accounts.json.corrupt-*.bak`. i18n keys `persistDisabledTitle` /
  `persistDisabledBody` for en + zh.

Tests:
- tests/unit/auth/account-persistence-load-failsafe.test.ts (6 cases) —
  malformed JSON, 0-byte, wrong shape, missing file, healthy file,
  repeat-failure timestamp uniqueness.
- tests/unit/auth/account-registry-persist-disabled.test.ts (6 cases) —
  schedulePersist no-op, persistNow no-op, addAccount memory-only,
  loadFailed → pool.isPersistDisabled(), background mutations don't reach
  disk, healthy path regression.
- tests/unit/routes/accounts-persistence-health.test.ts (2 cases) —
  GET /auth/accounts response field for both healthy and failed states.

Full suite 1940 passing, 1 skipped. tsc clean.
…oot log drop, health detail

Codex review of PR #504 flagged four high/medium issues. All fixed:

- HIGH: top-level JSON `null` (and other non-object payloads — numbers,
  strings, arrays) bypassed the shape guard and crashed on
  `data.accounts` access. `loadPersisted` now parses as `unknown` and
  rejects null / non-object / array before reaching `.accounts`. New
  tests cover `null`, `42`, `"oops"`, `[1,2,3]`.

- MEDIUM: a hypothetical concurrent `persistence.save()` could
  recreate `accounts.json` after quarantine renamed it aside, since
  the persistDisabled flag only guarded the AccountRegistry instance.
  `createFsPersistence` now holds a per-instance `quarantineActive`
  latch — once a load failure fires, the persistence object itself
  refuses future save() calls regardless of the caller path. Second
  line of defense; the registry-level gate remains the primary one.

- MEDIUM: `quarantineCorruptFile`'s `appendErrorLog` could silently
  drop the entry during very early boot (`getConfig()` throws before
  `loadConfig()` runs). The throw escaped from the entry-construction
  step (outer try wraps only the file write) and `quarantineCorruptFile`'s
  defensive catch then swallowed it — so the quarantine event vanished
  from `error-log.jsonl`. `readAppVersion` now catches the
  config-not-loaded throw and returns `"unknown"`. New
  early-boot-resilience test in `error-log.test.ts`.

- MEDIUM: when the quarantine rename itself fails (AV lock, ACL),
  the dashboard previously told the user to recover from a `.bak`
  that does not exist. `AccountPersistence.load()` now returns a
  structured `health: { quarantined, backupPath }`, `AccountPool`
  exposes `getPersistenceHealth()`, and `GET /auth/accounts` returns
  distinct `reason` codes: `load_failed_quarantined` (rename succeeded,
  show backup path) vs `load_failed_unquarantined` (rename failed,
  point user at original file). Three new route tests + one
  persistence-level test cover both branches and the legacy
  no-health fallback.

Tests: 1947 passing, 1 skipped. tsc clean.
@icebear0828 icebear0828 merged commit e7e16b7 into dev May 12, 2026
1 check passed
@icebear0828 icebear0828 deleted the fix/account-load-failsafe branch May 12, 2026 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant