fix(auth): quarantine corrupt accounts.json instead of overwriting it#504
Merged
Conversation
… of overwriting
`loadPersisted` previously caught readFileSync/JSON.parse/Array.isArray failures
with a `console.warn` and returned `{ entries: [], needsPersist: false }`. The
empty result became the in-memory registry, and the next mutation (refresh
scheduler tick, dashboard CRUD, release()) fired `schedulePersist` which a
second later overwrote the on-disk file with `{ "accounts": [] }`. A single
transient read failure (Windows AV scan locking the file, NSIS upgrade killing
the process during the writeSync → renameSync window, an interrupted write
from a prior crash) was enough to silently wipe the account list — recently
hit by users updating from 2.0.71/2.0.72 to 2.0.73 on Windows.
Three coordinated changes prevent recurrence:
- `quarantineCorruptFile` renames the unparseable `accounts.json` to
`accounts.json.corrupt-<ISO timestamp>.bak` so the original bytes survive
for debugging, and `appendErrorLog` writes an `AccountsFileLoadFailed`
entry (source=server) so the Errors tab + header badge surface the
failure on next dashboard load. Best-effort rename — if it fails the
file stays in place and the next launch hits the same code path, still
better than silent data loss. Missing file is NOT a failure: first-run
keeps `loadFailed=false`.
- `AccountPersistence.load()` returns a new `loadFailed?: boolean`.
`AccountPool` threads it into `new AccountRegistry(..., { persistDisabled
})`. `schedulePersist` and `persistNow` short-circuit when disabled, so
the empty in-memory map can never overwrite the quarantined original
(or anything else). In-memory CRUD still works for the running session,
so users can keep operating until they restore the backup and restart.
`pool.isPersistDisabled()` exposes the state.
- `GET /auth/accounts` response gains `persistence_health: { ok, reason,
message }`. `useAccounts` reads it; `AccountManagement` renders an
amber "Auto-save paused" banner pointing users at
`data/accounts.json.corrupt-*.bak`. i18n keys `persistDisabledTitle` /
`persistDisabledBody` for en + zh.
Tests:
- tests/unit/auth/account-persistence-load-failsafe.test.ts (6 cases) —
malformed JSON, 0-byte, wrong shape, missing file, healthy file,
repeat-failure timestamp uniqueness.
- tests/unit/auth/account-registry-persist-disabled.test.ts (6 cases) —
schedulePersist no-op, persistNow no-op, addAccount memory-only,
loadFailed → pool.isPersistDisabled(), background mutations don't reach
disk, healthy path regression.
- tests/unit/routes/accounts-persistence-health.test.ts (2 cases) —
GET /auth/accounts response field for both healthy and failed states.
Full suite 1940 passing, 1 skipped. tsc clean.
…oot log drop, health detail Codex review of PR #504 flagged four high/medium issues. All fixed: - HIGH: top-level JSON `null` (and other non-object payloads — numbers, strings, arrays) bypassed the shape guard and crashed on `data.accounts` access. `loadPersisted` now parses as `unknown` and rejects null / non-object / array before reaching `.accounts`. New tests cover `null`, `42`, `"oops"`, `[1,2,3]`. - MEDIUM: a hypothetical concurrent `persistence.save()` could recreate `accounts.json` after quarantine renamed it aside, since the persistDisabled flag only guarded the AccountRegistry instance. `createFsPersistence` now holds a per-instance `quarantineActive` latch — once a load failure fires, the persistence object itself refuses future save() calls regardless of the caller path. Second line of defense; the registry-level gate remains the primary one. - MEDIUM: `quarantineCorruptFile`'s `appendErrorLog` could silently drop the entry during very early boot (`getConfig()` throws before `loadConfig()` runs). The throw escaped from the entry-construction step (outer try wraps only the file write) and `quarantineCorruptFile`'s defensive catch then swallowed it — so the quarantine event vanished from `error-log.jsonl`. `readAppVersion` now catches the config-not-loaded throw and returns `"unknown"`. New early-boot-resilience test in `error-log.test.ts`. - MEDIUM: when the quarantine rename itself fails (AV lock, ACL), the dashboard previously told the user to recover from a `.bak` that does not exist. `AccountPersistence.load()` now returns a structured `health: { quarantined, backupPath }`, `AccountPool` exposes `getPersistenceHealth()`, and `GET /auth/accounts` returns distinct `reason` codes: `load_failed_quarantined` (rename succeeded, show backup path) vs `load_failed_unquarantined` (rename failed, point user at original file). Three new route tests + one persistence-level test cover both branches and the legacy no-health fallback. Tests: 1947 passing, 1 skipped. tsc clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Reported in conversation: users updating from 2.0.71/2.0.72 → 2.0.73 on Windows had their
accounts.jsonsilently emptied (file content became{ "accounts": [] }). Reading the code:loadPersisted(src/auth/account-persistence.ts) caughtreadFileSync/JSON.parse/ non-array failures with aconsole.warnand returned{ entries: [], needsPersist: false }.AccountPoolbuilt the registry from those empty entries.release()) hitschedulePersist, and 1 s laterpersistence.save([])overwrote the on-disk file with the empty map.A single transient read failure — Windows AV scan locking the file, NSIS upgrade killing the process mid-
renameSync, a half-written file from a prior crash — was enough to delete the entire account list. No error surfaced in the UI; users only noticed when the dashboard showed zero accounts.What
Three coordinated changes prevent this from happening again:
quarantineCorruptFilerenames the unparseableaccounts.jsontoaccounts.json.corrupt-<ISO timestamp>.bak, preserving the original bytes for forensics.appendErrorLog({ source: "server", name: "AccountsFileLoadFailed", context: { reason, accountsFile, quarantined, backupPath, rawByteLength } })so the Errors tab + header badge surface the failure on next dashboard load. Best-effort: if the rename fails, the file stays where it is and the next launch will hit the same code path — still better than silent overwrite. Missing file is not a failure (first-run keepsloadFailed=false).AccountPersistence.load()returns a new optionalloadFailed: boolean.AccountPoolthreads it intonew AccountRegistry(..., { persistDisabled }).schedulePersistandpersistNowshort-circuit when disabled, so the empty in-memory map can never overwrite anything. In-memory CRUD still works for the running session — users can keep operating until they restore the file and restart.GET /auth/accountsresponse gainspersistence_health: { ok, reason, message }.useAccountsreads it;AccountManagementrenders an amber "Auto-save paused" banner pointing users atdata/accounts.json.corrupt-*.bak. i18n keyspersistDisabledTitle/persistDisabledBodyfor en + zh.Recovery path for affected users
data/accounts.json.corrupt-<timestamp>.bak(on Windows:%APPDATA%\Codex Proxy\data\). If it's intact JSON with accounts, rename it back toaccounts.json. If it's a half-written file, restore from any other backup (Codex Desktop's~/.codex/auth.json, an earlier RT-only export, etc.).Test plan
tests/unit/auth/account-persistence-load-failsafe.test.ts— 6 cases: malformed JSON, 0-byte file, wrong shape (accountsnot an array), missing file, healthy file, repeat-failure timestamp uniqueness.tests/unit/auth/account-registry-persist-disabled.test.ts— 6 cases:schedulePersistno-op,persistNowno-op,addAccountmemory-only when disabled,pool.isPersistDisabled()reflectsloadFailed, background mutations don't reach disk, healthy path regression.tests/unit/routes/accounts-persistence-health.test.ts— 2 cases:GET /auth/accountsreturnspersistence_health: { ok: true }on healthy load,{ ok: false, reason: "load_failed_quarantined", message: ... }on failed load.tsc --noEmitclean.Out of scope
.bakfile (user must rename manually + restart). Keeping this explicit so we don't silently restore a stale or wrong-version backup.accounts.json. Reasonable future addition, but separate change.