fix(server): evict disk KV entry that fails prefill #253

Open
gmontana wants to merge 1 commit into antirez:main from gmontana:fix/evict-stale-kv-on-prefill-failure

Contributor

@gmontana gmontana commented May 25, 2026

Fixes the infinite prefill failure loop from #251.

When ds4-server is killed mid-session, the disk KV checkpoint survives
and passes all load-time checks (header, hash, token count) on restart.
But the restored Metal state can be unusable: prefill fails, the error
goes back to the client, the client retries, the same checkpoint gets
loaded again, and the same failure repeats. The loop runs forever until
you rm -rf the cache dir.

The file is fine structurally, so the loader can't catch this. The fix:
when prefill fails and the prefix came from a disk entry, unlink that
file and invalidate the session. Next request misses the deleted entry
and does a clean prefill. Same pattern as the existing corrupt-prefix
discard path in the KV loader.
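
A minimal sketch of the evict-on-failure idea, for illustration only. It is not the code in this PR: the `struct session` layout, its field names, and the function name `sketch_evict_failed_disk_entry` are all hypothetical, and the real `kv_cache_evict_failed_disk_entry` may take different arguments. The only real API used is POSIX `unlink()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>   /* unlink(), access() */

/* Hypothetical session state; the actual ds4-server structs are not
 * shown in this PR text. */
struct session {
    bool prefix_from_disk;      /* prefix was restored from a disk entry */
    char disk_entry_path[256];  /* path of that entry, "" if none */
    int  cached_tokens;         /* tokens the session treats as valid */
};

/* Called after a prefill failure. If the prefix came from disk, delete
 * the entry and reset the session so the next request is a cache miss.
 * Returns true if the session was invalidated. */
bool sketch_evict_failed_disk_entry(struct session *s) {
    assert(s != NULL);
    if (!s->prefix_from_disk || s->disk_entry_path[0] == '\0')
        return false;           /* nothing on disk to blame */

    if (unlink(s->disk_entry_path) != 0)
        perror("unlink");       /* report, but still invalidate below */
    else
        fprintf(stderr, "evicted disk KV entry %s after prefill failure\n",
                s->disk_entry_path);

    /* Invalidate the in-memory session as well, so nothing restored
     * from the bad entry is reused. */
    s->prefix_from_disk = false;
    s->disk_entry_path[0] = '\0';
    s->cached_tokens = 0;
    return true;
}
```

In this sketch the session is invalidated even if `unlink()` fails, so a retry within the same process does not reuse the restored state. Whether the PR makes the same choice is not stated here.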

One helper (kv_cache_evict_failed_disk_entry), two call sites in
generate_job(), +13/-1.

Tested by injecting a sync failure after a disk cache load. Confirmed
the eviction log fires, the file disappears, and the next request
recovers with a full prefill. The actual Metal Internal Error from the
report is hardware-specific and not reproducible in CI.
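
For anyone wanting to repeat the test, one way such a failure could be injected is sketched below. This is an assumption about the approach, not the hook actually used: the environment variable `SKETCH_INJECT_SYNC_FAIL` and both function names are made up for this example and do not exist in ds4-server.

```c
#include <stdbool.h>
#include <stdlib.h>   /* getenv() */

/* Hypothetical test hook: report whether a failure should be injected.
 * Only fires when the prefix was loaded from disk and the made-up
 * variable SKETCH_INJECT_SYNC_FAIL starts with '1'. */
bool sketch_should_inject_sync_failure(bool loaded_from_disk) {
    const char *v = getenv("SKETCH_INJECT_SYNC_FAIL");
    return loaded_from_disk && v != NULL && v[0] == '1';
}

/* Stand-in for a sync step: returns 0 on success, -1 on failure. */
int sketch_sync(bool loaded_from_disk) {
    if (sketch_should_inject_sync_failure(loaded_from_disk))
        return -1;   /* simulated failure on a restored checkpoint */
    return 0;        /* a real sync would run here */
}
```

With the variable set, a request that hits the disk cache fails once, which should trigger the eviction. The retry then misses the deleted entry, so `loaded_from_disk` is false and the sync succeeds.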

Closes #251

After an unclean shutdown the on-disk KV checkpoint can be intact
(header, hash, token count all valid) but leave Metal in a state
where prefill fails.  Since the file keeps passing load-time checks
it gets reloaded on every request, looping forever until the user
manually deletes the cache directory.

On prefill failure, if the prefix came from a disk entry, unlink it
and invalidate the session.  Next request gets a clean cache miss.

Closes antirez#251
Development

Successfully merging this pull request may close these issues.

BUG: Metal Internal Error + infinite retry loop on restart with stale KV cache