BUG: Metal Internal Error + infinite retry loop on restart with stale KV cache #251

@wAngByg

Summary

  • Problem: If ds4-server is killed while a long-context conversation has its KV cache persisted to disk, restarting the server against that stale cache makes the next request fail with "Metal graph embed tokens failed: Internal Error (00000001:Internal Error)". Instead of detecting the corrupted state and resetting, ds4 enters an infinite retry loop:
0525 21:04:12 ds4-server: chat ctx=81920..94031:12111 prompt start
0525 21:04:12 ds4-server: chat ctx=81920..94031:12111 prefill chunk 0/12111 (0.0%) chunk=0.00 t/s
ds4: Metal graph embed tokens failed: Internal Error (00000001:Internal Error)
0525 21:04:12 ds4-server: chat ctx=81920..94031:12111 prefill failed after stream closed: metal prefill failed
0525 21:04:12 ds4-server: chat ctx=81920..95221:13301 prompt start
(repeats indefinitely)

Steps to Reproduce

  1. Start ds4-server with --ctx 1000000 --kv-disk-dir <path>
  2. Begin a conversation with enough context to fill the KV cache (e.g. 80K+ tokens)
  3. Kill the ds4-server process (kill -9)
  4. Restart ds4-server without clearing the KV cache directory
  5. The next request will trigger Metal Internal Error and the infinite retry loop
  6. Server becomes unresponsive and must be killed

Workaround

Clearing the KV cache directory resolves the issue:

rm -rf <kv-cache-dir>/*

Environment

  • ds4 commit: b9305 (latest prebuilt binary)
  • Hardware: Apple M5 Max, 128 GiB unified memory
  • macOS 26.5

Suggested Fix

  1. Short-term: Add a retry limit. After 3 consecutive Metal prefill failures on the same context range, clear the context and report the error.
  2. Long-term: On startup, validate stale KV cache entries against the loaded model state and evict mismatched entries instead of failing and retrying indefinitely.
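The short-term fix could look roughly like the sketch below. It is written in Python for illustration only and is not ds4 code: PrefillRetryGuard, record_failure and record_success are hypothetical names. In the log above the end of the range grows between attempts (94031, then 95221) while the start stays at 81920, so this sketch counts failures per range start. That choice is an assumption about what "same context range" should mean.

```python
# Illustrative sketch only: none of these names exist in ds4.
MAX_PREFILL_FAILURES = 3  # threshold proposed in this issue


class PrefillRetryGuard:
    """Counts consecutive prefill failures per context-range start."""

    def __init__(self, max_failures: int = MAX_PREFILL_FAILURES) -> None:
        self.max_failures = max_failures
        self._failures: dict[int, int] = {}

    def record_failure(self, ctx_start: int) -> bool:
        """Record one failure; return True when the caller should stop
        retrying, clear the context, and report the error."""
        count = self._failures.get(ctx_start, 0) + 1
        self._failures[ctx_start] = count
        return count >= self.max_failures

    def record_success(self, ctx_start: int) -> None:
        """A successful prefill resets the counter for that range start."""
        self._failures.pop(ctx_start, None)


guard = PrefillRetryGuard()
# Three failures on the range starting at 81920, as in the log above.
decisions = [guard.record_failure(81920) for _ in range(3)]
print(decisions)  # [False, False, True]
```

Keying on the range start rather than the full range means a growing prompt cannot reset the counter, which is the pattern the log shows.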
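For the long-term fix, one possible shape of a startup check is sketched below in Python. Everything about the on-disk layout here is an assumption made for illustration: the ".kv" payload files, the ".meta" JSON sidecar, and its model_id, size and sha256 fields are hypothetical and do not describe ds4's real cache format. The idea is only that an entry left half-written by kill -9, or written for a different model, fails the check and is evicted before any request touches it.

```python
# Illustrative sketch only: the file layout and field names are hypothetical.
import hashlib
import json
from pathlib import Path


def entry_is_valid(payload: Path, model_id: str) -> bool:
    """Return True only if the sidecar matches the payload and the model."""
    meta_path = payload.with_name(payload.name + ".meta")  # hypothetical sidecar
    try:
        meta = json.loads(meta_path.read_text())
        # Reads the whole file; a real implementation would hash in chunks.
        data = payload.read_bytes()
    except (OSError, ValueError):
        return False  # missing, unreadable, or unparseable counts as invalid
    if not isinstance(meta, dict):
        return False
    return (
        meta.get("model_id") == model_id
        and meta.get("size") == len(data)
        and meta.get("sha256") == hashlib.sha256(data).hexdigest()
    )


def evict_invalid_entries(cache_dir: Path, model_id: str) -> list[str]:
    """Delete entries that fail validation; return the evicted file names."""
    evicted = []
    for payload in sorted(cache_dir.glob("*.kv")):
        if not entry_is_valid(payload, model_id):
            payload.unlink(missing_ok=True)
            payload.with_name(payload.name + ".meta").unlink(missing_ok=True)
            evicted.append(payload.name)
    return evicted
```

Treating a missing or unparseable sidecar as invalid errs on the side of discarding cache, which costs a re-prefill but avoids feeding a truncated entry to the GPU.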
