Summary
If a user message arrives while a streaming inference is between an emitted tool_use and its not-yet-returned tool_result, the framework appends the user text directly into messages and continues. The orphan tool_use is never paired. Every subsequent inference flush rebuilds a request that violates Anthropic's structural validator and is rejected with HTTP 400 (messages.N: `tool_use` ids were found without ...). The damage is persisted to the session's Chronicle messages log, so it survives host restarts and binary upgrades. The only current mitigation is to start a new session.
Severity
Latent foot-gun. Any agent that runs tools while the operator is active can hit this. A "stay quiet" agent (e.g. a conductor that does occasional fleet pokes) is the worst case because the user almost always types into it mid-tool. Once tripped, the session is permanently bricked for that agent — no recovery path inside the framework.
Concrete reproduction (from prod, post-mortem of a conhost VM)
Session 75da82b0 (conductor), inference 63 at 2026-05-22 10:22:24 returned 400 in 256 ms. Walking the request payload from the new llm-calls.jsonl logger:
msg[444] user "Okay; pinging - is everything good?"
msg[445] assistant tool_use fleet--status (toolu_01Dtd...)
msg[446] user tool_result for status ← paired
msg[447] assistant tool_use fleet--peek (toolu_01Tdd...) ← orphan
msg[448] user "I think they're all actually stopped now.
Let's see what happens when the clerk picks
up the loose threads and files new tickets."
Tool-call counts in that request: 189 tool_use vs. 188 tool_result — exactly one unpaired.
Inference 64 (next user nudge) had the same orphan at a shifted index (an autobio compression pass had dropped 17 unrelated messages between attempts but preserved the orphan), and was rejected again. User opened a new session shortly after.
The predecessor request at 10:21:37 (same store, 449 messages, 0 orphans) succeeded. The orphan was introduced by the user typing during the in-flight fleet--peek.
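The pairing check behind the counts above can be sketched as a small scan over a request payload. This is a minimal sketch, not the framework's code: the ChatMessage and ContentBlock shapes are assumptions modeled on Anthropic Messages API content blocks, findOrphanToolUses is a hypothetical helper, and the ids in the sample are placeholders. It applies the strict rule that a tool_use must be answered in the immediately following user message.

```typescript
// Assumed shapes, modeled on Anthropic Messages API content blocks.
// The framework's real types may differ.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };

interface ChatMessage {
  role: "user" | "assistant";
  content: string | ContentBlock[];
}

interface Orphan {
  index: number; // position of the assistant message in the payload
  id: string;
  name: string;
}

function blocksOf(msg: ChatMessage | undefined): ContentBlock[] {
  return msg !== undefined && Array.isArray(msg.content) ? msg.content : [];
}

// Hypothetical helper: report every tool_use whose id is not answered by a
// tool_result in the immediately following user message.
function findOrphanToolUses(messages: ChatMessage[]): Orphan[] {
  const orphans: Orphan[] = [];
  messages.forEach((msg, index) => {
    if (msg.role !== "assistant") return;
    const next = messages[index + 1];
    const answered = new Set<string>();
    if (next !== undefined && next.role === "user") {
      for (const block of blocksOf(next)) {
        if (block.type === "tool_result") answered.add(block.tool_use_id);
      }
    }
    for (const block of blocksOf(msg)) {
      if (block.type === "tool_use" && !answered.has(block.id)) {
        orphans.push({ index, id: block.id, name: block.name });
      }
    }
  });
  return orphans;
}

// Shortened stand-in for the failing tail; ids are placeholders.
const failingTail: ChatMessage[] = [
  { role: "user", content: "Okay; pinging - is everything good?" },
  { role: "assistant", content: [{ type: "tool_use", id: "toolu_A", name: "fleet--status", input: {} }] },
  { role: "user", content: [{ type: "tool_result", tool_use_id: "toolu_A", content: "ok" }] },
  { role: "assistant", content: [{ type: "tool_use", id: "toolu_B", name: "fleet--peek", input: {} }] },
  { role: "user", content: "I think they're all actually stopped now." },
];

console.log(findOrphanToolUses(failingTail));
```

Run against an llm-calls.jsonl entry, a scan like this would point at the assistant message carrying the unpaired fleet--peek call rather than only reporting a count mismatch.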
Why it happens
Reading src/framework.ts and src/agent.ts: driveStream() runs as a fire-and-forget background promise. While it is between emitting a tool_use and stream.provideToolResults() (which depends on dispatchToolCall().then() → ToolResultEvent → handleProcessEvent), an ExternalMessageEvent can land in the queue. The handler appends the user text to messages and proceeds. There is no check that the most recent assistant message has all its tool_use ids resolved, and no synthesis of placeholder tool_results for the unresolved ones.
Autobio compression downstream respects the orphan rather than repairing it.
Suggested fix (sketch)
Two layers, both small:
- Repair-on-write. In the path that appends a user text message to messages (i.e. the ExternalMessageEvent handler or equivalent), scan back to the most recent assistant message. For every tool_use block in it whose id has no matching tool_result in a subsequent user message, synthesize a stub tool_result (is_error: true, content something like "cancelled: user interjected before result returned") and append it as a user message before the new user text. The driveStream that was awaiting the real result needs to either be cancelled or have its pending tool dispatch invalidated; picking the cleanest approach here is the design call.
- Repair-on-read (defense in depth). In the request-build path, refuse to serialize any assistant message that contains an orphan tool_use. Either drop the trailing tool_use block or throw a clear engine-side error before the API does. This applies equally to any autobio rewrite path.
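A minimal sketch of the repair-on-write layer, assuming the message log is a plain array of Anthropic-style messages. The types, the appendUserText name, and the CANCELLED constant are all hypothetical; the real append path in src/framework.ts may look different, and cancelling or invalidating the in-flight driveStream is deliberately left out.

```typescript
// Assumed shapes, modeled on Anthropic Messages API content blocks.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };

interface ChatMessage {
  role: "user" | "assistant";
  content: string | ContentBlock[];
}

const CANCELLED = "cancelled: user interjected before result returned";

// Hypothetical stand-in for the append step of the ExternalMessageEvent handler.
// Before appending the user text, it stubs out any tool_use in the most recent
// assistant message that has no tool_result in a later message.
function appendUserText(messages: ChatMessage[], text: string): void {
  const lastAssistant = messages.map((m) => m.role).lastIndexOf("assistant");
  const assistant = messages[lastAssistant]; // undefined when no assistant message exists
  if (assistant !== undefined && typeof assistant.content !== "string") {
    // Collect ids already resolved by messages after the assistant turn.
    const resolved = new Set<string>();
    for (const later of messages.slice(lastAssistant + 1)) {
      if (typeof later.content === "string") continue;
      for (const block of later.content) {
        if (block.type === "tool_result") resolved.add(block.tool_use_id);
      }
    }
    // Synthesize one error stub per unresolved tool_use.
    const stubs: ContentBlock[] = [];
    for (const block of assistant.content) {
      if (block.type === "tool_use" && !resolved.has(block.id)) {
        stubs.push({ type: "tool_result", tool_use_id: block.id, content: CANCELLED, is_error: true });
      }
    }
    if (stubs.length > 0) messages.push({ role: "user", content: stubs });
  }
  messages.push({ role: "user", content: text });
}

// Placeholder id; mirrors the in-flight fleet--peek from the reproduction.
const demoLog: ChatMessage[] = [
  { role: "assistant", content: [{ type: "tool_use", id: "toolu_B", name: "fleet--peek", input: {} }] },
];
appendUserText(demoLog, "I think they're all actually stopped now.");
console.log(JSON.stringify(demoLog, null, 2));
```

Because the stub is written before the user text, a second interjection finds the id already resolved and appends only the new text. If the real tool result arrives later, it would need to be discarded rather than appended, which is part of the design call mentioned above.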
A one-shot repair script that walks existing damaged sessions and inserts the synthesized stubs would also help — there are already-bricked sessions in the wild.
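The core of such a repair script might look like the sketch below. It is an illustration under stated assumptions: the log is treated as an in-memory array of Anthropic-style messages, repairHistory is a hypothetical name, and reading from or writing back to the Chronicle store is not shown because that format is not described here.

```typescript
// Assumed shapes, modeled on Anthropic Messages API content blocks.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }
  | { type: "tool_result"; tool_use_id: string; content: string; is_error?: boolean };

interface ChatMessage {
  role: "user" | "assistant";
  content: string | ContentBlock[];
}

const STUB_TEXT = "cancelled: user interjected before result returned";

// Hypothetical repair pass: returns a new array in which every tool_use is
// answered in the immediately following user message, plus the stubbed ids.
function repairHistory(messages: ChatMessage[]): { repaired: ChatMessage[]; stubbed: string[] } {
  const repaired: ChatMessage[] = [];
  const stubbed: string[] = [];
  for (let i = 0; i < messages.length; i++) {
    const msg = messages[i];
    if (msg === undefined) continue;
    repaired.push(msg);
    if (msg.role !== "assistant" || typeof msg.content === "string") continue;

    // Block content of the next message, if it is a user message with blocks.
    const next = messages[i + 1];
    const nextBlocks =
      next !== undefined && next.role === "user" && typeof next.content !== "string"
        ? next.content
        : undefined;
    const answered = new Set<string>();
    for (const block of nextBlocks ?? []) {
      if (block.type === "tool_result") answered.add(block.tool_use_id);
    }

    const stubs: ContentBlock[] = [];
    for (const block of msg.content) {
      if (block.type === "tool_use" && !answered.has(block.id)) {
        stubs.push({ type: "tool_result", tool_use_id: block.id, content: STUB_TEXT, is_error: true });
        stubbed.push(block.id);
      }
    }
    if (stubs.length === 0) continue;

    if (nextBlocks !== undefined) {
      // Merge into the existing block message, stubs first, and skip the original.
      repaired.push({ role: "user", content: [...stubs, ...nextBlocks] });
      i++;
    } else {
      repaired.push({ role: "user", content: stubs });
    }
  }
  return { repaired, stubbed };
}

// Placeholder ids; same shape as the failing tail in the reproduction.
const damaged: ChatMessage[] = [
  { role: "user", content: "Okay; pinging - is everything good?" },
  { role: "assistant", content: [{ type: "tool_use", id: "toolu_A", name: "fleet--status", input: {} }] },
  { role: "user", content: [{ type: "tool_result", tool_use_id: "toolu_A", content: "ok" }] },
  { role: "assistant", content: [{ type: "tool_use", id: "toolu_B", name: "fleet--peek", input: {} }] },
  { role: "user", content: "I think they're all actually stopped now." },
];
console.log(repairHistory(damaged).stubbed);
```

The pass is written to be idempotent, so running it twice over the same session should insert nothing the second time. Returning the stubbed ids gives the script something to log per session.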
Diagnostic data
- Failing request payloads: llm-calls.jsonl entries [17] and [19] in the post-mortem VM dump.
- Chronicle inferences: framework/inference-log entries 63, 64 in session 75da82b0.
- Both errors: 400 invalid_request_error: messages.N: `tool_use` ids were found without ...
- Same orphan id across both: toolu_01TddAQ7dJdWyxWskNwMsSca (fleet--peek).