Background
framework.runEphemeralToCompletion subscribes to inference:completed, inference:turn_ended, and inference:exhausted via onTrace(...) and drives control flow from them: it resolves/rejects the caller's promise, deregisters the ephemeral agent from this.agents, and decides when the subagent's task is done.
This silently violates the trace-vs-process-event contract. Traces are observability — UI, logging, postmortems — and explicitly not the channel for framework business logic. Once the framework consumes its own traces for state transitions, the trace stream's timing and the state observable at emission time become part of an undocumented API surface.
How it bit us
PR #32 fixes a zombie-subagent bug surfaced in a production postmortem: a clerk agent's forked subagent timed out on the parent side, stayed registered in this.agents, kept receiving mcpl:channel-incoming fan-out for ~30 hours, and eventually duplicated a Zulip reply to an end user.
The proximate cause was an earlier refactor that moved emitTrace({type: 'inference:completed'}) ahead of agent.reset() in driveStream. Pure observability reorder, no behavior change — except runEphemeralToCompletion's trace listener does:
    case 'inference:completed': {
      // Settles only if the agent is already 'idle' when the trace fires
      if (agent.state.status === 'idle') {
        cleanup();
        resolve({ speech, toolCallsCount });
      }
      break;
    }
With reset() happening after the trace, agent.state.status was still 'streaming' at emit time. The check failed, cleanup never ran, the agent stayed in this.agents indefinitely.
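The ordering dependency can be reproduced in isolation. The following is a minimal sketch, not the framework's code: MiniAgent, runOnce, and the local emitTrace are hypothetical stand-ins that only model a listener checking state at emit time.

```typescript
// Hypothetical stand-ins; none of these names are the framework's real API.
type Status = 'streaming' | 'idle';
type TraceListener = (type: string) => void;

class MiniAgent {
  status: Status = 'streaming';
  reset(): void {
    this.status = 'idle';
  }
}

// Runs one simulated stream completion and reports whether cleanup fired.
function runOnce(emitBeforeReset: boolean): boolean {
  const agent = new MiniAgent();
  const listeners: TraceListener[] = [];
  let cleanedUp = false;

  // Mirrors the state-at-emit check in the listener quoted above.
  listeners.push((type) => {
    if (type === 'inference:completed' && agent.status === 'idle') {
      cleanedUp = true;
    }
  });

  const emitTrace = (type: string): void => listeners.forEach((l) => l(type));

  if (emitBeforeReset) {
    emitTrace('inference:completed'); // status is still 'streaming' here
    agent.reset();
  } else {
    agent.reset();
    emitTrace('inference:completed'); // status is 'idle' here
  }
  return cleanedUp;
}

const resetFirst = runOnce(false); // cleanup fires
const emitFirst = runOnce(true); // cleanup is skipped
```

Nothing in the listener or the emitter changes between the two runs; only the relative order of the emit and the reset does, which is why the regression looked like a pure observability change.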
PR #32 restores correctness, but it cements a contract that was never advertised: the next "harmless" trace refactor (e.g. emitting earlier so the UI updates before a slow dispatchSpeech) will silently break ephemeral cleanup again. We just pinned the emit order without saying so.
Proposed fix
Promote ephemeral-stream completion off the trace bus.
Preferred — promise returned by the stream driver. driveStream (or a wrapper) returns Promise<{stopReason, speech, toolCallsCount, ...}> that resolves when the stream truly settles, sourced from the state machine. runEphemeralToCompletion awaits that promise; no trace listener, no agent-name filter, no state-at-emit check. Test surface collapses to "mock the promise"; current tests must choreograph trace events in the right order with the right state.
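A rough sketch of the preferred shape, under stated assumptions: StreamResult, EphemeralAgent, and MiniFramework are hypothetical, and driveStream is assumed to resolve only after the state machine has settled. The real driver signature may differ.

```typescript
// Hypothetical shapes; only stopReason, speech, and toolCallsCount
// come from the proposal text.
interface StreamResult {
  stopReason: string;
  speech: string;
  toolCallsCount: number;
}

interface EphemeralAgent {
  name: string;
  // Assumption: resolves once the stream has truly settled,
  // rejects if the stream fails.
  driveStream(): Promise<StreamResult>;
}

class MiniFramework {
  readonly agents = new Map<string, EphemeralAgent>();

  async runEphemeralToCompletion(agent: EphemeralAgent): Promise<StreamResult> {
    this.agents.set(agent.name, agent);
    try {
      // No trace listener, no agent-name filter, no state-at-emit check.
      return await agent.driveStream();
    } finally {
      // Deregistration runs on success and on failure,
      // independent of when any trace is emitted.
      this.agents.delete(agent.name);
    }
  }
}
```

With this shape a test only needs an agent whose driveStream returns a resolved or rejected promise; the try/finally keeps deregistration from depending on emit order.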
Acceptable — structured process event. Push process:agent-settled onto the process queue alongside other ProcessEvents, with {agentName, stopReason, ...}. Less elegant (broadcast-filter-by-name re-emerges) but uses the explicit business-logic transport.
Avoid: emitting a second trace dedicated to ephemeral cleanup. That just relocates the violation.
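For comparison, the process-event alternative might look roughly like the sketch below. ProcessQueue, its subscribe/push methods, and waitForSettled are hypothetical; only the process:agent-settled name and the agentName/stopReason fields come from the proposal.

```typescript
// Hypothetical event shape and queue, for illustration only.
interface AgentSettledEvent {
  type: 'process:agent-settled';
  agentName: string;
  stopReason: string;
}

type ProcessEvent = AgentSettledEvent | { type: 'process:other' };

class ProcessQueue {
  private handlers: Array<(e: ProcessEvent) => void> = [];

  subscribe(h: (e: ProcessEvent) => void): () => void {
    this.handlers.push(h);
    return () => {
      this.handlers = this.handlers.filter((x) => x !== h);
    };
  }

  push(e: ProcessEvent): void {
    // Iterate over a copy so a handler can unsubscribe during dispatch.
    for (const h of this.handlers.slice()) h(e);
  }
}

// Resolves on the first settled event for the named agent.
function waitForSettled(queue: ProcessQueue, agentName: string): Promise<AgentSettledEvent> {
  return new Promise((resolve) => {
    const unsubscribe = queue.subscribe((e) => {
      // The broadcast-filter-by-name step reappears here.
      if (e.type !== 'process:agent-settled' || e.agentName !== agentName) return;
      unsubscribe();
      resolve(e);
    });
  });
}
```

The name filter is the cost mentioned above: every waiter sees every settled event and must discard the ones that belong to other agents.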
Related sites to audit
Grep this.onTrace( outside framework.ts's own logging/UI/CLI consumers. Anything in a control-flow path is a candidate:
- errorPolicy.onInferenceError retry bookkeeping in driveStream reads attempt/requestId off the same flow.
- Any module that reacts to inference:completed to do persistent bookkeeping (lessons-on-completion writes, eventgate state, usage tracking side effects).
Not all of these are necessarily wrong — observability can legitimately produce side effects (logging, metrics). The question is whether the framework's own correctness depends on the trace firing. Where it does, the trace listener should be replaced with a state-machine signal.
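The grep step could be turned into a small check. The helper below is a hypothetical sketch: findOnTraceCallSites and its allowlist parameter are invented for illustration, and it only locates call sites. Deciding whether a hit sits in a control-flow path still needs a human read.

```typescript
interface Hit {
  file: string;
  line: number; // 1-based
  text: string;
}

// Scans in-memory source text for `this.onTrace(` call sites.
// `sources` maps a file name to its contents; `allowlist` names files
// whose trace consumers are known to be logging/UI/CLI only.
function findOnTraceCallSites(
  sources: Record<string, string>,
  allowlist: string[] = [],
): Hit[] {
  const hits: Hit[] = [];
  for (const file of Object.keys(sources)) {
    if (allowlist.indexOf(file) !== -1) continue;
    const lines = sources[file].split('\n');
    lines.forEach((text, i) => {
      if (text.indexOf('this.onTrace(') !== -1) {
        hits.push({ file, line: i + 1, text: text.trim() });
      }
    });
  }
  return hits;
}
```

Fed the repository's file contents from a test, a non-empty result could fail the build until each hit is either rewritten or added to the allowlist.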
Acceptance criteria
- runEphemeralToCompletion no longer subscribes to trace events
- [...] trace.ts/framework.ts emit sites) call this.onTrace( for control-flow purposes
Context
Discovered while postmortem-ing a duplicate-Zulip-reply incident in a triumvirate fleet deployment. The zombie subagent burned ~half the run's input tokens (2.28M input / 21k output across 18 inferences) after its parent-side promise had already settled.