
Trace events have become load-bearing for control flow in runEphemeralToCompletion #33

@Anarchid


Background

framework.runEphemeralToCompletion subscribes to inference:completed, inference:turn_ended, and inference:exhausted via onTrace(...) and drives control flow from them: it resolves/rejects the caller's promise, deregisters the ephemeral agent from this.agents, and decides when the subagent's task is done.

This silently violates the trace-vs-process-event contract. Traces are observability — UI, logging, postmortems — and explicitly not the channel for framework business logic. Once the framework consumes its own traces for state transitions, the trace stream's timing and the state observable at emission time become part of an undocumented API surface.

How it bit us

PR #32 fixes a zombie-subagent bug surfaced in a production postmortem: a clerk agent's forked subagent timed out on the parent side, stayed registered in this.agents, kept receiving mcpl:channel-incoming fan-out for ~30 hours, and eventually duplicated a Zulip reply to an end user.

The proximate cause was an earlier refactor that moved emitTrace({type: 'inference:completed'}) ahead of agent.reset() in driveStream. On paper this was a pure observability reorder with no behavior change, except that runEphemeralToCompletion's trace listener does:

case 'inference:completed': {
  if (agent.state.status === 'idle') {
    cleanup();
    resolve({ speech, toolCallsCount });
  }
  break;
}

With reset() happening after the trace, agent.state.status was still 'streaming' at emit time. The check failed, cleanup never ran, and the agent stayed in this.agents indefinitely.
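The coupling can be reproduced in isolation. The following is a minimal sketch, not the framework's real code: MiniAgent and cleanupRan are hypothetical names, and the listener only mirrors the idle check quoted above.

```typescript
// Hypothetical, stripped-down model of the emit-order coupling.
type Status = 'streaming' | 'idle';

class MiniAgent {
  status: Status = 'streaming';
  reset(): void {
    this.status = 'idle';
  }
}

// Returns true if the ephemeral cleanup would have run for the given emit order.
function cleanupRan(emitBeforeReset: boolean): boolean {
  const agent = new MiniAgent();
  let cleaned = false;

  // Mirrors the listener in this issue: cleanup fires only if the agent is already idle.
  const listener = (type: string): void => {
    if (type === 'inference:completed' && agent.status === 'idle') {
      cleaned = true;
    }
  };

  if (emitBeforeReset) {
    listener('inference:completed'); // trace first: status is still 'streaming'
    agent.reset();
  } else {
    agent.reset(); // reset first: status is 'idle' when the trace fires
    listener('inference:completed');
  }
  return cleaned;
}

console.log(cleanupRan(false)); // true
console.log(cleanupRan(true)); // false
```

The only difference between the two calls is the order of two statements that look independent, which is why the refactor read as harmless in review.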

PR #32 restores correctness, but it cements a contract that was never advertised: the next "harmless" trace refactor (e.g. emitting earlier so the UI updates before a slow dispatchSpeech) will silently break ephemeral cleanup again. We just pinned the emit order without saying so.

Proposed fix

Promote ephemeral-stream completion off the trace bus.

Preferred: promise returned by the stream driver. driveStream (or a wrapper) returns Promise<{stopReason, speech, toolCallsCount, ...}> that resolves when the stream truly settles, sourced from the state machine. runEphemeralToCompletion awaits that promise: no trace listener, no agent-name filter, no state-at-emit check. The test surface collapses to "mock the promise", whereas current tests must choreograph trace events in the right order with the right state.
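A rough sketch of what the preferred shape could look like. Everything here is an assumption for illustration: MiniFramework, EphemeralAgent, and the injected driveStream are hypothetical, and only the result field names come from the proposal above.

```typescript
// Hypothetical shapes; stopReason/speech/toolCallsCount follow the proposal, the rest is assumed.
interface StreamResult {
  stopReason: string;
  speech: string;
  toolCallsCount: number;
}

interface EphemeralAgent {
  name: string;
}

type DriveStream = (agent: EphemeralAgent) => Promise<StreamResult>;

class MiniFramework {
  readonly agents = new Map<string, EphemeralAgent>();
  private readonly driveStream: DriveStream;

  // driveStream is injected only so the sketch is self-contained and mockable.
  constructor(driveStream: DriveStream) {
    this.driveStream = driveStream;
  }

  async runEphemeralToCompletion(agent: EphemeralAgent): Promise<StreamResult> {
    this.agents.set(agent.name, agent);
    try {
      // Completion comes from the driver's promise, not from a trace listener.
      return await this.driveStream(agent);
    } finally {
      // Deregistration runs on success and on failure, independent of emit order.
      this.agents.delete(agent.name);
    }
  }
}
```

With this shape a test can pass a driveStream that resolves or rejects directly and then assert that agents is empty, without emitting any traces.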

Acceptable: structured process event. Push process:agent-settled onto the process queue alongside other ProcessEvents, with {agentName, stopReason, ...}. Less elegant (the broadcast-then-filter-by-name pattern re-emerges) but it uses the explicit business-logic transport.
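For comparison, a sketch of the process-event variant. MiniProcessQueue and waitForSettled are hypothetical stand-ins for the real process queue, whose API is not shown in this issue; only the event name and the agentName/stopReason fields come from the text above.

```typescript
// Hypothetical event shape; only the event name and agentName/stopReason come from this issue.
interface AgentSettledEvent {
  type: 'process:agent-settled';
  agentName: string;
  stopReason: string;
}

type ProcessEvent = AgentSettledEvent | { type: 'process:other' };
type Handler = (event: ProcessEvent) => void;

// Stand-in for the real process queue: a synchronous broadcast to all subscribers.
class MiniProcessQueue {
  private handlers: Handler[] = [];

  subscribe(handler: Handler): () => void {
    this.handlers.push(handler);
    return () => {
      this.handlers = this.handlers.filter((h) => h !== handler);
    };
  }

  push(event: ProcessEvent): void {
    // Copy first so handlers can unsubscribe while being notified.
    for (const handler of [...this.handlers]) handler(event);
  }
}

// Resolves on the first settled event for the named agent, then stops listening.
function waitForSettled(queue: MiniProcessQueue, agentName: string): Promise<AgentSettledEvent> {
  return new Promise((resolve) => {
    const unsubscribe = queue.subscribe((event) => {
      // The filter-by-name step is the part the promise-based option avoids.
      if (event.type !== 'process:agent-settled' || event.agentName !== agentName) return;
      unsubscribe();
      resolve(event);
    });
  });
}
```

The name filter in waitForSettled is the cost mentioned above: every waiter sees every settled event and has to discard the ones that belong to other agents.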

Avoid: emitting a second trace dedicated to ephemeral cleanup. That just relocates the violation.

Related sites to audit

Grep this.onTrace( outside framework.ts's own logging/UI/CLI consumers. Anything in a control-flow path is a candidate:

  • errorPolicy.onInferenceError retry bookkeeping in driveStream reads attempt/requestId off the same flow.
  • Any module that reacts to inference:completed to do persistent bookkeeping (lessons-on-completion writes, eventgate state, usage tracking side effects).

Not all of these are necessarily wrong — observability can legitimately produce side effects (logging, metrics). The question is whether the framework's own correctness depends on the trace firing. Where it does, the trace listener should be replaced with a state-machine signal.

Acceptance criteria

  • runEphemeralToCompletion no longer subscribes to trace events
  • Subagent tests pass without choreographing trace emit order or state-at-emit
  • Document in framework README / CONTRIBUTING: traces are observability only; framework-internal control flow must not consume them
  • Optional: lint or guard test that fails if framework-internal files (outside trace.ts / framework.ts emit sites) call this.onTrace( for control-flow purposes
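
One possible shape for the optional guard, sketched as a pure function so it can run inside any test runner. findOnTraceViolations and the allowlist parameter are hypothetical. A plain text match cannot tell control flow from logging, so the allowlist is where the human review happens, and since runEphemeralToCompletion itself lives on the framework object, a per-file allowlist is likely too coarse for framework.ts and would need finer granularity in practice.

```typescript
// Hypothetical guard: reports every `this.onTrace(` call outside an explicit allowlist.
// It does not understand intent; allowlisted files are the reviewed exceptions.
function findOnTraceViolations(
  sources: Record<string, string>, // file name -> file contents, loaded by the caller
  allowedFiles: string[],
): string[] {
  const allowed = new Set(allowedFiles);
  const violations: string[] = [];

  for (const [file, text] of Object.entries(sources)) {
    if (allowed.has(file)) continue;
    text.split('\n').forEach((line, index) => {
      if (line.includes('this.onTrace(')) {
        violations.push(`${file}:${index + 1}`); // 1-based line number
      }
    });
  }
  return violations;
}

// Usage sketch with made-up file contents.
const report = findOnTraceViolations(
  {
    'trace.ts': 'this.onTrace(logToConsole);',
    'subagent.ts': 'const x = 1;\nthis.onTrace((e) => settle(e));',
  },
  ['trace.ts'],
);
console.log(report); // [ 'subagent.ts:2' ]
```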

Context

Discovered during the postmortem of a duplicate-Zulip-reply incident in a triumvirate fleet deployment. The zombie subagent burned roughly half of the run's input tokens (2.28M input / 21k output across 18 inferences) after its parent-side promise had already settled.
