Trace events have become load-bearing for control flow in runEphemeralToCompletion

## Background

`framework.runEphemeralToCompletion` subscribes to `inference:completed`, `inference:turn_ended`, and `inference:exhausted` via `onTrace(...)` and drives control flow from them: it resolves/rejects the caller's promise, deregisters the ephemeral agent from `this.agents`, and decides when the subagent's task is done.

This silently violates the trace-vs-process-event contract. Traces are observability — UI, logging, postmortems — and explicitly **not** the channel for framework business logic. Once the framework consumes its own traces for state transitions, the trace stream's *timing and the state observable at emission time* become part of an undocumented API surface.

## How it bit us

[PR #32](https://github.com/anima-research/agent-framework/pull/32) fixes a zombie-subagent bug surfaced in a production postmortem: a clerk agent's forked subagent timed out on the parent side, stayed registered in `this.agents`, kept receiving `mcpl:channel-incoming` fan-out for ~30 hours, and eventually duplicated a Zulip reply to an end user.

The proximate cause was an earlier refactor that moved `emitTrace({type: 'inference:completed'})` ahead of `agent.reset()` in `driveStream`. It looked like a pure observability reorder with no behavior change, except that `runEphemeralToCompletion`'s trace listener does:

```ts
case 'inference:completed': {
  if (agent.state.status === 'idle') {
    cleanup();
    resolve({ speech, toolCallsCount });
  }
  break;
}
```

With `reset()` happening *after* the trace, `agent.state.status` was still `'streaming'` at emit time. The check failed, so `cleanup()` never ran and the agent stayed in `this.agents` indefinitely.
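The ordering dependency can be reproduced in isolation. The sketch below is a toy model, not the framework's code: `MiniAgent`, `MiniTraceBus`, and `simulate` are hypothetical names, and the listener only mirrors the shape of the `status === 'idle'` check quoted above.

```typescript
type Status = 'idle' | 'streaming';
type TraceEvent = { type: 'inference:completed' };

// Hypothetical stand-in for an agent whose status flips to idle on reset().
class MiniAgent {
  status: Status = 'streaming';
  reset(): void {
    this.status = 'idle';
  }
}

// Hypothetical stand-in for the trace bus; listeners run synchronously on emit.
class MiniTraceBus {
  private listeners: Array<(event: TraceEvent) => void> = [];
  onTrace(listener: (event: TraceEvent) => void): void {
    this.listeners.push(listener);
  }
  emitTrace(event: TraceEvent): void {
    for (const listener of this.listeners) listener(event);
  }
}

// Returns whether the trace-driven cleanup ran for the given emit order.
function simulate(emitBeforeReset: boolean): boolean {
  const agent = new MiniAgent();
  const bus = new MiniTraceBus();
  let cleanedUp = false;

  bus.onTrace((event) => {
    // Same shape as the real listener: cleanup is gated on state at emit time.
    if (event.type === 'inference:completed' && agent.status === 'idle') {
      cleanedUp = true;
    }
  });

  if (emitBeforeReset) {
    bus.emitTrace({ type: 'inference:completed' }); // status is still 'streaming'
    agent.reset();
  } else {
    agent.reset();
    bus.emitTrace({ type: 'inference:completed' }); // status is already 'idle'
  }
  return cleanedUp;
}

console.log(simulate(false)); // true: reset first, cleanup runs
console.log(simulate(true)); // false: emit first, cleanup is skipped
```

Nothing in the emit site signals that swapping the two calls changes whether cleanup happens, which is why the reorder looked safe in review.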

PR #32 restores correctness, but it cements a contract that was never advertised: the next "harmless" trace refactor (e.g. emitting earlier so the UI updates before a slow `dispatchSpeech`) will silently break ephemeral cleanup again. We just pinned the emit order without saying so.

## Proposed fix

Promote ephemeral-stream completion **off** the trace bus.

**Preferred: promise returned by the stream driver.** `driveStream` (or a wrapper) returns `Promise<{stopReason, speech, toolCallsCount, ...}>` that resolves when the stream truly settles, sourced from the state machine. `runEphemeralToCompletion` awaits that promise; no trace listener, no agent-name filter, no state-at-emit check. The test surface collapses to "mock the promise", whereas current tests must choreograph trace events in the right order with the right state.
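A minimal sketch of the promise-based shape, under stated assumptions: `MiniFramework`, `StreamResult`, and the `DriveStream` signature are hypothetical, and the real driver's return type and registration logic may differ. The point is only that deregistration hangs off the promise settling rather than off a trace.

```typescript
// Hypothetical result shape, loosely following the fields named above.
interface StreamResult {
  stopReason: string;
  speech: string;
  toolCallsCount: number;
}

// Assumed signature: settles only once the stream has truly finished.
type DriveStream = (agentName: string) => Promise<StreamResult>;

class MiniFramework {
  // Stand-in for `this.agents`.
  readonly agents = new Map<string, { name: string }>();
  private readonly driveStream: DriveStream;

  constructor(driveStream: DriveStream) {
    this.driveStream = driveStream;
  }

  async runEphemeralToCompletion(agentName: string): Promise<StreamResult> {
    this.agents.set(agentName, { name: agentName });
    try {
      // No trace listener, no agent-name filter, no state-at-emit check.
      return await this.driveStream(agentName);
    } finally {
      // Runs on both resolve and reject, before the caller observes the result.
      this.agents.delete(agentName);
    }
  }
}

// Usage: a test only has to supply a mocked promise.
const framework = new MiniFramework(async () => ({
  stopReason: 'end_turn',
  speech: 'done',
  toolCallsCount: 0,
}));

framework.runEphemeralToCompletion('clerk-fork').then((result) => {
  console.log(result.stopReason, framework.agents.has('clerk-fork')); // end_turn false
});
```

This sketch does not cover a parent-side timeout. If the caller gives up early, the driver would still need some way to be aborted so that the `finally` block actually runs.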

**Acceptable: structured process event.** Push `process:agent-settled` onto the process queue alongside other `ProcessEvent`s, with `{agentName, stopReason, ...}`. This is less elegant, because the broadcast-then-filter-by-name pattern re-emerges, but it uses the transport that is explicitly meant for business logic.
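A rough sketch of the process-event variant, for comparison. The event union, `MiniProcessQueue`, and `watchEphemeral` are all hypothetical. The real `ProcessEvent` type and queue are not shown in this issue, and the real queue is presumably not dispatched synchronously as it is here.

```typescript
// Hypothetical event shapes for illustration only.
type ProcessEvent =
  | { type: 'process:agent-settled'; agentName: string; stopReason: string }
  | { type: 'process:tick' };

type Handler = (event: ProcessEvent) => void;

// Synchronous stand-in for the process queue.
class MiniProcessQueue {
  private handlers: Handler[] = [];

  subscribe(handler: Handler): () => void {
    this.handlers.push(handler);
    return () => {
      this.handlers = this.handlers.filter((h) => h !== handler);
    };
  }

  push(event: ProcessEvent): void {
    // Dispatch over a snapshot so unsubscribing mid-dispatch is safe.
    for (const handler of [...this.handlers]) handler(event);
  }
}

// Registers the agent and deregisters it when its settled event arrives.
function watchEphemeral(
  queue: MiniProcessQueue,
  agents: Set<string>,
  agentName: string,
  onSettled: (stopReason: string) => void,
): void {
  agents.add(agentName);
  const unsubscribe = queue.subscribe((event) => {
    if (event.type !== 'process:agent-settled') return;
    if (event.agentName !== agentName) return; // the name filter re-emerges here
    unsubscribe();
    agents.delete(agentName);
    onSettled(event.stopReason);
  });
}

const queue = new MiniProcessQueue();
const agents = new Set<string>();
const settled: string[] = [];
watchEphemeral(queue, agents, 'clerk-fork', (reason) => settled.push(reason));

queue.push({ type: 'process:agent-settled', agentName: 'other', stopReason: 'end_turn' });
console.log(agents.has('clerk-fork')); // true: event for another agent is ignored

queue.push({ type: 'process:agent-settled', agentName: 'clerk-fork', stopReason: 'end_turn' });
console.log(agents.has('clerk-fork'), settled); // false [ 'end_turn' ]
```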

**Avoid:** emitting a *second* trace dedicated to ephemeral cleanup. That just relocates the violation.

## Related sites to audit

Grep `this.onTrace(` outside `framework.ts`'s own logging/UI/CLI consumers. Anything in a control-flow path is a candidate:

- `errorPolicy.onInferenceError` retry bookkeeping in `driveStream` reads `attempt`/`requestId` off the same flow.
- Any module that reacts to `inference:completed` to do persistent bookkeeping (lessons-on-completion writes, eventgate state, usage tracking side effects).

Not all of these are necessarily wrong — observability *can* legitimately produce side effects (logging, metrics). The question is whether the framework's own correctness depends on the trace firing. Where it does, the trace listener should be replaced with a state-machine signal.

## Acceptance criteria

- [ ] `runEphemeralToCompletion` no longer subscribes to trace events
- [ ] Subagent tests pass without choreographing trace emit order or state-at-emit
- [ ] Document in framework README / CONTRIBUTING: traces are observability only; framework-internal control flow must not consume them
- [ ] Optional: lint or guard test that fails if framework-internal files (outside `trace.ts` / `framework.ts` emit sites) call `this.onTrace(` for control-flow purposes
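One possible shape for the optional guard, as a sketch only: `findTraceSubscribers` is a hypothetical helper and the file names in the usage are illustrative. It is a purely textual check, so it cannot distinguish logging from control flow, and a path-level allowlist is coarse (it would not catch a control-flow listener inside an allowlisted file). It flags call sites for human review rather than proving a violation.

```typescript
interface SourceFile {
  path: string;
  text: string;
}

// Returns "path:line" for every `this.onTrace(` call outside the allowlist.
function findTraceSubscribers(
  files: SourceFile[],
  allowlist: ReadonlySet<string>,
): string[] {
  const offenders: string[] = [];
  for (const file of files) {
    if (allowlist.has(file.path)) continue;
    file.text.split('\n').forEach((line, index) => {
      if (line.includes('this.onTrace(')) {
        offenders.push(`${file.path}:${index + 1}`);
      }
    });
  }
  return offenders;
}

// Usage with in-memory sources; a real guard test would read files from disk.
const offenders = findTraceSubscribers(
  [
    { path: 'framework.ts', text: 'this.onTrace(logToConsole);' },
    { path: 'subagent.ts', text: 'const x = 1;\nthis.onTrace(handleCompletion);' },
  ],
  new Set(['framework.ts']),
);
console.log(offenders); // [ 'subagent.ts:2' ]
```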

## Context

Discovered while postmortem-ing a duplicate-Zulip-reply incident in a triumvirate fleet deployment. The zombie subagent burned ~half the run's input tokens (2.28M input / 21k output across 18 inferences) after its parent-side promise had already settled.
