Skip to content

fix(collector): make trace failures self-explanatory in the JSON envelope#5

Merged
burrows99 merged 1 commit into
masterfrom
fix/collector-emit-feedback
Jun 19, 2026
Merged

fix(collector): make trace failures self-explanatory in the JSON envelope#5
burrows99 merged 1 commit into
masterfrom
fix/collector-emit-feedback

Conversation

@burrows99

Copy link
Copy Markdown
Owner

Why

Minted from a real debugging session: the trace itself worked, but every failure after it was lossy in exactly the channel an agent consumes (--json + exit code). The model spent ~7 minutes chasing a wrong cause. The root pattern: the CLI knew the answer but put it where the agent doesn't look (human-render lines, info logs, or one undifferentiated 400). Each fix moves the signal into the envelope.

The session's smoking gun: a real --emit failed with password authentication failed for user "trace" — a Postgres failure — returned as 400, the same code as a malformed envelope. The model probed with a junk payload, got a different 400 ("invalid envelope"), fused the two, and concluded --concise was trimming required fields (it isn't — see #3).

Changes

1. Collector route — separate fault domains, add machine codes, stop leaking internals (ui/app/v1/traces/route.ts)

  • Parse/validate failures → 4xx (caller's fault). Store/DB failure → 503 STORE_UNAVAILABLE, real cause logged server-side, not returned.
  • Every response carries a code from the shared vocabulary (INGEST / INGEST_INVALID / INGEST_NO_SESSION / STORE). No more string-matching; infra outage ≠ bad envelope.
  • Re-exports Code from trace-cli/server so the route reuses the one vocabulary instead of drifting literals.

2. Emit failures surface instead of vanishing (src/collector/Collector.ts, src/cli/Cli.ts)

  • Collector.emit returns a rich EmitResult {ok,status,body,error} and logs rejections at error with the reason — previously a 400 logged at info and the bare false was swallowed by .catch(() => false).
  • The run flushes pending emits before rendering and folds any rejection into the printed/--json envelope as an EMIT diagnostic. A run whose every emit 400'd no longer prints "success."

3. --concise no longer reads like it shapes the wire (src/cli/Cli.ts)

  • Help now states it trims the printed envelope only and does NOT affect --emit (the collector always receives the full trace.toJSON()). Closes the false lead the session chased.

4. 0-hit runs explain themselves in JSON (src/shared/codes.ts, src/cli/commands/DynamicCommand.ts)

  • New BP_BOUND_UNHIT diagnostic when a breakpoint binds but never fires — the JSON counterpart of the renderer's "no breakpoints hit" line. A --json reader can now tell "trigger didn't run" from "wrong line" without grepping server logs.

5. Port clarity (src/cli/commands/ServeCommand.ts, docker-compose.yml)

  • trace serve prints the exact --emit URL to copy (auto-discovery only probes the well-known ports). Fixed the stale :4747 compose comment (host port is 14747).

Tests

  • Extends test/output.test.js: asserts Code.BP_BOUND_UNHIT, and that Collector.emit resolves to a structured result (never throws, never a bare bool) on a refused POST.
  • CLI tsc --noEmit clean · UI tsc --noEmit clean · npm test41/41 pass (1 DB round-trip skipped).

Base

Branched off master (which already carries the 14747 host-port mappings — fix #5 corrects the lone stale :4747 compose comment they left behind).

🤖 Generated with Claude Code

…lope

When a trace run streamed to the dashboard, every failure mode was lossy in
exactly the channel an agent consumes (--json + exit code), so a real
debugging session spent minutes chasing the wrong cause. Each fix puts the
signal where the agent looks.

- Collector route: separate fault domains. A dead/misconfigured store now
  returns 503 STORE_UNAVAILABLE (internal error logged, not leaked) instead
  of collapsing into the same 400 as a malformed envelope. Every response
  carries a machine `code` from the shared vocabulary (INGEST / INGEST_INVALID
  / INGEST_NO_SESSION / STORE), so a caller never has to string-match. A
  Postgres auth failure is no longer indistinguishable from a schema error.

- Emit failures surface. Collector.emit returns a rich EmitResult
  ({ok,status,body,error}) and logs rejections at error with the reason,
  instead of logging a 400 at info and swallowing a bare false. The run folds
  any rejected emit into the printed/--json envelope as an EMIT diagnostic.

- 0-hit runs explain themselves. A bound-but-never-hit breakpoint now emits a
  BP_BOUND_UNHIT diagnostic into the envelope (the JSON counterpart of the
  renderer's "no breakpoints hit" line), so a --json reader can tell "trigger
  didn't run" from "wrong line" without grepping server logs.

- --concise help states it trims the PRINTED envelope only and does NOT affect
  --emit (the collector always gets the full envelope) — closing the false
  lead that toggling it changes the wire payload.

- serve prints the exact --emit URL to copy; fix the stale :4747 compose
  comment (host port is 14747).

Adds Code.BP_BOUND_UNHIT, re-exports Code from trace-cli/server for the route,
and covers the new emit-result shape + code in output.test.js. CLI + UI
typecheck clean; 41/41 tests pass (1 DB round-trip skipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 19, 2026 07:04
@burrows99 burrows99 merged commit ce8ffe1 into master Jun 19, 2026
2 checks passed
@burrows99 burrows99 deleted the fix/collector-emit-feedback branch June 19, 2026 07:07

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves end-to-end debuggability of --emit/collector failures and “0-hit” dynamic runs for --json consumers by introducing shared machine-readable codes and surfacing emit/ingest failure reasons in structured outputs instead of losing them in human logs.

Changes:

  • Adds collector ingest fault-domain separation (4xx envelope issues vs 503 store issues) and returns shared Code values in ingest responses.
  • Makes Collector.emit return a structured result and updates CLI flow to flush emits before rendering and emit an EMIT diagnostic on failures.
  • Introduces BP_BOUND_UNHIT to explain “breakpoints bound but never hit” in JSON diagnostics; updates serve output and compose comment for port clarity.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
ui/app/v1/traces/route.ts Adds structured ingest error codes and distinguishes client envelope errors from store outages (503).
src/collector/Collector.ts Changes emit() to return a rich result and logs HTTP rejections as errors with collector-provided details.
src/cli/Cli.ts Flushes pending emits before printing output and surfaces emit failures as diagnostics; clarifies --concise help text.
src/cli/commands/DynamicCommand.ts Adds BP_BOUND_UNHIT warning when breakpoints bind but no events fire.
src/shared/codes.ts Adds BP_BOUND_UNHIT and collector/emit-related codes to the shared code vocabulary.
src/server.ts Re-exports Code so the Next.js collector route can reuse the shared vocabulary.
test/output.test.js Extends tests to cover new code and Collector.emit structured failure behavior.
docker-compose.yml Fixes a stale compose comment to reference the correct exposed host port.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +40 to +42
const collectorUrl = `http://localhost:${port}`;
log.info("starting dashboard", { url: collectorUrl, host });
log.info("stream traces here", { hint: `trace dynamic … --emit ${collectorUrl} (or: export TRACE_COLLECTOR_URL=${collectorUrl})` });
Comment thread src/cli/Cli.ts
Comment on lines +124 to 128
const emitToCollector = collector
? (envelope: unknown) => { emitChain = emitChain.then(async () => { const result = await Collector.emit(collector, envelope); if (!result.ok) emitFailures.push(result); }); }
: undefined;

const { trace } = await this.#dynamic.run({
Comment on lines +37 to +41
// A rejection is a real failure — log it at error with the collector's reason (e.g. "invalid envelope",
// "trace store unavailable"), not at info. The caller folds this into the envelope's diagnostics.
const body = await response.text().catch(() => "");
log.error("emit rejected", { code: Code.EMIT, endpoint, status: response.status, body: body.slice(0, 500) });
return { ok: false, status: response.status, body };
burrows99 added a commit that referenced this pull request Jun 19, 2026
…ccurately (#6)

* fix(collector): bound emit failure memory and word network failures accurately

Follow-up to the emit-feedback diagnostics (#5), addressing two Copilot
review notes:

- Cli.ts kept every failed EmitResult in an array but only ever read the
  count and the last one. On a run that emits per onProgress, repeated
  failures grew that array unbounded. Replace it with a counter plus the
  most recent failure.
- The emit diagnostic said the collector "rejected" the emits even when
  the POST never reached it (connection refused/timeout/DNS — no HTTP
  status). Word an HTTP rejection and a delivery failure distinctly.
- Collector.emit returned the full rejection body, which a caller retains;
  a large error page could bloat memory. Cap the stored body at 10k chars
  (the log line still truncates to 500 independently).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(trace): make empty/aborted runs fail loudly instead of silently "running" (#7)

A chrome run that captured nothing and recorded nothing was reported as
success (ok:true, exit 0) and could sit on the dashboard's "running" badge
forever — the agent had no signal it was broken. Root cause: the failure
signals lived in stderr logs or dashboard-only state, never in the JSON
envelope the agent reads. This closes those gaps and the matching violations
of the logger's own two-channel contract (codes.ts: the stderr log and the
envelope diagnostic for one event should carry the SAME code).

- Terminal envelope on abort. If a run throws (attach failed, engine crashed,
  recording threw), DynamicCommand now emits a terminal envelope via onProgress
  — no `running` flag, ok:false, an ENGINE_FATAL diagnostic — so the collector
  resolves the session instead of leaving its initial "running" partial
  orphaned. Cli flushes that emit before exiting non-zero.

- Recording outcomes are now envelope diagnostics, not just stderr. RECORD_EMPTY
  (no frames → empty video) and RECORD (render/upload threw) were log-only, so
  "no video" was invisible to a --json reader. Both now push a warn diagnostic
  carrying the same code as the log.

- Upload failure no longer masquerades as a clean local save. S3ArtifactStore
  swallows failures and returns null, which #record could not tell apart from
  "no S3 configured". It now distinguishes the two and emits an UPLOAD
  diagnostic when a configured upload fails (link missing, local copy kept).

- ENGINE_FATAL is now logged as well as diagnosed, so it appears in both
  channels per the contract.

Adds test/dynamic-diagnostics.test.js (fake-tracer unit tests for the abort
terminal envelope, captured-fatal, and clean-empty cases). typecheck + build
clean; 44/45 tests pass (1 DB round-trip skipped).

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(collector): bound the rejection body at the stream, add emit-diagnostic tests

Addresses the two Copilot review notes on the PR:

- Collector.emit buffered the whole rejection body via response.text()
  before slicing, so an oversized error page still caused a transient
  memory spike and download latency even though the stored body was
  capped. Read only up to MAX_BODY_CHARS off the stream, then cancel it —
  bounding memory and latency at the source.
- Extract the emit-failure diagnostic wording into an exported
  emitFailureMessage() helper and cover it with focused tests: HTTP status
  → "rejected", no status (network) → "failed" (the regression guard so a
  POST that never landed isn't reported as a rejection), plus the
  no-body / missing-error / oversized-body edge cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(collector): flush the TextDecoder so a split multi-byte char isn't dropped

readCappedText decoded chunks with { stream: true } but never did a final
decode() flush. A response ending on a multi-byte UTF-8 sequence split
across the last chunk boundary would drop/garble that trailing character in
the stored body and logged reason. Flush after the read loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
burrows99 added a commit that referenced this pull request Jun 19, 2026
…#8)

Keep the prose docs in step with the behavior changes from the collector/trace
fixes (#5#7).

- README: the docker-compose section showed the container ports (:4747 ·
  :5432 · :9000) instead of the host-published ports the compose file actually
  maps. Correct to :14747 · :65432 · :19000 (console :19001), and the
  S3_ENDPOINT example to :19000, so the README matches docker-compose.yml. The
  native-serve example keeps :5432 (your own Postgres) — that one was right.

- skills/trace/SKILL.md: add a "Reading the result" section. The skill is
  deliberately minimal (the binary is the source of truth), but it gave zero
  guidance on interpreting a trace — the gap behind an agent treating an empty
  or perpetually-"running" trace as success. Names the three stable fields that
  decide whether a run did what was asked (ok · diagnostics[].code · meta.running)
  and the rule that ok:true + events:[] + a warn diagnostic means "ran fine, saw
  nothing — re-aim," not success. Field-level (no drift-prone code list).
burrows99 added a commit that referenced this pull request Jun 19, 2026
The DX-gaps analysis enumerated six ways the CLI misled an agent; all six
have since shipped. Keep the analysis verbatim as the rationale, but add a
status banner + table and a per-gap resolution note pointing to where each
was closed (and flagging the few sub-items still open: dashboard age-out of
stale "running" (#6b), done/upload decoupling (#6c), and lockfile-based port
auto-discovery (#5)).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants