Problem
While debugging agent memory issues (#41-#44), the tracer lacked critical context that would have sped up diagnosis significantly.
Current state
The tracer captures: message history (with previews), tool call args + returns, per-step timing, error flags, done event with model/provider info. This is a good foundation.
Missing context
1. Tool schemas not captured
We see the allowedTools names in the trace but not their JSON schemas. When diagnosing #42 (GPT-4o skipping the thinking kwarg), we couldn't tell from the trace alone whether thinking was actually marked required in the schema.
Proposed: Add a tool_schemas event on the first step with the full JSON schema for each tool.
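One possible shape for such an event, sketched in TypeScript (the event and field names here are illustrative, not the tracer's actual API):

```typescript
// Hypothetical tool_schemas trace event: emitted once, on the first step,
// carrying the full JSON Schema for every tool offered to the model.
interface ToolSchemaEvent {
  type: "tool_schemas";
  step: number;
  schemas: Record<string, object>; // tool name -> JSON Schema (including `required`)
}

function buildToolSchemasEvent(
  tools: { name: string; parameters: object }[],
): ToolSchemaEvent {
  const schemas: Record<string, object> = {};
  for (const tool of tools) schemas[tool.name] = tool.parameters;
  return { type: "tool_schemas", step: 0, schemas };
}
```

With this in the trace, a question like #42 ("was thinking actually marked required?") becomes a lookup in schemas rather than a code archaeology exercise.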
2. TTFT always null
ttftMs is null on every step across all traces examined. The streaming adapter likely isn't emitting the first-token timestamp.
Proposed: Fix TTFT capture in the streaming adapter.
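A minimal sketch of the fix, assuming the adapter exposes the response as an async iterable (the wrapper and callback names are hypothetical):

```typescript
// Wrap a streaming response so the elapsed time to the first chunk is
// reported exactly once, before the chunk is forwarded downstream.
async function* withTtft<T>(
  stream: AsyncIterable<T>,
  onTtft: (ms: number) => void,
): AsyncIterable<T> {
  const start = Date.now();
  let first = true;
  for await (const chunk of stream) {
    if (first) {
      onTtft(Date.now() - start); // first token observed: record TTFT
      first = false;
    }
    yield chunk;
  }
}
```

The key property is that the timestamp is taken when the first chunk arrives from the provider, not when the stream is constructed or fully drained, so ttftMs stops being null whenever the model streams at all.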
3. No memory state snapshot
There's no visibility into what the agent's core memory (persona, human blocks) looks like at the start of a turn. We had to infer that human memory was empty from recall_memory returning nothing.
Proposed: Add a memory_state event at turn start with a snapshot of core memory blocks.
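A sketch of what that event could look like (illustrative names; the chars count makes an empty human block visible at a glance instead of being inferred from recall_memory):

```typescript
// Hypothetical memory_state event: snapshot of core memory blocks at turn start.
interface MemoryStateEvent {
  type: "memory_state";
  turn: number;
  blocks: Record<string, { value: string; chars: number }>;
}

function snapshotCoreMemory(
  turn: number,
  blocks: Record<string, string>, // label (e.g. "persona", "human") -> contents
): MemoryStateEvent {
  const out: MemoryStateEvent["blocks"] = {};
  for (const [label, value] of Object.entries(blocks)) {
    out[label] = { value, chars: value.length };
  }
  return { type: "memory_state", turn, blocks: out };
}
```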
4. Tool success/failure not semantically tagged
save_to_memory returned isError: false with message "Could not promote note — not found or duplicate". The tracer has no way to distinguish tool-level success from logical failure, making automated failure detection impossible.
Proposed: Add a tool_succeeded field derived from response content, or require tools to return structured success/failure signals.
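If we go the derivation route, it could be as simple as a pattern check over the response text (the patterns below are illustrative; the structured-signal option is strictly more reliable):

```typescript
// Heuristic sketch: infer logical failure from the response message when the
// tool only reports transport-level success (isError: false).
const FAILURE_PATTERNS = [/could not/i, /not found/i, /duplicate/i, /\bfailed\b/i];

function deriveToolSucceeded(isError: boolean, message: string): boolean {
  if (isError) return false; // tool-level error always counts as failure
  return !FAILURE_PATTERNS.some((p) => p.test(message));
}
```

The save_to_memory case above is exactly what this catches: isError: false plus "Could not promote note — not found or duplicate" would yield tool_succeeded: false, which automated failure detection can key on.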
5. Reasoning not captured
reasoningCaptured: false on every step. We can't see why the model chose save_to_memory over update_human_memory three times in a row.
Proposed: Capture reasoning/thinking content when available from the model response.
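The capture itself is straightforward once the adapter surfaces the field; a sketch, assuming the provider response carries an optional reasoning string (field names are assumptions):

```typescript
// Sketch: record reasoning/thinking content when the provider exposes it,
// and keep reasoningCaptured accurate either way.
interface ModelResponse {
  content: string;
  reasoning?: string; // thinking blocks, if the provider returns them
}

function captureReasoning(
  resp: ModelResponse,
): { reasoningCaptured: boolean; reasoning?: string } {
  return resp.reasoning !== undefined
    ? { reasoningCaptured: true, reasoning: resp.reasoning }
    : { reasoningCaptured: false };
}
```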
6. No search internals for recall_memory
When recall_memory returns empty, we don't know which search paths were attempted (hybrid/embedding, keyword, episode scan) or whether any of them threw silent exceptions.
Proposed: Include search path details in the tool return or as a separate diagnostic event.
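A sketch of the per-path diagnostics, assuming recall_memory fans out over named search functions (the wrapper is hypothetical; the point is that exceptions are recorded instead of swallowed):

```typescript
type SearchFn = (query: string) => string[];

interface SearchAttempt {
  path: string;   // e.g. "hybrid", "keyword", "episode_scan"
  hits: number;
  error?: string; // populated when the path threw instead of returning
}

// Run every search path, collecting results and a diagnostic record per path.
function recallWithDiagnostics(
  query: string,
  paths: Record<string, SearchFn>,
): { results: string[]; attempts: SearchAttempt[] } {
  const attempts: SearchAttempt[] = [];
  const results: string[] = [];
  for (const [path, fn] of Object.entries(paths)) {
    try {
      const hits = fn(query);
      attempts.push({ path, hits: hits.length });
      results.push(...hits);
    } catch (e) {
      // A silently-failing path shows up in the trace instead of vanishing.
      attempts.push({ path, hits: 0, error: String(e) });
    }
  }
  return { results, attempts };
}
```

With attempts attached to the tool return (or emitted as a diagnostic event), an empty recall is immediately distinguishable from a broken search path.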
Related issues
#42 — GPT-4o skips required thinking kwarg in tool calls