Bundled CLI never reads prompt cache in multi-turn headless runs (direct anthropic SDK over the same endpoint caches correctly)

## Summary

In a headless `claude-agent-sdk` run (Python, custom system prompt, MCP
servers configured via `.mcp.json`), the bundled Claude Code CLI writes
prompt cache on every turn but never reads it — `cache_read_input_tokens=0`
across multi-turn investigations. On the same endpoint, same model, same
workspace credentials, a 30-line script using the plain `anthropic` Python
SDK with `cache_control: ephemeral` on the system block gets a healthy
cache read on the second call.

This makes the regression CLI-side rather than gateway- or API-side, and
the cache_create-on-every-turn behaviour quickly bursts past per-minute
input-token rate limits on multi-turn runs.

## Environment

- `claude-agent-sdk` versions tested: **0.1.81** (bundled CLI 2.1.139) and
  **0.2.82** (bundled CLI 2.1.142) — both broken.
- Python 3.12 inside a Debian-slim Docker image (Node.js 20 installed for
  the bundled CLI shebang).
- Model: `claude-sonnet-4-6`.
- Endpoint: the Claude on AWS first-party endpoint
  (`https://aws-external-anthropic.<region>.api.aws`) with an
  `anthropic-workspace-id` custom header. Same endpoint used by the healthy
  direct-SDK probe below.
- MCP servers configured via `.mcp.json` (two HTTP-transport servers).
- `permission_mode="bypassPermissions"`, `strict_mcp_config=True`, custom
  `system_prompt` string (not the `claude_code` preset), `cwd` scoped to a
  read-only mount, `allowed_tools` / `disallowed_tools` enforced natively.

## Symptom — bundled CLI (broken)

Representative log line from a 3-turn run (annotated):

```
ResultMessage received turns=3 cost_usd=0.187 stop_reason=stop_sequence
usage input=3 output=240 cache_read=0 cache_create=45781
agent completed with error(s)=[] api_error_status=429 subtype=success
```

- `cache_create_input_tokens` ≈ 14–15K **per turn**, accumulating to
  43–46K across 3 turns.
- `cache_read_input_tokens` = **0** across every turn.
- The third turn trips the workspace's per-minute input-token cap and the
  CLI exits with `is_error=true, subtype=success, api_error_status=429`.

The "ResultMessage…subtype=success api_error_status=429" combo is
reported via PR #923 — that part is working correctly. The cache miss
is the underlying issue.

## Symptom — direct `anthropic` SDK over the same endpoint (healthy)

```
[call 1] cache_creation_input_tokens=2722, cache_read_input_tokens=0
[call 2] cache_creation_input_tokens=4,    cache_read_input_tokens=2722
```

Same `ANTHROPIC_BASE_URL`, same `anthropic-workspace-id` header, same
`claude-sonnet-4-6` model, same ~11K-char system block carrying
`cache_control: {"type": "ephemeral"}`.

Conclusion: the gateway / API / workspace / model are all fine. The
bundled CLI is shipping a payload that defeats the cache lookup despite
the prefix appearing identical across turns.

## Minimal repro

Two pieces:

### 1. Healthy reference — direct SDK call, ~50 LOC

```python
import os, time
from anthropic import Anthropic

SYSTEM = ("This is a static filler line used to grow the system prompt past "
          "the cache minimum block size. ") * 80

client = Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    base_url=os.environ.get("ANTHROPIC_BASE_URL") or None,
    default_headers={"anthropic-workspace-id": os.environ["ANTHROPIC_WORKSPACE_ID"]}
        if os.environ.get("ANTHROPIC_WORKSPACE_ID") else None,
)

def call(label):
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=64,
        system=[{"type": "text", "text": SYSTEM,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "ping"}],
    )
    print(label, r.usage.model_dump())

call("call 1"); time.sleep(1); call("call 2")
# call 2 shows cache_read_input_tokens > 0
```

### 2. Broken reference — same workspace, same endpoint, via `claude-agent-sdk`

```python
from claude_agent_sdk import ClaudeAgentOptions, query
from claude_agent_sdk.types import ResultMessage
import asyncio, os

options = ClaudeAgentOptions(
    system_prompt="<a static ~1.5K-token system prompt>",
    model="claude-sonnet-4-6",
    mcp_servers="/tmp/.mcp.json",       # any two HTTP MCP servers
    strict_mcp_config=True,
    allowed_tools=[...],                # ~16K tokens of tool defs total
    permission_mode="bypassPermissions",
    cwd="/some/read-only/path",
    env={
        "ANTHROPIC_API_KEY": os.environ["ANTHROPIC_API_KEY"],
        "ANTHROPIC_BASE_URL": os.environ["ANTHROPIC_BASE_URL"],
        "ANTHROPIC_CUSTOM_HEADERS":
            f"anthropic-workspace-id: {os.environ['ANTHROPIC_WORKSPACE_ID']}",
    },
    max_turns=50,
)

async def main():
    async for msg in query(prompt="<a small ticket-shaped user message>",
                            options=options):
        if isinstance(msg, ResultMessage):
            print(msg.usage)  # observe cache_read=0 across turns

asyncio.run(main())
```

The model takes a couple of cheap tool calls and the second/third turn's
`usage.cache_read_input_tokens` is still 0.

## What we ruled out

- **SDK version regression in the v0.2.x line** — pinned back to v0.1.81
  (bundled CLI 2.1.139); same broken behaviour.
- **Async MCP loading (v0.2.82's breaking change)** — not the cause;
  v0.1.81 exhibits the same symptom.
- **Gateway / workspace / model** — direct `anthropic` SDK over the same
  endpoint shows healthy cache reads.
- **`MCP_CONNECTION_NONBLOCKING=0`** — env var was set; no observable
  effect (we may have had the env var name wrong, but it didn't move the
  needle either way).

## Hypotheses

Things the CLI may be doing differently from the direct SDK call, any of
which would defeat the prefix match:

1. **Per-turn dynamic injection into the system block** — date, session
   token, tool registration metadata, or similar non-stable text appearing
   inside the `cache_control`-marked block.
2. **Tool-array re-ordering** between turns once MCP server connections
   settle.
3. **`cache_control` breakpoint placement** drifting between turn 1 and
   turn 2 (e.g., placed on different block indices once more conversation
   accumulates), so it doesn't match what was originally written.
4. Related to #899 (`subprocess CLI rejects list-form system_prompt`) —
   if the CLI is forced to flatten `system` to a string, the
   `cache_control` marker placement may not survive intact.

We can't observe which is true from outside the bundled binary. If anyone
on the SDK side can confirm what the CLI sends as `system` and `tools`
between turns when given a custom `system_prompt` plus MCP servers, that
would settle it quickly.

## Impact

- Multi-turn agentic runs against any workspace with a strict per-minute
  input-token cap will trip 429 within 2–3 turns even on small (<20K-token)
  prefixes.
- Cost: every turn writes the full prefix to cache fresh, ~3× the expected
  amortised cost.
- Severity is amplified for the Claude on AWS endpoint where ITPM
  defaults are lower than direct Anthropic accounts.

## Asks

1. Confirm whether dynamic per-call content is being injected into the
   system block in headless mode, and if so, document the env var or
   option to suppress it (analogous to `exclude_dynamic_sections` on the
   preset).
2. If not, a way to log the bundled CLI's outbound API request body would
   be sufficient to diagnose from outside.

Happy to share the worker-side and probe scripts in their entirety if it
helps shorten triage.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bundled CLI never reads prompt cache in multi-turn headless runs (direct anthropic SDK over the same endpoint caches correctly) #974

Summary

Environment

Symptom — bundled CLI (broken)

Symptom — direct `anthropic` SDK over the same endpoint (healthy)

Minimal repro

1. Healthy reference — direct SDK call, ~50 LOC

2. Broken reference — same workspace, same endpoint, via `claude-agent-sdk`

What we ruled out

Hypotheses

Impact

Asks

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Bundled CLI never reads prompt cache in multi-turn headless runs (direct anthropic SDK over the same endpoint caches correctly) #974

Description

Summary

Environment

Symptom — bundled CLI (broken)

Symptom — direct anthropic SDK over the same endpoint (healthy)

Minimal repro

1. Healthy reference — direct SDK call, ~50 LOC

2. Broken reference — same workspace, same endpoint, via claude-agent-sdk

What we ruled out

Hypotheses

Impact

Asks

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Symptom — direct `anthropic` SDK over the same endpoint (healthy)

2. Broken reference — same workspace, same endpoint, via `claude-agent-sdk`