Bundled CLI never reads prompt cache in multi-turn headless runs (direct anthropic SDK over the same endpoint caches correctly) #974

@raspin-home

Summary

In a headless claude-agent-sdk run (Python, custom system prompt, MCP
servers configured via .mcp.json), the bundled Claude Code CLI writes
prompt cache on every turn but never reads it — cache_read_input_tokens=0
across multi-turn investigations. On the same endpoint, same model, same
workspace credentials, a 30-line script using the plain anthropic Python
SDK with cache_control: ephemeral on the system block gets a healthy
cache read on the second call.

This localizes the problem to the CLI rather than the gateway or the API.
Because the cache is rewritten on every turn, multi-turn runs quickly
exceed per-minute input-token rate limits.

Environment

  • claude-agent-sdk versions tested: 0.1.81 (bundled CLI 2.1.139) and
    0.2.82 (bundled CLI 2.1.142) — both broken.
  • Python 3.12 inside a Debian-slim Docker image (Node.js 20 installed for
    the bundled CLI shebang).
  • Model: claude-sonnet-4-6.
  • Endpoint: the Claude on AWS first-party endpoint
    (https://aws-external-anthropic.<region>.api.aws) with an
    anthropic-workspace-id custom header. Same endpoint used by the healthy
    direct-SDK probe below.
  • MCP servers configured via .mcp.json (two HTTP-transport servers).
  • permission_mode="bypassPermissions", strict_mcp_config=True, custom
    system_prompt string (not the claude_code preset), cwd scoped to a
    read-only mount, allowed_tools / disallowed_tools enforced natively.

Symptom — bundled CLI (broken)

Representative log line from a 3-turn run (annotated):

ResultMessage received turns=3 cost_usd=0.187 stop_reason=stop_sequence
usage input=3 output=240 cache_read=0 cache_create=45781
agent completed with error(s)=[] api_error_status=429 subtype=success
  • cache_creation_input_tokens ≈ 14–15K per turn, accumulating to
    43–46K across 3 turns.
  • cache_read_input_tokens = 0 across every turn.
  • The third turn trips the workspace's per-minute input-token cap and the
    CLI exits with is_error=true, subtype=success, api_error_status=429.

The "ResultMessage…subtype=success api_error_status=429" combo is
reported via PR #923 — that part is working correctly. The cache miss
is the underlying issue.
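
To make the symptom easy to check from any usage payload, a small helper can
reduce the counters to a single cache hit ratio. This is a minimal sketch:
cache_hit_ratio is a hypothetical helper name, and it assumes the usage dict
carries the Messages API counters input_tokens, cache_creation_input_tokens
and cache_read_input_tokens. The sample values are the ones from the logs in
this report; input_tokens is not shown in the direct-SDK log, so it is left
out there.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Fraction of prompt tokens that were served from the prompt cache."""
    read = usage.get("cache_read_input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + created + uncached
    return read / total if total else 0.0


# Aggregated usage from the 3-turn bundled-CLI run above.
broken_run = {
    "input_tokens": 3,
    "cache_creation_input_tokens": 45781,
    "cache_read_input_tokens": 0,
}

# Second call of the direct-SDK probe (input_tokens not logged, so omitted).
healthy_call_2 = {
    "cache_creation_input_tokens": 4,
    "cache_read_input_tokens": 2722,
}

print("bundled CLI:", cache_hit_ratio(broken_run))
print("direct SDK :", round(cache_hit_ratio(healthy_call_2), 3))
```

A healthy multi-turn run should show a ratio close to 1 from the second call
onward; the bundled CLI run stays at 0.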

Symptom — direct anthropic SDK over the same endpoint (healthy)

[call 1] cache_creation_input_tokens=2722, cache_read_input_tokens=0
[call 2] cache_creation_input_tokens=4,    cache_read_input_tokens=2722

Same ANTHROPIC_BASE_URL, same anthropic-workspace-id header, same
claude-sonnet-4-6 model, same ~11K-char system block carrying
cache_control: {"type": "ephemeral"}.

Conclusion: the gateway / API / workspace / model are all fine. The
bundled CLI is shipping a payload that defeats the cache lookup despite
the prefix appearing identical across turns.

Minimal repro

Two pieces:

1. Healthy reference — direct SDK call, ~30 LOC

import os, time
from anthropic import Anthropic

SYSTEM = ("This is a static filler line used to grow the system prompt past "
          "the cache minimum block size. ") * 80

client = Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"],
    base_url=os.environ.get("ANTHROPIC_BASE_URL") or None,
    default_headers={"anthropic-workspace-id": os.environ["ANTHROPIC_WORKSPACE_ID"]}
        if os.environ.get("ANTHROPIC_WORKSPACE_ID") else None,
)

def call(label):
    r = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=64,
        system=[{"type": "text", "text": SYSTEM,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": "ping"}],
    )
    print(label, r.usage.model_dump())

call("call 1")
time.sleep(1)
call("call 2")
# call 2 shows cache_read_input_tokens > 0

2. Broken reference — same workspace, same endpoint, via claude-agent-sdk

from claude_agent_sdk import ClaudeAgentOptions, query
from claude_agent_sdk.types import ResultMessage
import asyncio, os

options = ClaudeAgentOptions(
    system_prompt="<a static ~1.5K-token system prompt>",
    model="claude-sonnet-4-6",
    mcp_servers="/tmp/.mcp.json",       # any two HTTP MCP servers
    strict_mcp_config=True,
    allowed_tools=[...],                # ~16K tokens of tool defs total
    permission_mode="bypassPermissions",
    cwd="/some/read-only/path",
    env={
        "ANTHROPIC_API_KEY": os.environ["ANTHROPIC_API_KEY"],
        "ANTHROPIC_BASE_URL": os.environ["ANTHROPIC_BASE_URL"],
        "ANTHROPIC_CUSTOM_HEADERS":
            f"anthropic-workspace-id: {os.environ['ANTHROPIC_WORKSPACE_ID']}",
    },
    max_turns=50,
)

async def main():
    async for msg in query(prompt="<a small ticket-shaped user message>",
                            options=options):
        if isinstance(msg, ResultMessage):
            print(msg.usage)  # observe cache_read=0 across turns

asyncio.run(main())

The model makes a couple of cheap tool calls, and
usage.cache_read_input_tokens is still 0 on the second and third turns.
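
For completeness, an .mcp.json like the one the repro points at could be
generated as follows. This is a sketch under assumptions: the server names
and URLs are placeholders (not the servers used in this report), the file is
written to the system temp directory under a hypothetical name, and the
mcpServers / "type": "http" layout is the project-scoped MCP config shape as
we understand it.

```python
import json
import tempfile
from pathlib import Path


def write_mcp_config(path: Path, servers: dict[str, str]) -> dict:
    """Write an MCP config declaring one HTTP-transport server per entry."""
    config = {
        "mcpServers": {
            # Sorted so the file content is identical from run to run.
            name: {"type": "http", "url": url}
            for name, url in sorted(servers.items())
        }
    }
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Placeholder servers; substitute any two reachable HTTP MCP servers.
config_path = Path(tempfile.gettempdir()) / "example.mcp.json"
config = write_mcp_config(
    config_path,
    {
        "tickets": "https://mcp.example.invalid/tickets",
        "logs": "https://mcp.example.invalid/logs",
    },
)
print(config_path.read_text(encoding="utf-8"))
```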

What we ruled out

  • SDK version regression in the v0.2.x line — pinned back to v0.1.81
    (bundled CLI 2.1.139); same broken behaviour.
  • Async MCP loading (v0.2.82's breaking change) — not the cause;
    v0.1.81 exhibits the same symptom.
  • Gateway / workspace / model — direct anthropic SDK over the same
    endpoint shows healthy cache reads.
  • MCP_CONNECTION_NONBLOCKING=0 — env var was set; no observable
    effect (we may have had the env var name wrong, but it didn't move the
    needle either way).

Hypotheses

Things the CLI may be doing differently from the direct SDK call, any of
which would defeat the prefix match:

  1. Per-turn dynamic injection into the system block — date, session
    token, tool registration metadata, or similar non-stable text appearing
    inside the cache_control-marked block.
  2. Tool-array re-ordering between turns once MCP server connections
    settle.
  3. cache_control breakpoint placement drifting between turn 1 and
    turn 2 (e.g., placed on different block indices once more conversation
    accumulates), so it doesn't match what was originally written.
  4. Related to #899 (subprocess CLI rejects list-form system_prompt,
    although the Anthropic API supports it): if the CLI is forced to
    flatten system to a string, the cache_control marker placement may
    not survive intact.

We can't observe which is true from outside the bundled binary. If anyone
on the SDK side can confirm what the CLI sends as system and tools
between turns when given a custom system_prompt plus MCP servers, that
would settle it quickly.
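
If the outbound request bodies can be captured, hypotheses 1 and 2 can be
told apart mechanically. The sketch below assumes two captured bodies in
Messages API shape (tools, system, messages) and reports the first segment
whose serialized form differs between turns. Cache lookup is a prefix match
over tools, then system, then messages, so the first divergent segment is
where a cache read would stop. The names first_divergence, digest and
strip_cache_control are hypothetical, the sample bodies are made up and are
not output from the bundled CLI, and cache_control markers are assumed not to
be part of the cached content, so breakpoint drift (hypothesis 3) would need
a separate check.

```python
import hashlib
import json


def strip_cache_control(value):
    """Drop cache_control markers (assumed not to be cached content)."""
    if isinstance(value, dict):
        return {k: strip_cache_control(v) for k, v in value.items()
                if k != "cache_control"}
    if isinstance(value, list):
        return [strip_cache_control(v) for v in value]
    return value


def digest(segment) -> str:
    """Hash a segment's JSON form; list order is kept because it matters."""
    blob = json.dumps(strip_cache_control(segment), sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def first_divergence(earlier: dict, later: dict):
    """Return the first segment where `later` stops extending `earlier`,
    or None if the earlier request is an intact prefix of the later one."""
    for key in ("tools", "system"):
        if digest(earlier.get(key)) != digest(later.get(key)):
            return key
    old = earlier.get("messages", [])
    new = later.get("messages", [])
    for i, message in enumerate(old):
        if i >= len(new) or digest(message) != digest(new[i]):
            return f"messages[{i}]"
    return None


# Made-up request bodies for illustration only.
base = {
    "tools": [{"name": "search"}, {"name": "fetch"}],
    "system": [{"type": "text", "text": "static prompt",
                "cache_control": {"type": "ephemeral"}}],
    "messages": [{"role": "user", "content": "ping"}],
}
followup = [{"role": "assistant", "content": "pong"},
            {"role": "user", "content": "again"}]
stable = {**base, "messages": base["messages"] + followup}
reordered = {**stable, "tools": list(reversed(base["tools"]))}
dated = {**stable,
         "system": [{"type": "text", "text": "static prompt 2025-01-01"}]}

print(first_divergence(base, stable))     # None: prefix intact
print(first_divergence(base, reordered))  # tools
print(first_divergence(base, dated))      # system
```

A result of "tools" or "system" on consecutive turns of the same run would
point at hypothesis 2 or 1 respectively.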

Impact

  • Multi-turn agentic runs against any workspace with a strict per-minute
    input-token cap will trip 429 within 2–3 turns even on small (<20K-token)
    prefixes.
  • Cost: every turn writes the full prefix to cache fresh, ~3× the expected
    amortised cost.
  • Severity is amplified for the Claude on AWS endpoint where ITPM
    defaults are lower than direct Anthropic accounts.
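
The cost and rate-limit bullets above can be made concrete with a little
arithmetic. This sketch assumes the published 5-minute cache multipliers
(writes at 1.25× and reads at 0.1× the base input price), a fixed-size prefix
on every turn, and a hypothetical 40K-token per-minute cap; the actual cap
for the affected workspace is not stated in this report. It also assumes that
cache writes count toward the cap and that all turns land within the same
minute. Both function names are hypothetical.

```python
CACHE_WRITE_MULT = 1.25  # 5-minute cache write, relative to base input price
CACHE_READ_MULT = 0.10   # cache read, relative to base input price


def prefix_cost_units(turns: int, cache_hits: bool) -> float:
    """Cost of sending one fixed prefix for `turns` turns, measured in
    units of one uncached pass over that prefix."""
    if turns <= 0:
        return 0.0
    if not cache_hits:
        # Every turn pays the write price again (the behaviour reported here).
        return turns * CACHE_WRITE_MULT
    # One write on the first turn, then reads on every later turn.
    return CACHE_WRITE_MULT + (turns - 1) * CACHE_READ_MULT


def first_turn_over_cap(tokens_per_turn: int, cap: int) -> int:
    """1-based index of the first turn at which cumulative cache writes
    within one minute exceed the cap."""
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    return cap // tokens_per_turn + 1


ratio = prefix_cost_units(3, cache_hits=False) / prefix_cost_units(3, cache_hits=True)
print(f"3-turn cost ratio, broken vs healthy: {ratio:.2f}")

# ~15K tokens written per turn against a hypothetical 40K ITPM cap.
print("first turn over cap:", first_turn_over_cap(15_000, 40_000))
```

Under these assumptions a 3-turn run costs about 2.6 times as much for the
prefix as it would with working cache reads, and the third turn is the first
to exceed the cap, which matches the shape of the failure described above.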

Asks

  1. Confirm whether dynamic per-call content is being injected into the
    system block in headless mode, and if so, document the env var or
    option to suppress it (analogous to exclude_dynamic_sections on the
    preset).
  2. If not, a way to log the bundled CLI's outbound API request body would
    be sufficient to diagnose from outside.

Happy to share the worker-side and probe scripts in their entirety if it
helps shorten triage.
