You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In a headless claude-agent-sdk run (Python, custom system prompt, MCP
servers configured via .mcp.json), the bundled Claude Code CLI writes
prompt cache on every turn but never reads it — cache_read_input_tokens=0
across multi-turn investigations. On the same endpoint, same model, same
workspace credentials, a 30-line script using the plain anthropic Python
SDK with cache_control: ephemeral on the system block gets a healthy
cache read on the second call.
This makes the regression CLI-side rather than gateway- or API-side, and
the cache_create-on-every-turn behaviour quickly bursts past per-minute
input-token rate limits on multi-turn runs.
Environment
claude-agent-sdk versions tested: 0.1.81 (bundled CLI 2.1.139) and 0.2.82 (bundled CLI 2.1.142) — both broken.
Python 3.12 inside a Debian-slim Docker image (Node.js 20 installed for
the bundled CLI shebang).
Model: claude-sonnet-4-6.
Endpoint: the Claude on AWS first-party endpoint
(https://aws-external-anthropic.<region>.api.aws) with an anthropic-workspace-id custom header. Same endpoint used by the healthy
direct-SDK probe below.
MCP servers configured via .mcp.json (two HTTP-transport servers).
permission_mode="bypassPermissions", strict_mcp_config=True, custom system_prompt string (not the claude_code preset), cwd scoped to a
read-only mount, allowed_tools / disallowed_tools enforced natively.
Symptom — bundled CLI (broken)
Representative log line from a 3-turn run (annotated):
ResultMessage received turns=3 cost_usd=0.187 stop_reason=stop_sequence
usage input=3 output=240 cache_read=0 cache_create=45781
agent completed with error(s)=[] api_error_status=429 subtype=success
cache_create_input_tokens ≈ 14–15K per turn, accumulating to
43–46K across 3 turns.
cache_read_input_tokens = 0 across every turn.
The third turn trips the workspace's per-minute input-token cap and the
CLI exits with is_error=true, subtype=success, api_error_status=429.
The "ResultMessage…subtype=success api_error_status=429" combo is
reported via PR #923 — that part is working correctly. The cache miss
is the underlying issue.
Symptom — direct anthropic SDK over the same endpoint (healthy)
Same ANTHROPIC_BASE_URL, same anthropic-workspace-id header, same claude-sonnet-4-6 model, same ~11K-char system block carrying cache_control: {"type": "ephemeral"}.
Conclusion: the gateway / API / workspace / model are all fine. The
bundled CLI is shipping a payload that defeats the cache lookup despite
the prefix appearing identical across turns.
Minimal repro
Two pieces:
1. Healthy reference — direct SDK call, ~50 LOC
importos, timefromanthropicimportAnthropicSYSTEM= ("This is a static filler line used to grow the system prompt past ""the cache minimum block size. ") *80client=Anthropic(
api_key=os.environ["ANTHROPIC_API_KEY"],
base_url=os.environ.get("ANTHROPIC_BASE_URL") orNone,
default_headers={"anthropic-workspace-id": os.environ["ANTHROPIC_WORKSPACE_ID"]}
ifos.environ.get("ANTHROPIC_WORKSPACE_ID") elseNone,
)
defcall(label):
r=client.messages.create(
model="claude-sonnet-4-6",
max_tokens=64,
system=[{"type": "text", "text": SYSTEM,
"cache_control": {"type": "ephemeral"}}],
messages=[{"role": "user", "content": "ping"}],
)
print(label, r.usage.model_dump())
call("call 1"); time.sleep(1); call("call 2")
# call 2 shows cache_read_input_tokens > 0
2. Broken reference — same workspace, same endpoint, via claude-agent-sdk
fromclaude_agent_sdkimportClaudeAgentOptions, queryfromclaude_agent_sdk.typesimportResultMessageimportasyncio, osoptions=ClaudeAgentOptions(
system_prompt="<a static ~1.5K-token system prompt>",
model="claude-sonnet-4-6",
mcp_servers="/tmp/.mcp.json", # any two HTTP MCP serversstrict_mcp_config=True,
allowed_tools=[...], # ~16K tokens of tool defs totalpermission_mode="bypassPermissions",
cwd="/some/read-only/path",
env={
"ANTHROPIC_API_KEY": os.environ["ANTHROPIC_API_KEY"],
"ANTHROPIC_BASE_URL": os.environ["ANTHROPIC_BASE_URL"],
"ANTHROPIC_CUSTOM_HEADERS":
f"anthropic-workspace-id: {os.environ['ANTHROPIC_WORKSPACE_ID']}",
},
max_turns=50,
)
asyncdefmain():
asyncformsginquery(prompt="<a small ticket-shaped user message>",
options=options):
ifisinstance(msg, ResultMessage):
print(msg.usage) # observe cache_read=0 across turnsasyncio.run(main())
The model takes a couple of cheap tool calls and the second/third turn's usage.cache_read_input_tokens is still 0.
What we ruled out
SDK version regression in the v0.2.x line — pinned back to v0.1.81
(bundled CLI 2.1.139); same broken behaviour.
Async MCP loading (v0.2.82's breaking change) — not the cause;
v0.1.81 exhibits the same symptom.
Gateway / workspace / model — direct anthropic SDK over the same
endpoint shows healthy cache reads.
MCP_CONNECTION_NONBLOCKING=0 — env var was set; no observable
effect (we may have had the env var name wrong, but it didn't move the
needle either way).
Hypotheses
Things the CLI may be doing differently from the direct SDK call, any of
which would defeat the prefix match:
Per-turn dynamic injection into the system block — date, session
token, tool registration metadata, or similar non-stable text appearing
inside the cache_control-marked block.
Tool-array re-ordering between turns once MCP server connections
settle.
cache_control breakpoint placement drifting between turn 1 and
turn 2 (e.g., placed on different block indices once more conversation
accumulates), so it doesn't match what was originally written.
We can't observe which is true from outside the bundled binary. If anyone
on the SDK side can confirm what the CLI sends as system and tools
between turns when given a custom system_prompt plus MCP servers, that
would settle it quickly.
Impact
Multi-turn agentic runs against any workspace with a strict per-minute
input-token cap will trip 429 within 2–3 turns even on small (<20K-token)
prefixes.
Cost: every turn writes the full prefix to cache fresh, ~3× the expected
amortised cost.
Severity is amplified for the Claude on AWS endpoint where ITPM
defaults are lower than direct Anthropic accounts.
Asks
Confirm whether dynamic per-call content is being injected into the
system block in headless mode, and if so, document the env var or
option to suppress it (analogous to exclude_dynamic_sections on the
preset).
If not, a way to log the bundled CLI's outbound API request body would
be sufficient to diagnose from outside.
Happy to share the worker-side and probe scripts in their entirety if it
helps shorten triage.
Summary
In a headless
claude-agent-sdkrun (Python, custom system prompt, MCPservers configured via
.mcp.json), the bundled Claude Code CLI writesprompt cache on every turn but never reads it —
cache_read_input_tokens=0across multi-turn investigations. On the same endpoint, same model, same
workspace credentials, a 30-line script using the plain
anthropicPythonSDK with
cache_control: ephemeralon the system block gets a healthycache read on the second call.
This makes the regression CLI-side rather than gateway- or API-side, and
the cache_create-on-every-turn behaviour quickly bursts past per-minute
input-token rate limits on multi-turn runs.
Environment
claude-agent-sdkversions tested: 0.1.81 (bundled CLI 2.1.139) and0.2.82 (bundled CLI 2.1.142) — both broken.
the bundled CLI shebang).
claude-sonnet-4-6.(
https://aws-external-anthropic.<region>.api.aws) with ananthropic-workspace-idcustom header. Same endpoint used by the healthydirect-SDK probe below.
.mcp.json(two HTTP-transport servers).permission_mode="bypassPermissions",strict_mcp_config=True, customsystem_promptstring (not theclaude_codepreset),cwdscoped to aread-only mount,
allowed_tools/disallowed_toolsenforced natively.Symptom — bundled CLI (broken)
Representative log line from a 3-turn run (annotated):
cache_create_input_tokens≈ 14–15K per turn, accumulating to43–46K across 3 turns.
cache_read_input_tokens= 0 across every turn.CLI exits with
is_error=true, subtype=success, api_error_status=429.The "ResultMessage…subtype=success api_error_status=429" combo is
reported via PR #923 — that part is working correctly. The cache miss
is the underlying issue.
Symptom — direct
anthropicSDK over the same endpoint (healthy)Same
ANTHROPIC_BASE_URL, sameanthropic-workspace-idheader, sameclaude-sonnet-4-6model, same ~11K-char system block carryingcache_control: {"type": "ephemeral"}.Conclusion: the gateway / API / workspace / model are all fine. The
bundled CLI is shipping a payload that defeats the cache lookup despite
the prefix appearing identical across turns.
Minimal repro
Two pieces:
1. Healthy reference — direct SDK call, ~50 LOC
2. Broken reference — same workspace, same endpoint, via
claude-agent-sdkThe model takes a couple of cheap tool calls and the second/third turn's
usage.cache_read_input_tokensis still 0.What we ruled out
(bundled CLI 2.1.139); same broken behaviour.
v0.1.81 exhibits the same symptom.
anthropicSDK over the sameendpoint shows healthy cache reads.
MCP_CONNECTION_NONBLOCKING=0— env var was set; no observableeffect (we may have had the env var name wrong, but it didn't move the
needle either way).
Hypotheses
Things the CLI may be doing differently from the direct SDK call, any of
which would defeat the prefix match:
token, tool registration metadata, or similar non-stable text appearing
inside the
cache_control-marked block.settle.
cache_controlbreakpoint placement drifting between turn 1 andturn 2 (e.g., placed on different block indices once more conversation
accumulates), so it doesn't match what was originally written.
subprocess CLI rejects list-form system_prompt) —if the CLI is forced to flatten
systemto a string, thecache_controlmarker placement may not survive intact.We can't observe which is true from outside the bundled binary. If anyone
on the SDK side can confirm what the CLI sends as
systemandtoolsbetween turns when given a custom
system_promptplus MCP servers, thatwould settle it quickly.
Impact
input-token cap will trip 429 within 2–3 turns even on small (<20K-token)
prefixes.
amortised cost.
defaults are lower than direct Anthropic accounts.
Asks
system block in headless mode, and if so, document the env var or
option to suppress it (analogous to
exclude_dynamic_sectionson thepreset).
be sufficient to diagnose from outside.
Happy to share the worker-side and probe scripts in their entirety if it
helps shorten triage.