v0.4.0: cost optimization (caching, cheap-model offload, auto-compression)#5
Merged
Merged
Conversation
…sion) Stacking these levers reduces AI cost 50-75% on long-running projects. Protocol additions (sections 7.1, 7.2): - 7.1 Prompt caching as recommended convention. Memory files don't change between turns; cached reads pay ~10% of original cost on Anthropic / OpenAI APIs. Single biggest cost lever. - 7.2 Cheap-model offloading. Memory writes (entries, recaps, summaries) don't need frontier intelligence. Documents what can be offloaded and what (anti-drift) must stay on the main model. Optional companion scripts (opt-in, protocol stays portable): - scripts/memory_assistant.py — routes memory ops to any OpenAI-compatible chat completions endpoint (Groq, Together, Fireworks, OpenRouter, Ollama, Anthropic, OpenAI). Subcommands: recap (deterministic by default), decision-entry, drift-entry. Pure stdlib (urllib). - scripts/compress.py — auto-implements protocol section 7. Summarizes oldest entries into decisions-archive-<date>.md via cheap LLM, replaces with single archive entry. --dry-run / --keep / --threshold flags. README: - "Cost optimization" section with 5-lever stack, sample math, savings table. - Structure list updated. CLAUDE.md: prompt caching documented as Claude-specific cost note. Tests: 8 new tests (33 total). Mocked LLM calls; stdlib parts (deterministic recap, JSON extraction, dry-run logic) covered. Protocol version bumped 0.3.0 -> 0.4.0.
🧠 vibe-memory changes in this PR5 new decision(s):
Posted by the vibe-memory PR-comment workflow. Edit decisions/drift entries before merging if needed. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Stacking the five levers documented in this release can reduce AI cost 50-75% on long-running projects. Anti-drift stays on the frontier model (the one operation that needs real reasoning); everything else can be optimized.
Protocol additions
§7.1 Prompt caching (recommended convention)
Memory files don't change between turns of the same session. Marking the memory read as cacheable on Anthropic / OpenAI APIs makes the second message onwards pay ~10% of the original cost. Single biggest lever for direct API users.
§7.2 Cheap-model offloading
Memory writes (decision/drift entries, recaps, summaries) don't need frontier intelligence. Documents what's offloadable (anything that's templating + light judgment) and what must stay on the main model (anti-drift in §4, code generation).
Optional companion scripts
Both are opt-in — the protocol stays portable, files-only, no-infra by default. Both pure Python stdlib (urllib).
scripts/memory_assistant.pyRoutes memory operations to any OpenAI-compatible
/v1/chat/completionsendpoint. Works with Groq, Together, Fireworks, OpenRouter, Anthropic, OpenAI, and local Ollama.scripts/compress.pyImplements protocol §7 automatically. Summarizes the oldest decisions.jsonl entries into
decisions-archive-<date>.mdvia a cheap LLM, replaces them with a singlearchiveentry. Originals stay in git history.README
New "Cost optimization" section with the 5-lever stack, sample math (20-message session: $0.12 → $0.018 with caching alone), and a savings table.
Tests
8 new tests (33 total). Covers:
call_llmenv-var requirementLLM calls themselves are not unit-tested (would require either mocking the network or a live endpoint); the network boundary is exercised manually.
Validation matrix (all green locally)
memory_assistant.py recapworks deterministically with no env varscompress.py --dry-runcorrectly reports nothing-to-do below thresholdTest plan
validate.ymlandmemory-pr-comment.yml)v0.4.0and create the release with the[0.4.0]block fromCHANGELOG.mdNotes
The script's docstring contains the full env-var setup and sample endpoints. The protocol itself stays portable — these scripts are pure additive tooling for users who want to optimize cost.
Generated by Claude Code