Skip to content

v0.4.0: cost optimization (caching, cheap-model offload, auto-compression)#5

Merged
gregherbe76 merged 1 commit into
mainfrom
claude/v0.4-cost-optimization
May 20, 2026
Merged

v0.4.0: cost optimization (caching, cheap-model offload, auto-compression)#5
gregherbe76 merged 1 commit into
mainfrom
claude/v0.4-cost-optimization

Conversation

@gregherbe76

Copy link
Copy Markdown
Owner

Stacking the five levers documented in this release can reduce AI cost 50-75% on long-running projects. Anti-drift stays on the frontier model (the one operation that needs real reasoning); everything else can be optimized.

Protocol additions

§7.1 Prompt caching (recommended convention)

Memory files don't change between turns of the same session. Marking the memory read as cacheable on Anthropic / OpenAI APIs makes the second message onwards pay ~10% of the original cost. Single biggest lever for direct API users.

§7.2 Cheap-model offloading

Memory writes (decision/drift entries, recaps, summaries) don't need frontier intelligence. Documents what's offloadable (anything that's templating + light judgment) and what must stay on the main model (anti-drift in §4, code generation).

Optional companion scripts

Both are opt-in — the protocol stays portable, files-only, no-infra by default. Both pure Python stdlib (urllib).

scripts/memory_assistant.py

Routes memory operations to any OpenAI-compatible /v1/chat/completions endpoint. Works with Groq, Together, Fireworks, OpenRouter, Anthropic, OpenAI, and local Ollama.

export VIBEMEM_LLM_ENDPOINT=https://api.groq.com/openai/v1/chat/completions
export VIBEMEM_LLM_MODEL=llama-3.1-8b-instant
export VIBEMEM_LLM_API_KEY=...

python3 scripts/memory_assistant.py recap                # deterministic, no LLM call
python3 scripts/memory_assistant.py decision-entry "swapped Prisma for Drizzle, serverless cold starts"
python3 scripts/memory_assistant.py drift-entry "inline DB query in billing/page.tsx bypasses lib/db/"

scripts/compress.py

Implements protocol §7 automatically. Summarizes the oldest decisions.jsonl entries into decisions-archive-<date>.md via a cheap LLM, replaces them with a single archive entry. Originals stay in git history.

python3 scripts/compress.py --dry-run
python3 scripts/compress.py --keep 200 --threshold 400

README

New "Cost optimization" section with the 5-lever stack, sample math (20-message session: $0.12 → $0.018 with caching alone), and a savings table.

Tests

8 new tests (33 total). Covers:

  • Deterministic recap with full / empty memory
  • JSON extraction from noisy LLM responses (markdown fences, prose wrap)
  • call_llm env-var requirement
  • Compress dry-run / threshold logic

LLM calls themselves are not unit-tested (would require either mocking the network or a live endpoint); the network boundary is exercised manually.

Validation matrix (all green locally)

  • root memory/ — 40 decisions, 0 drifts
  • 33/33 tests pass
  • template/memory/ + 3 examples all validate
  • memory_assistant.py recap works deterministically with no env vars
  • compress.py --dry-run correctly reports nothing-to-do below threshold

Test plan

  • CI passes (validate.yml and memory-pr-comment.yml)
  • After merge: tag v0.4.0 and create the release with the [0.4.0] block from CHANGELOG.md

Notes

The script's docstring contains the full env-var setup and sample endpoints. The protocol itself stays portable — these scripts are pure additive tooling for users who want to optimize cost.


Generated by Claude Code

…sion)

Stacking these levers reduces AI cost 50-75% on long-running projects.

Protocol additions (sections 7.1, 7.2):
- 7.1 Prompt caching as recommended convention. Memory files don't
  change between turns; cached reads pay ~10% of original cost on
  Anthropic / OpenAI APIs. Single biggest cost lever.
- 7.2 Cheap-model offloading. Memory writes (entries, recaps,
  summaries) don't need frontier intelligence. Documents what can
  be offloaded and what (anti-drift) must stay on the main model.

Optional companion scripts (opt-in, protocol stays portable):
- scripts/memory_assistant.py — routes memory ops to any
  OpenAI-compatible chat completions endpoint (Groq, Together,
  Fireworks, OpenRouter, Ollama, Anthropic, OpenAI). Subcommands:
  recap (deterministic by default), decision-entry, drift-entry.
  Pure stdlib (urllib).
- scripts/compress.py — auto-implements protocol section 7.
  Summarizes oldest entries into decisions-archive-<date>.md via
  cheap LLM, replaces with single archive entry. --dry-run / --keep /
  --threshold flags.

README:
- "Cost optimization" section with 5-lever stack, sample math,
  savings table.
- Structure list updated.

CLAUDE.md: prompt caching documented as Claude-specific cost note.

Tests: 8 new tests (33 total). Mocked LLM calls; stdlib parts
(deterministic recap, JSON extraction, dry-run logic) covered.

Protocol version bumped 0.3.0 -> 0.4.0.
@github-actions

Copy link
Copy Markdown

🧠 vibe-memory changes in this PR

5 new decision(s):

  • convention [protocol] — added section 7.1 (prompt caching) and 7.2 (cheap-model offloading) to MEMORY_PROTOCOL.md
    • why: cost optimization is a top user concern; document the 2 highest-impact levers as protocol-level conventions
    • by: claude-code
  • dependency [tooling] — added scripts/memory_assistant.py — optional companion that routes memory writes (entries, recap) to any OpenAI-compatible LLM endpoint
    • why: enables 4-60x cost reduction on memory operations without breaking the no-infra default (script is opt-in)
    • by: claude-code
  • dependency [tooling] — added scripts/compress.py — auto-implements protocol section 7 via a cheap LLM
    • why: section 7 was manual; making it scriptable removes friction for long-running projects
    • by: claude-code
  • convention [docs] — added 'Cost optimization' README section with 5-lever stack and savings table
    • why: makes the economic argument explicit for founders/heavy users; positions vibe-memory as a cost reducer in addition to a continuity layer
    • by: claude-code
  • convention [versioning] — bumped protocol version 0.3.0 -> 0.4.0
    • why: sections 7.1 and 7.2 are additive protocol-level conventions
    • by: claude-code

Posted by the vibe-memory PR-comment workflow. Edit decisions/drift entries before merging if needed.

@gregherbe76 gregherbe76 merged commit fcd740c into main May 20, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants