Infrastructure layer that makes local LLMs viable for coding agents by solving the KV cache invalidation problem.
Local AI Coding Agent Optimizer is a command‑line utility that sits between a coding agent and a locally‑run LLM, re‑using previously computed KV cache tensors when the prompt prefix has not changed. It enables private, low‑latency, cost‑free inference for multi‑turn coding workflows.
Example usage:
$ echo -e "def foo():\n0" | python -m optimizer store
eJzLVQ... (base64 blob) 5
The command stores the KV cache for the prompt and returns a base64‑encoded tensor together with the sequence length.
Developers want to run coding agents on local LLMs to avoid API costs and keep their code private. In practice, local models become unusably slow after a few turns: whenever the prompt prefix shifts, the KV cache is invalidated and the entire context has to be re‑prefilled. That re‑prefill cost is the technical gap that keeps local models from being practical for agentic coding workflows.
| Feature | Description |
|---|---|
| Prefix Hashing & Cache Lookup | Computes a SHA‑256 hash of the prompt prefix and checks a local SQLite database for a matching KV cache entry. |
| KV Cache Storage | On a cache miss for the prefix hash, runs the LLM on the full prompt, serializes the resulting KV tensor, and stores it keyed by the hash. |
| Incremental KV Cache Extension | Extends a cached KV tensor with new suffix tokens by running the LLM only on the suffix while feeding the cached KV as initial state. |
| CLI Pipe Interface | All subcommands read from stdin and write to stdout, allowing seamless integration with coding agents via simple pipes. |
| Persistent SQLite Cache | Cache entries are saved in a single local SQLite file, surviving restarts of the optimizer process. |
| Low‑latency Lookup | Retrieving a cached KV tensor takes under 50 ms on a typical CPU laptop, avoiding costly re‑prefilling. |
| Zero External Services | Requires only Python 3.11+, torch, and typer at runtime (pytest for the test suite); makes no network calls and needs no extra services. |
| Tokenizer Agnostic (MVP) | Uses a whitespace split for tokenization in the minimum viable product, but can be swapped for any tokenizer later. |
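The hashing and persistence rows above can be sketched with the standard library alone. This is a minimal illustration, not the code in `hashing.py` or `db.py`: the table name `kv_cache`, its column names, and the helper names `hash_prefix`, `open_cache`, `store_entry`, and `lookup_entry` are all assumptions made for the example.

```python
import hashlib
import sqlite3
from typing import Optional, Tuple


def hash_prefix(prefix: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded prompt prefix."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv_cache ("
        "prefix_hash TEXT PRIMARY KEY, "
        "kv_blob BLOB NOT NULL, "
        "seq_len INTEGER NOT NULL)"
    )
    return conn


def store_entry(conn: sqlite3.Connection, prefix_hash: str,
                kv_blob: bytes, seq_len: int) -> None:
    """Insert or overwrite the entry for this hash inside a transaction."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO kv_cache VALUES (?, ?, ?)",
            (prefix_hash, kv_blob, seq_len),
        )


def lookup_entry(conn: sqlite3.Connection,
                 prefix_hash: str) -> Optional[Tuple[bytes, int]]:
    """Return (kv_blob, seq_len) on a hit, or None on a miss."""
    row = conn.execute(
        "SELECT kv_blob, seq_len FROM kv_cache WHERE prefix_hash = ?",
        (prefix_hash,),
    ).fetchone()
    return (row[0], row[1]) if row else None
```

Passing a file path instead of `":memory:"` to `open_cache` would give the single-file persistence described in the table, since SQLite keeps the data on disk between runs.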
- Clone the repository:

    git clone https://github.com/m2ai-portfolio/local-ai-coding-agent-optimizer.git
    cd local-ai-coding-agent-optimizer

- Create a virtual environment and install dependencies (torch should already be installed with your local LLM):

    python -m venv venv
    source venv/bin/activate
    pip install typer pytest

- Test the store subcommand with a simple prompt:

    echo -e "def foo():\n0" | python -m optimizer store

  You should see a base64‑encoded blob and an integer sequence length printed to stdout.
Storing a new prompt prefix
$ echo -e "def foo():\n0" | python -m optimizer store
eJzLVQ... (base64 blob) 5
The optimizer runs the LLM on the full prompt, caches the KV tensor, and outputs the serialized blob together with the sequence length.
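To make the "serialized blob together with the sequence length" output concrete, here is a rough sketch of one way such a blob could be produced. It uses nested lists of floats as a stand-in for torch tensors so that it runs without torch, and it assumes a JSON + zlib + base64 encoding. The sample blob's `eJz` prefix is what base64 of zlib output commonly looks like, but the project does not state its actual format here, so treat the encoding and the names `serialize_kv` and `deserialize_kv` as assumptions.

```python
import base64
import json
import zlib
from typing import List, Tuple


def serialize_kv(kv: List[List[float]]) -> Tuple[str, int]:
    """Encode a toy KV cache (one row per token) as a base64 string.

    Returns the blob and the sequence length (number of token rows).
    """
    raw = json.dumps(kv).encode("utf-8")
    blob = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return blob, len(kv)


def deserialize_kv(blob: str) -> List[List[float]]:
    """Invert serialize_kv: base64-decode, decompress, parse JSON."""
    raw = zlib.decompress(base64.b64decode(blob))
    return json.loads(raw)


# Three token rows stand in for the per-token key/value state.
toy_kv = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
blob, seq_len = serialize_kv(toy_kv)
print(blob, seq_len)
```

With real tensors, the equivalent step would more likely write them to an in-memory buffer with `torch.save` before base64-encoding the bytes.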
Looking up an existing prefix
$ echo -e "def foo():\n0" | python -m optimizer lookup
eJzLVQ... (same base64 blob) 5
Because the prefix hash matches a cached entry, the stored tensor is returned instantly without re‑running the LLM.
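The hit/miss behaviour of `store` and `lookup` can be pictured as a single cache-aside function. The sketch below is illustrative only: it uses an in-memory dict instead of SQLite, a fake model callback instead of a real LLM, and the name `get_or_compute` is hypothetical. The CLI itself exposes `lookup` and `store` as separate subcommands.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Entry = Tuple[str, int]  # (serialized blob, sequence length)


def get_or_compute(cache: Dict[str, Entry], prompt: str,
                   run_llm: Callable[[str], Entry]) -> Tuple[Entry, bool]:
    """Return (entry, was_hit); call run_llm only when the hash is absent."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in cache:
        return cache[key], True
    entry = run_llm(prompt)  # the expensive prefill happens only here
    cache[key] = entry
    return entry, False


calls: List[str] = []


def fake_llm(prompt: str) -> Entry:
    """Stand-in for the model: records each call and returns a dummy entry."""
    calls.append(prompt)
    return ("blob-%d" % len(calls), len(prompt.split()))


cache: Dict[str, Entry] = {}
first, hit1 = get_or_compute(cache, "def foo():\n0", fake_llm)
second, hit2 = get_or_compute(cache, "def foo():\n0", fake_llm)
third, hit3 = get_or_compute(cache, "def bar():\n0", fake_llm)
```

The second call returns the stored entry without invoking the model, while the third misses because any change to the prefix text changes its hash.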
Extending a cached KV with new tokens
$ echo -e "eJzLVQ...\n5\n10 20 30" | python -m optimizer extend
eJzLVQ... (updated blob) 8
The command feeds the cached KV as past_key_values, processes the suffix tokens [10, 20, 30], and returns the updated KV tensor covering the full prefix+suffix.
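The idea behind `extend` is that a causal model's state for earlier tokens does not depend on later ones, so only the suffix needs to be processed. The toy model below is not a real LLM and does not use torch. Each token's "KV entry" is just its position and a running sum, and `toy_forward` is a hypothetical stand-in for a forward pass that accepts cached state. It shows that extending a cached prefix gives the same result as recomputing everything from scratch.

```python
from typing import List, Optional, Tuple

KVEntry = Tuple[int, int]  # (position, running sum of token ids): toy per-token state


def toy_forward(tokens: List[int],
                past: Optional[List[KVEntry]] = None) -> List[KVEntry]:
    """Process only `tokens`, continuing from `past` if it is given.

    The input cache is copied, so the caller's cached list is not mutated.
    """
    kv = list(past) if past else []
    running = kv[-1][1] if kv else 0
    for tok in tokens:
        running += tok
        kv.append((len(kv), running))
    return kv


prefix = [1, 2, 3, 4, 5]   # 5 cached tokens
suffix = [10, 20, 30]      # 3 new tokens

cached = toy_forward(prefix)                 # full prefill, done once
extended = toy_forward(suffix, past=cached)  # touches only the 3 suffix tokens
full = toy_forward(prefix + suffix)          # what a re-prefill would compute

print(len(extended), extended == full)
```

Here the extended cache covers 5 + 3 = 8 positions, mirroring the sequence length in the example output above, while only three tokens were processed in the second call.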
local-ai-coding-agent-optimizer/
├── optimizer/
│ ├── __init__.py
│ ├── cli.py # Typer entry point with lookup, store, extend subcommands
│ ├── db.py # SQLite wrapper for KV cache persistence
│ ├── hashing.py # Prefix hashing helpers (SHA‑256)
│ ├── kv_utils.py # Tensor serialize/deserialize and LLM forward hooks
│ ├── models.py # Dataclasses for KVCacheEntry and OptimizerIO
│ └── __main__.py # Allows `python -m optimizer` execution
├── tests/
│ ├── test_cli.py
│ ├── test_db.py
│ ├── test_hashing.py
│ └── test_kv_utils.py
├── assets/
│ └── infographic.png
├── .gitignore
├── LICENSE
└── README.md
| Technology | Purpose |
|---|---|
| Python 3.11+ | Core language and runtime |
| torch | Tensor handling and LLM inference (assumed pre‑installed) |
| sqlite3 (stdlib) | Persistent storage of KV cache entries |
| hashlib (stdlib) | SHA‑256 hashing of prompt prefixes |
| typer | Command‑line interface and subcommand parsing |
| pytest | Unit and integration testing |
Fork the repository, make your changes, run the test suite, and submit a pull request. Please follow the existing coding style and add tests for new functionality.
MIT
Matthew Snow -- M2AI | @m2ai-portfolio
