GitHub - m2ai-portfolio/local-ai-coding-agent-optimizer: Enables fast, private, zero‑cost coding agents by reusing KV cache prefixes so local LLMs stay responsive across multi‑turn prompts.

Infrastructure layer that makes local LLMs viable for coding agents by solving the KV cache invalidation problem.

Quick Start • Features • Examples • Contributing

What is this?

Local AI Coding Agent Optimizer is a command‑line utility that sits between a coding agent and a locally‑run LLM, re‑using previously computed KV cache tensors when the prompt prefix has not changed. It enables private, low‑latency, cost‑free inference for multi‑turn coding workflows.

Example usage:

$ echo -e "def foo():\n0" | python -m optimizer store
eJzLVQ... (base64 blob) 5

The command stores the KV cache for the prompt and returns a base64‑encoded tensor together with the sequence length.

Problem

Developers want to run coding agents on local LLMs to avoid API costs and keep their code private, but local models slow to a crawl after a few turns. When the prompt prefix shifts between turns, the existing KV cache no longer matches, so the entire context has to be re‑prefilled. That repeated prefill cost is the gap that keeps local models from being practical for agentic coding workflows.
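
To make the invalidation concrete, here is a minimal sketch of how much of a cached prompt stays reusable between turns. The token lists and the timestamp-injection scenario are hypothetical illustrations, not taken from this project; the only assumption is that a KV cache is valid up to the longest common prefix of the old and new token sequences.

```python
def common_prefix_len(old_tokens, new_tokens):
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# Turn 2 appends to turn 1, so the whole of turn 1 is a reusable prefix.
turn1 = "SYSTEM you are a coding agent USER write foo".split()
turn2 = turn1 + "ASSISTANT ok USER add tests".split()

# Hypothetical agent that prepends a timestamp: every position shifts.
shifted = ["[12:01]"] + turn2

reusable = common_prefix_len(turn1, turn2)
reusable_shifted = common_prefix_len(turn1, shifted)

print(f"append-only turn: reuse {reusable}, prefill {len(turn2) - reusable}")
print(f"shifted prefix:   reuse {reusable_shifted}, prefill {len(shifted)}")
```

In the append-only case only the new tail needs prefilling; once anything changes at the front of the prompt, nothing is reusable and the full context is processed again.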

Features

Feature	Description
Prefix Hashing & Cache Lookup	Computes a SHA‑256 hash of the prompt prefix and checks a local SQLite database for a matching KV cache entry.
KV Cache Storage	On a cache miss, runs the LLM on the full prompt, serializes the resulting KV tensor, and stores it keyed by the prefix hash.
Incremental KV Cache Extension	Extends a cached KV tensor with new suffix tokens by running the LLM only on the suffix while feeding the cached KV as initial state.
CLI Pipe Interface	All subcommands read from stdin and write to stdout, allowing seamless integration with coding agents via simple pipes.
Persistent SQLite Cache	Cache entries are saved in a single local SQLite file, surviving restarts of the optimizer process.
Low‑latency Lookup	Retrieving a cached KV tensor takes under 50 ms on a typical CPU laptop, avoiding costly re‑prefilling.
Zero External Services	Needs only Python 3.11+, torch, typer, and pytest; makes no network calls and requires no additional services.
Tokenizer Agnostic (MVP)	Uses a whitespace split for tokenization in the minimum viable product, but can be swapped for any tokenizer later.
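
The hashing and lookup rows above can be sketched with the standard library alone. This is an illustrative sketch, not the project's actual `db.py` or `hashing.py`: the table name `kv_cache`, its column names, and the function names are assumptions, and an in-memory database stands in for the on-disk SQLite file.

```python
import hashlib
import sqlite3

def prefix_hash(prefix: str) -> str:
    """SHA-256 hex digest of the prompt prefix, used as the cache key."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database; the schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv_cache ("
        "prefix_hash TEXT PRIMARY KEY, "
        "kv_blob TEXT NOT NULL, "
        "seq_len INTEGER NOT NULL)"
    )
    return conn

def store(conn, prefix, kv_blob, seq_len):
    """Insert or overwrite the entry for this prefix."""
    conn.execute(
        "INSERT OR REPLACE INTO kv_cache VALUES (?, ?, ?)",
        (prefix_hash(prefix), kv_blob, seq_len),
    )
    conn.commit()

def lookup(conn, prefix):
    """Return (kv_blob, seq_len) on a hit, or None on a miss."""
    return conn.execute(
        "SELECT kv_blob, seq_len FROM kv_cache WHERE prefix_hash = ?",
        (prefix_hash(prefix),),
    ).fetchone()

conn = open_cache()
prompt = "def foo():\n0"
miss = lookup(conn, prompt)                  # nothing cached yet
store(conn, prompt, "placeholder-blob", 5)   # placeholder instead of a real tensor
hit = lookup(conn, prompt)
print(miss, hit)
```

Keying on a hash of the exact prefix text means a lookup only hits when the prefix is byte-for-byte identical, which matches the behaviour described in the table.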

Quick Start

Clone the repository:

git clone https://github.com/m2ai-portfolio/local-ai-coding-agent-optimizer.git
cd local-ai-coding-agent-optimizer

Create a virtual environment and install dependencies:
```
python -m venv venv
source venv/bin/activate
pip install typer pytest
```
(torch is assumed to be installed already alongside your local LLM; if it is not, install it separately.)
Test the store subcommand with a simple prompt:
```
echo -e "def foo():\n0" | python -m optimizer store
```
You should see a base64‑encoded blob and an integer sequence length printed to stdout.

Examples

Storing a new prompt prefix

$ echo -e "def foo():\n0" | python -m optimizer store
eJzLVQ... (base64 blob) 5

The optimizer runs the LLM on the full prompt, caches the KV tensor, and outputs the serialized blob together with the sequence length.
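
As a rough picture of what "serialized blob" could mean, the sketch below round-trips a list of floats through zlib compression and base64. The real format used by `kv_utils.py` is not documented here, so this is an assumption; the `eJz` prefix of the sample output is merely consistent with base64-encoded zlib data. Plain Python floats stand in for torch tensors so the example runs without torch, and the function names are hypothetical.

```python
import base64
import struct
import zlib

def serialize_kv(values):
    """Pack floats as little-endian float32, compress, and base64-encode."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(zlib.compress(raw)).decode("ascii")

def deserialize_kv(blob):
    """Invert serialize_kv: decode, decompress, and unpack float32 values."""
    raw = zlib.decompress(base64.b64decode(blob))
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Values chosen to be exactly representable in float32.
kv = [0.0, 0.5, -1.25, 2.0]
blob = serialize_kv(kv)
restored = deserialize_kv(blob)
print(blob, restored == kv)
```

A text-safe encoding such as base64 is what lets the blob travel over stdout and stdin between piped commands.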

Looking up an existing prefix

$ echo -e "def foo():\n0" | python -m optimizer lookup
eJzLVQ... (same base64 blob) 5

Because the prefix hash matches a cached entry, the stored tensor is returned straight from the SQLite cache without re‑running the LLM.

Extending a cached KV with new tokens

$ echo -e "eJzLVQ...\n5\n10 20 30" | python -m optimizer extend
eJzLVQ... (updated blob) 8

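The command reads three lines from stdin: the cached blob, its sequence length (5), and the suffix tokens. It feeds the cached KV as past_key_values, processes only the suffix tokens [10, 20, 30], and returns the updated KV tensor covering the full prefix plus suffix, so the reported sequence length grows from 5 to 8.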
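
The bookkeeping behind `extend` can be sketched without a model. Everything below is a toy stand-in under stated assumptions: `toy_kv_for` replaces the real forward pass with a per-token formula, the function names are hypothetical, and the prefix token IDs are made up. The point is only that cached entries are reused untouched and new entries are computed for the suffix positions alone.

```python
def parse_extend_input(text):
    """Split the three-line stdin payload into blob, seq_len, suffix tokens."""
    blob, seq_len, suffix = text.strip().split("\n")
    return blob, int(seq_len), [int(t) for t in suffix.split()]

def toy_kv_for(token, position):
    """Stand-in for a model's key/value entry at one position (not a real LLM)."""
    return (token * 31 + position) % 97

def extend_kv(cached_kv, suffix_tokens):
    """Compute entries for the suffix only and append them to the cached ones."""
    start = len(cached_kv)
    new_entries = [toy_kv_for(t, start + i) for i, t in enumerate(suffix_tokens)]
    return cached_kv + new_entries

# Hypothetical prefix of five token IDs whose KV entries are already cached.
prefix_tokens = [1, 2, 3, 4, 5]
cached = [toy_kv_for(t, i) for i, t in enumerate(prefix_tokens)]

_, seq_len, suffix = parse_extend_input("eJzLVQ...\n5\n10 20 30")
extended = extend_kv(cached, suffix)

# Recomputing everything from scratch gives the same result, at higher cost.
full = [toy_kv_for(t, i) for i, t in enumerate(prefix_tokens + suffix)]
print(len(extended), extended == full)
```

In a real causal model each position's KV entry depends on earlier tokens but never on later ones, which is why a cached prefix stays valid when a suffix is appended.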

File Structure

Local AI Coding Agent Optimizer/
├── optimizer/
│   ├── __init__.py
│   ├── cli.py          # Typer entry point with lookup, store, extend subcommands
│   ├── db.py           # SQLite wrapper for KV cache persistence
│   ├── hashing.py      # Prefix hashing helpers (SHA‑256)
│   ├── kv_utils.py     # Tensor serialize/deserialize and LLM forward hooks
│   ├── models.py       # Dataclasses for KVCacheEntry and OptimizerIO
│   └── __main__.py     # Allows `python -m optimizer` execution
├── tests/
│   ├── test_cli.py
│   ├── test_db.py
│   ├── test_hashing.py
│   └── test_kv_utils.py
├── assets/
│   └── infographic.png
├── .gitignore
├── LICENSE
└── README.md

Tech Stack

Technology	Purpose
Python 3.11+	Core language and runtime
torch	Tensor handling and LLM inference (assumed pre‑installed)
sqlite3 (stdlib)	Persistent storage of KV cache entries
hashlib (stdlib)	SHA‑256 hashing of prompt prefixes
typer	Command‑line interface and subcommand parsing
pytest	Unit and integration testing

Contributing

Fork the repository, make your changes, run the test suite, and submit a pull request. Please follow the existing coding style and add tests for new functionality.

License

MIT

Author

Matthew Snow -- M2AI | @m2ai-portfolio
