Autonomous 10-LLM development harness for NVIDIA GH200. ZeroMQ debate orchestration, 6-tier memory, hybrid routing, self-improving prompts.
Live Demo: https://hekaton.herakles.dev — Real-time WebSocket dashboard showing debate rounds, agent metrics, and mission progress.
Hekaton is an autonomous coding harness that coordinates 10 large language models through structured multi-agent debate to solve complex software engineering tasks. It is not a chatbot, not a wrapper around a single API, and not a prompt-chaining library.
You give Hekaton a goal. A Planning Swarm of agents first researches the problem and debates an approach. An Architect then generates a plan. Sapper agents build the implementation across parallel debate rounds while an Auditor reviews each output against a structured rubric. The cycle loops until the code passes or a kill switch triggers. Every artifact that leaves the system has been adversarially reviewed.
The system runs 6 local open-source models (Qwen2.5, DeepSeek-Coder, Phi-4) via vLLM on GH200 unified memory, with 4 Gemini API roles (researcher, reasoner, worker, executor) for tasks that benefit from external knowledge. A hybrid LiteLLM router decides at request time whether to route to local inference or the API, achieving 98.9% API cost savings in production.
Every design decision in Hekaton is empirically validated. The A/B testing story is intentional: in Sprint 30 we ran four candidate features on real GH200 hardware. Only one survived (Planning Swarm). The other three were cut. That rigor is what makes the results reliable.
```
                       ┌─────────────────────────────┐
 Mission Goal ────────>│        Planning Swarm       │
                       │ DISCOVER → DEBATE → SCAFFOLD│
                       └────────────┬────────────────┘
                                    │ Plan
                                    v
                       ┌────────────────────────────┐
                       │         Architect          │
                       │  (system design + tasks)   │
                       └────────────┬───────────────┘
                                    │ Task list
                   ┌────────────────┼────────────────┐
                   v                v                v
              ┌──────────┐     ┌──────────┐     ┌──────────┐
              │  Sapper  │     │  Sapper  │     │  Sapper  │   (parallel)
              │ (builder)│     │ (builder)│     │ (builder)│
              └────┬─────┘     └────┬─────┘     └────┬─────┘
                   └────────────────┼────────────────┘
                                    │ Code artifacts
                                    v
                       ┌────────────────────────────┐
                       │          Auditor           │
                       │ (structured rubric review) │
                       └────────────┬───────────────┘
                                    │ PASS / loop back
                                    v
                       ┌────────────────────────────┐
                       │       SITREP + Memory      │
                       │  (pgvector, 6-tier store)  │
                       └────────────────────────────┘
```

Transport: ZeroMQ ipc:// | Router: LiteLLM | Memory: PostgreSQL + pgvector
We ran four candidate features on a live GH200 instance with statistical significance testing. Only one survived:
| Feature | Result | Notes |
|---|---|---|
| Planning Swarm | PROVEN | Statistically significant improvement |
| Dynamic Context Allocation | CUT | No measurable benefit at task scale |
| Speculative Parallel Draft | CUT | Latency gains outweighed by coordination overhead |
| Cross-Round Attention Sharing | CUT | Memory pressure exceeded gains |
The three cut features are gone. Not "parked for later" — removed. This is what scientific rigor looks like in LLM systems research.
TurboQuant implements the ICLR 2026 KV cache compression paper as a production vLLM hook:
- 5.33x compression ratio on KV cache
- 7.03 GB HBM3e freed on GH200
- 32K context unlocked for Formation E-TQ (was limited to 12K without compression)
- 30/30 tests PASS — zero regressions across all memory benchmarks
- Lloyd-Max quantizer + adaptive scalar quantization on attention heads
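To make the compression step concrete, here is a minimal pure-Python sketch of a 1-D Lloyd-Max quantizer: it alternates nearest-level assignment with centroid updates until the codebook converges. This is toy code on synthetic data, not the production Grace-CPU kernel in `research/turboquant/`.

```python
import random

def lloyd_max(samples, n_levels=4, iters=50):
    """1-D Lloyd-Max: alternate nearest-level assignment and
    centroid updates until the reconstruction levels converge."""
    lo, hi = min(samples), max(samples)
    # Initialize levels uniformly across the sample range.
    levels = [lo + (hi - lo) * (i + 0.5) / n_levels for i in range(n_levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(n_levels)]
        for x in samples:
            idx = min(range(n_levels), key=lambda i: abs(x - levels[i]))
            buckets[idx].append(x)
        # Each level moves to the mean of the samples assigned to it.
        new = [sum(b) / len(b) if b else levels[i] for i, b in enumerate(buckets)]
        if new == levels:
            break
        levels = new
    return levels

def quantize(x, levels):
    """Map a value to the index of its nearest reconstruction level."""
    return min(range(len(levels)), key=lambda i: abs(x - levels[i]))

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
levels = lloyd_max(data, n_levels=4)
codes = [quantize(x, levels) for x in data]
```

A 4-level codebook stores each value in 2 bits; the same idea applied per attention head, with scales adapted to each head's value distribution, is what yields the cache compression described above.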
```bash
git clone https://github.com/herakles-ai/hekaton-core.git
cd hekaton-core
./setup.sh

# Edit .env — set HEKATON_DB_PASSWORD, GEMINI_API_KEY
# (LAMBDA_API_KEY required only for GH200 cloud deploy)

docker compose up -d     # Start PostgreSQL (required for memory)
pytest tests/            # Verify: 1929 tests should pass

# Run a benchmark mission (local mock, no GPU required)
python3 scripts/run_benchmark.py --level 1
```

For GH200 deployment:

```bash
HEKATON_FORMATION=heavy-hitter-tq ./scripts/gh200-deploy.sh YOUR_GH200_IP
```

| Requirement | Notes |
|---|---|
| Python 3.10+ | 3.12 recommended |
| Docker + Compose | For PostgreSQL (pgvector image) |
| NVIDIA GH200 | 96GB HBM3e — for full formation runs |
| Gemini API key | Required for API-backed roles |
| Lambda Labs key | Required for cloud GH200 provisioning scripts |
Local development (without GH200) works for: unit tests, mock benchmarks, memory system, router calibration, dashboard.
A Formation is a named topology of LLM agents with defined roles, VRAM budgets, and routing rules. Formations are defined in config/formations/*.yaml.
| Formation | Models | VRAM | Context | Use Case |
|---|---|---|---|---|
| `heavy-hitter-tq` | 6 local + 4 API | ~73 GB | 32K | Full production (recommended) |
| `heavy-hitter` | 6 local + 4 API | ~78 GB | 12K | Production, no TurboQuant |
| `precision-swarm` | 10 local | ~87 GB | 8K | Full local, no API cost |
| `precision-swarm-tq` | 10 local | ~80 GB | 32K | Full local with TurboQuant |
| `precision-strike` | 3 local | ~25 GB | 8K | Lightweight, fast iteration |
| `red-blue-team` | 4 local | ~30 GB | 8K | Adversarial pair review |
| `swarm-debate` | 6 local | ~55 GB | 8K | Max debate rounds |
| `pipeline` | 3 local | ~20 GB | 8K | Sequential pipeline, low VRAM |
| `parallel-review` | 4 local | ~35 GB | 8K | Parallel rubric review |
| `the-hive` | 8 local | ~70 GB | 8K | High-concurrency swarm |
| `war-room` | 6 local + 2 API | ~55 GB | 12K | Balanced cost/quality |
Formation E-TQ (heavy-hitter-tq) is the recommended formation for serious work. It uses a 32B Sapper (the builder agent) for highest code quality, TurboQuant to expand context to 32K, and four Gemini API roles for research and reasoning tasks where external knowledge helps.
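For illustration, a formation file might look like the sketch below. The field names here are invented for this example, not Hekaton's actual schema — see `config/formations/*.yaml` for the real definitions.

```yaml
# Illustrative sketch only — not the actual Hekaton schema.
name: example-strike
roles:
  - role: sapper
    model: qwen2.5-coder-32b   # hypothetical model id
    backend: local             # served by vLLM
    vram_gb: 22
  - role: auditor
    model: gemini-flash        # hypothetical model id
    backend: api               # routed through LiteLLM
routing:
  prefer: local
  max_context: 32768
```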
All inter-agent communication runs over ipc:// ZeroMQ sockets. Agents are Python asyncio actors. No shared memory, no global state — pure message passing. The broker in war-room-gh200/zmq_broker/ manages routing, round arbitration, and kill switch enforcement.
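The actor pattern can be sketched with stdlib asyncio queues standing in for the ZeroMQ ipc:// sockets (the real broker in `war-room-gh200/zmq_broker/` adds routing, round arbitration, and the kill switch; the agent and message names below are invented for the example):

```python
import asyncio

async def agent(name, inbox, broker_q):
    """A toy actor: no shared state, communicates only via queues.
    Receives tasks from its inbox, replies to the broker queue."""
    while True:
        msg = await inbox.get()
        if msg is None:          # kill-switch sentinel
            break
        await broker_q.put((name, f"done:{msg}"))

async def main():
    broker_q = asyncio.Queue()
    inboxes = {n: asyncio.Queue() for n in ("sapper-1", "sapper-2")}
    tasks = [asyncio.create_task(agent(n, q, broker_q)) for n, q in inboxes.items()]
    # Broker fans work out over per-agent channels (ipc:// sockets in Hekaton).
    for q in inboxes.values():
        await q.put("build-module")
    results = [await broker_q.get() for _ in inboxes]
    for q in inboxes.values():   # enforce shutdown
        await q.put(None)
    await asyncio.gather(*tasks)
    return sorted(results)

results = asyncio.run(main())
```

Swapping the queues for ZeroMQ sockets changes the transport, not the model: each agent still owns its state and reacts only to messages.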
Before any code is written, a dedicated planning phase runs: DISCOVER (research the problem space), DEBATE (multi-agent argument over approaches), SCAFFOLD (generate structured plan). A/B validated on GH200 — Planning Swarm measurably improves final PASS rates.
research/turboquant/ implements an ICLR 2026 paper as a drop-in vLLM hook. The Lloyd-Max quantizer runs on Grace CPU, compresses attention head KV caches at inference time, and is transparent to the rest of the system.
PostgreSQL + pgvector stores six memory tiers: episodic, semantic, procedural, resource, knowledge, and core. The consolidator (war-room-gh200/memory/consolidator.py) periodically promotes high-value episodes to semantic memory. Retrieval uses hybrid BM25 + cosine similarity search. Validated at Gate G7: +14% PASS rate over memoryless baseline.
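The hybrid retrieval idea — blend a lexical BM25 score with vector cosine similarity — reduces to a few lines. This toy sketch uses whitespace tokens and 2-D embeddings; the production path queries PostgreSQL and pgvector, and the `alpha` blend weight here is illustrative:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Minimal BM25 over whitespace tokens."""
    N = len(docs)
    avgdl = sum(len(d.split()) for d in docs) / N
    tf = Counter(doc.split())
    score = 0.0
    for term in query.split():
        df = sum(1 for d in docs if term in d.split())
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        norm = f + k1 * (1 - b + b * len(doc.split()) / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_best(query, q_emb, docs, embeds, alpha=0.5):
    """Return the index of the doc with the best blended score."""
    scores = [alpha * bm25_score(query, d, docs) + (1 - alpha) * cosine(q_emb, e)
              for d, e in zip(docs, embeds)]
    return max(range(len(docs)), key=lambda i: scores[i])

docs = ["postgres vector search", "zeromq message broker", "prompt evolution opro"]
embeds = [[1, 0], [0, 1], [0.5, 0.5]]
best = hybrid_best("vector search", [1, 0], docs, embeds)
```

Lexical scoring catches exact identifiers; vector similarity catches paraphrases — memories relevant to a mission often match on only one of the two.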
war-room-gh200/router/ scores each request by complexity, budget, and capability requirements, then routes to local vLLM or the Gemini API. Cost tracking is per-mission. In production: 98.9% API cost reduction vs. routing everything to the API.
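The routing decision can be sketched as a simple scorer. The thresholds and request fields below are invented for illustration; Hekaton's router calibrates its scoring per formation:

```python
def route(request, budget_remaining, local_max_context=32_000):
    """Toy cost-aware router: prefer local vLLM, escalate to the API
    only when the request exceeds local capability or genuinely
    benefits from external knowledge, and only while budget remains."""
    if request["tokens"] > local_max_context:
        return "api"          # local context window exhausted
    if request.get("needs_external_knowledge") and budget_remaining > 0:
        return "api"          # e.g. researcher role querying recent docs
    if request["complexity"] > 0.9 and budget_remaining > 0:
        return "api"          # hardest requests justify API spend
    return "local"            # default: free local inference
```

Because most requests fall through to `"local"`, API spend concentrates on the small fraction of requests that actually need it — which is how a high headline cost saving like 98.9% becomes achievable.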
Every code artifact passes through war-room-gh200/formations/review_gate.py before acceptance. The Auditor agent evaluates against a structured rubric (correctness, style, security, test coverage). Code that fails goes back to the Sapper — not to the user.
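In spirit, the gate is a weighted rubric with hard floors on critical dimensions. The weights, threshold, and floor below are invented for this sketch, not Hekaton's actual values:

```python
# Illustrative rubric weights — not the production values.
RUBRIC = {"correctness": 0.4, "security": 0.3, "style": 0.15, "test_coverage": 0.15}

def review(scores, threshold=0.8, critical_floor=0.3):
    """Weighted rubric gate: the artifact passes only if the weighted
    score clears the threshold AND no critical dimension is too low."""
    weighted = sum(RUBRIC[k] * scores[k] for k in RUBRIC)
    critical_fail = any(scores[k] < critical_floor
                        for k in ("correctness", "security"))
    if weighted >= threshold and not critical_fail:
        return ("PASS", weighted)
    return ("LOOP_BACK", weighted)   # artifact goes back to the Sapper

verdict, score = review(
    {"correctness": 0.9, "security": 0.85, "style": 0.8, "test_coverage": 0.7})
```

The hard floor matters: without it, strong style and coverage scores could mask an insecure or incorrect artifact.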
war-room-gh200/self_improve/ implements OPRO-based prompt evolution. After each mission, trajectories are scored and used to generate improved prompt candidates. The best candidates are staged and promoted after validation. Validated at Gate G10: +20% PASS rate over static prompts.
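One OPRO-style iteration looks roughly like the sketch below: rank the current prompts by measured performance, generate candidates conditioned on the best performers, and keep whatever scores highest. Here a toy scoring function stands in for mission PASS rate, and the mutation is a trivial stand-in for an LLM-generated rewrite:

```python
import random

def opro_step(prompts, score_fn, mutate_fn, n_candidates=4):
    """One OPRO-style iteration: rank prompts by score, generate
    candidates from the best performer, return the new best."""
    ranked = sorted(prompts, key=score_fn, reverse=True)
    candidates = [mutate_fn(ranked[0]) for _ in range(n_candidates)]
    return max(ranked + candidates, key=score_fn)

# Toy stand-ins: score = prompt length; mutation appends an instruction.
random.seed(1)
score = len
mutate = lambda p: p + random.choice(
    [" Be precise.", " Cite sources.", " Think step by step."])
best = opro_step(["You are a coder."], score, mutate)
```

In the real loop, `score_fn` comes from scored mission trajectories and promotion happens only after a validation run, so a lucky candidate cannot regress the production prompts.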
Quality is enforced through 16 sequential gates (G0–G15). A gate must PASS before work begins on the next.
| Gate | Name | Key Metric |
|---|---|---|
| G0 | Mock passing | Local unit tests green |
| G1 | Single-model smoke | One model end-to-end |
| G2 | Formation bring-up | All roles connected |
| G3 | Rubric review | Adversarial gate live |
| G4 | Checkpointing | LangGraph PostgreSQL |
| G5 | 1hr endurance | 50% PASS, 18 missions |
| G6 | SGLang benchmark | vLLM vs SGLang comparison |
| G7 | 6-tier memory | +14% PASS rate |
| G8 | Hybrid router | 98.9% API cost savings |
| G9 | Formation swap | Hot-swap in production |
| G10 | Self-improvement | +20% PASS rate (OPRO) |
| G11 | TurboQuant | 5.33x compression, 7GB freed |
| G12 | Plug-and-play | 100% PASS, 12 instances |
| G13-G15 | Endurance + scaling | Not yet started |
Full gate history and criteria: docs/gates.md
1929 tests across unit, integration, and end-to-end:
```bash
pytest tests/                        # Full suite
pytest tests/unit/                   # Unit tests (fast, no GPU)
pytest tests/integration/            # Integration (requires Docker)
pytest tests/e2e/                    # End-to-end
pytest tests/ -k "turboquant"        # TurboQuant compression tests
pytest tests/ -k "memory"            # Memory system tests
pytest tests/ --cov=war-room-gh200   # With coverage
```

Key test files:

- `tests/e2e/test_turboquant_*.py` — 10 TurboQuant test modules, 30 core assertions
- `tests/integration/test_memory_integration.py` — 6-tier memory with live PostgreSQL
- `tests/unit/test_formation_runner.py` — Formation topology and routing
- `tests/unit/test_litellm_bridge.py` — Hybrid router calibration
Real-time WebSocket dashboard at http://localhost:8475 (or https://hekaton.herakles.dev):
```bash
source .env && python3 war-room-gh200/dashboard/api.py
```

Shows: active debate rounds, per-agent metrics, mission queue, memory tier utilization, router cost tracking.
All configuration is via environment variables. Copy .env.example and fill in:
| Variable | Required | Description |
|---|---|---|
| `HEKATON_DB_PASSWORD` | Yes | PostgreSQL password |
| `GEMINI_API_KEY` | Yes (API formations) | Google AI Studio API key |
| `LAMBDA_API_KEY` | GH200 deploy only | Lambda Labs cloud key |
| `HEKATON_FORMATION` | No | Formation to use (default: `heavy-hitter-tq`) |
| `HEKATON_LOG_LEVEL` | No | Log verbosity (default: `INFO`) |
| `HEKATON_DB_HOST` | No | PostgreSQL host (default: `localhost`) |
| `HEKATON_DB_PORT` | No | PostgreSQL port (default: `5470`) |
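A minimal `.env` for local development might look like this (placeholder values — copy from `.env.example`):

```bash
HEKATON_DB_PASSWORD=change-me
GEMINI_API_KEY=your-google-ai-studio-key
HEKATON_FORMATION=heavy-hitter-tq    # default
HEKATON_DB_HOST=localhost
HEKATON_DB_PORT=5470
# LAMBDA_API_KEY only needed for GH200 cloud deploy
```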
Formation-level configuration lives in config/formations/*.yaml. Role prompts are in config/prompts/{role}/{small,medium,large}.md — 31 prompt files covering 9 roles across 3 model size tiers.
This project includes a CLAUDE.md that gives Claude Code full context about the architecture, commands, and design decisions. Clone the repo, open it with Claude Code, and it will understand the codebase immediately.
```bash
git clone https://github.com/herakles-ai/hekaton-core.git
cd hekaton-core
claude   # Claude Code reads CLAUDE.md automatically
```

See CONTRIBUTING.md for development setup, test requirements, and PR process.
Areas where contributions are most valuable:
- New formation topologies (`config/formations/`)
- Additional LLM router complexity scorers
- TurboQuant quantization improvements
- Benchmark missions (`config/benchmarks/`)
- SGLang integration (currently vLLM only)
Apache License 2.0 — see LICENSE.
Copyright 2026 hekaton-core contributors.