Autonomous 10-LLM development harness for NVIDIA GH200. ZeroMQ debate orchestration, 6-tier memory, hybrid routing, self-improving prompts.
Live Demo: https://hekaton.herakles.dev — Real-time WebSocket dashboard showing debate rounds, agent metrics, and mission progress.
Hekaton is an autonomous coding harness that coordinates 10 large language models through structured multi-agent debate to solve complex software engineering tasks. It is not a chatbot, not a wrapper around a single API, and not a prompt-chaining library.
You give Hekaton a goal. A Planning Swarm of agents first researches the problem and debates an approach. An Architect then generates a plan. Sapper agents build the implementation across parallel debate rounds while an Auditor reviews each output against a structured rubric. The cycle loops until the code passes or a kill switch triggers. Every artifact that leaves the system has been adversarially reviewed.
The system runs 6 local open-source models (Qwen2.5, DeepSeek-Coder, Phi-4) via vLLM on GH200 unified memory, with 4 Gemini API roles (researcher, reasoner, worker, executor) for tasks that benefit from external knowledge. A hybrid LiteLLM router decides at request time whether to route to local inference or the API, achieving 98.9% API cost savings in production.
Every design decision in Hekaton is empirically validated. The A/B testing story is intentional: in Sprint 30 we ran four candidate features on real GH200 hardware. Only one survived (Planning Swarm). The other three were cut. That rigor is what makes the results reliable.
```
                       ┌─────────────────────────────┐
 Mission Goal ────────>│        Planning Swarm       │
                       │ DISCOVER → DEBATE → SCAFFOLD│
                       └────────────┬────────────────┘
                                    │ Plan
                                    v
                       ┌────────────────────────────┐
                       │         Architect          │
                       │  (system design + tasks)   │
                       └────────────┬───────────────┘
                                    │ Task list
                   ┌────────────────┼────────────────┐
                   v                v                v
              ┌──────────┐     ┌──────────┐     ┌──────────┐
              │  Sapper  │     │  Sapper  │     │  Sapper  │   (parallel)
              │ (builder)│     │ (builder)│     │ (builder)│
              └────┬─────┘     └────┬─────┘     └────┬─────┘
                   └────────────────┼────────────────┘
                                    │ Code artifacts
                                    v
                       ┌────────────────────────────┐
                       │          Auditor           │
                       │ (structured rubric review) │
                       └────────────┬───────────────┘
                                    │ PASS / loop back
                                    v
                       ┌────────────────────────────┐
                       │       SITREP + Memory      │
                       │  (pgvector, 6-tier store)  │
                       └────────────────────────────┘
```

Transport: ZeroMQ ipc:// | Router: LiteLLM | Memory: PostgreSQL + pgvector
We ran four candidate features on a live GH200 instance with statistical significance testing. Only one survived:
| Feature | Result | Notes |
|---|---|---|
| Planning Swarm | PROVEN | Statistically significant improvement |
| Dynamic Context Allocation | CUT | No measurable benefit at task scale |
| Speculative Parallel Draft | CUT | Latency gains outweighed by coordination overhead |
| Cross-Round Attention Sharing | CUT | Memory pressure exceeded gains |
The three cut features are gone. Not "parked for later" — removed. This is what scientific rigor looks like in LLM systems research.
TurboQuant implements the ICLR 2026 KV cache compression paper as a production vLLM hook:
- 5.33x compression ratio on KV cache
- 7.03 GB HBM3e freed on GH200
- 32K context unlocked for Formation E-TQ (was limited to 12K without compression)
- 30/30 tests PASS — zero regressions across all memory benchmarks
- Lloyd-Max quantizer + adaptive scalar quantization on attention heads
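To make the compression step concrete, here is a minimal pure-Python sketch of a 1-D Lloyd-Max quantizer: it alternates nearest-level assignment with centroid updates until the codebook converges. This is toy code on synthetic data, not the production Grace-CPU kernel in `research/turboquant/`.

```python
import random

def lloyd_max(samples, n_levels=4, iters=50):
    """1-D Lloyd-Max: alternate nearest-level assignment and
    centroid updates until the reconstruction levels converge."""
    lo, hi = min(samples), max(samples)
    # Initialize levels uniformly across the sample range.
    levels = [lo + (hi - lo) * (i + 0.5) / n_levels for i in range(n_levels)]
    for _ in range(iters):
        buckets = [[] for _ in range(n_levels)]
        for x in samples:
            idx = min(range(n_levels), key=lambda i: abs(x - levels[i]))
            buckets[idx].append(x)
        # Each level moves to the mean of the samples assigned to it.
        new = [sum(b) / len(b) if b else levels[i] for i, b in enumerate(buckets)]
        if new == levels:
            break
        levels = new
    return levels

def quantize(x, levels):
    """Map a value to the index of its nearest reconstruction level."""
    return min(range(len(levels)), key=lambda i: abs(x - levels[i]))

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
levels = lloyd_max(data, n_levels=4)
codes = [quantize(x, levels) for x in data]
```

A 4-level codebook stores each value in 2 bits; the same idea applied per attention head, with scales adapted to each head's value distribution, is what yields the cache compression described above.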
```bash
git clone https://github.com/herakles-ai/hekaton-core.git
cd hekaton-core
./setup.sh

# Edit .env — set HEKATON_DB_PASSWORD, GEMINI_API_KEY
# (LAMBDA_API_KEY required only for GH200 cloud deploy)

docker compose up -d     # Start PostgreSQL (required for memory)
pytest tests/            # Verify: 1929 tests should pass

# Run a benchmark mission (local mock, no GPU required)
python3 scripts/run_benchmark.py --level 1
```

For GH200 deployment:

```bash
HEKATON_FORMATION=heavy-hitter-tq ./scripts/gh200-deploy.sh YOUR_GH200_IP
```

| Requirement | Notes |
|---|---|
| Python 3.10+ | 3.12 recommended |
| Docker + Compose | For PostgreSQL (pgvector image) |
| NVIDIA GH200 | 96GB HBM3e — for full formation runs |
| Gemini API key | Required for API-backed roles |
| Lambda Labs key | Required for cloud GH200 provisioning scripts |
Local development (without GH200) works for: unit tests, mock benchmarks, memory system, router calibration, dashboard.
A Formation is a named topology of LLM agents with defined roles, VRAM budgets, and routing rules. Formations are defined in config/formations/*.yaml.
| Formation | Models | VRAM | Context | Use Case |
|---|---|---|---|---|
| `heavy-hitter-tq` | 6 local + 4 API | ~73 GB | 32K | Full production (recommended) |
| `heavy-hitter` | 6 local + 4 API | ~78 GB | 12K | Production, no TurboQuant |
| `precision-swarm` | 10 local | ~87 GB | 8K | Full local, no API cost |
| `precision-swarm-tq` | 10 local | ~80 GB | 32K | Full local with TurboQuant |
| `precision-strike` | 3 local | ~25 GB | 8K | Lightweight, fast iteration |
| `red-blue-team` | 4 local | ~30 GB | 8K | Adversarial pair review |
| `swarm-debate` | 6 local | ~55 GB | 8K | Max debate rounds |
| `pipeline` | 3 local | ~20 GB | 8K | Sequential pipeline, low VRAM |
| `parallel-review` | 4 local | ~35 GB | 8K | Parallel rubric review |
| `the-hive` | 8 local | ~70 GB | 8K | High-concurrency swarm |
| `war-room` | 6 local + 2 API | ~55 GB | 12K | Balanced cost/quality |
Formation E-TQ (heavy-hitter-tq) is the recommended formation for serious work. It uses a 32B Sapper (the builder agent) for highest code quality, TurboQuant to expand context to 32K, and four Gemini API roles for research and reasoning tasks where external knowledge helps.
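For illustration, a formation file might look like the sketch below. The field names here are invented for this example, not Hekaton's actual schema — see `config/formations/*.yaml` for the real definitions.

```yaml
# Illustrative sketch only — not the actual Hekaton schema.
name: example-strike
roles:
  - role: sapper
    model: qwen2.5-coder-32b   # hypothetical model id
    backend: local             # served by vLLM
    vram_gb: 22
  - role: auditor
    model: gemini-flash        # hypothetical model id
    backend: api               # routed through LiteLLM
routing:
  prefer: local
  max_context: 32768
```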
All inter-agent communication runs over ipc:// ZeroMQ sockets. Agents are Python asyncio actors. No shared memory, no global state — pure message passing. The broker in war-room-gh200/zmq_broker/ manages routing, round arbitration, and kill switch enforcement.
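The actor pattern can be sketched with stdlib asyncio queues standing in for the ZeroMQ ipc:// sockets (the real broker in `war-room-gh200/zmq_broker/` adds routing, round arbitration, and the kill switch; the agent and message names below are invented for the example):

```python
import asyncio

async def agent(name, inbox, broker_q):
    """A toy actor: no shared state, communicates only via queues.
    Receives tasks from its inbox, replies to the broker queue."""
    while True:
        msg = await inbox.get()
        if msg is None:          # kill-switch sentinel
            break
        await broker_q.put((name, f"done:{msg}"))

async def main():
    broker_q = asyncio.Queue()
    inboxes = {n: asyncio.Queue() for n in ("sapper-1", "sapper-2")}
    tasks = [asyncio.create_task(agent(n, q, broker_q)) for n, q in inboxes.items()]
    # Broker fans work out over per-agent channels (ipc:// sockets in Hekaton).
    for q in inboxes.values():
        await q.put("build-module")
    results = [await broker_q.get() for _ in inboxes]
    for q in inboxes.values():   # enforce shutdown
        await q.put(None)
    await asyncio.gather(*tasks)
    return sorted(results)

results = asyncio.run(main())
```

Swapping the queues for ZeroMQ sockets changes the transport, not the model: each agent still owns its state and reacts only to messages.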
Before any code is written, a dedicated planning phase runs: DISCOVER (research the problem space), DEBATE (multi-agent argument over approaches), SCAFFOLD (generate structured plan). A/B validated on GH200 — Planning Swarm measurably improves final PASS rates.
research/turboquant/ implements an ICLR 2026 paper as a drop-in vLLM hook. The Lloyd-Max quantizer runs on Grace CPU, compresses attention head KV caches at inference time, and is transparent to the rest of the system.
PostgreSQL + pgvector stores six memory tiers: episodic, semantic, procedural, resource, knowledge, and core. The consolidator (war-room-gh200/memory/consolidator.py) periodically promotes high-value episodes to semantic memory. Retrieval uses hybrid BM25 + cosine similarity search. Validated at Gate G7: +14% PASS rate over memoryless baseline.
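The hybrid retrieval idea — blend a lexical BM25 score with vector cosine similarity — reduces to a few lines. This toy sketch uses whitespace tokens and 2-D embeddings; the production path queries PostgreSQL and pgvector, and the `alpha` blend weight here is illustrative:

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.5, b=0.75):
    """Minimal BM25 over whitespace tokens."""
    N = len(docs)
    avgdl = sum(len(d.split()) for d in docs) / N
    tf = Counter(doc.split())
    score = 0.0
    for term in query.split():
        df = sum(1 for d in docs if term in d.split())
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        norm = f + k1 * (1 - b + b * len(doc.split()) / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_best(query, q_emb, docs, embeds, alpha=0.5):
    """Return the index of the doc with the best blended score."""
    scores = [alpha * bm25_score(query, d, docs) + (1 - alpha) * cosine(q_emb, e)
              for d, e in zip(docs, embeds)]
    return max(range(len(docs)), key=lambda i: scores[i])

docs = ["postgres vector search", "zeromq message broker", "prompt evolution opro"]
embeds = [[1, 0], [0, 1], [0.5, 0.5]]
best = hybrid_best("vector search", [1, 0], docs, embeds)
```

Lexical scoring catches exact identifiers; vector similarity catches paraphrases — memories relevant to a mission often match on only one of the two.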
war-room-gh200/router/ scores each request by complexity, budget, and capability requirements, then routes to local vLLM or the Gemini API. Cost tracking is per-mission. In production: 98.9% API cost reduction vs. routing everything to the API.
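The routing decision can be sketched as a simple scorer. The thresholds and request fields below are invented for illustration; Hekaton's router calibrates its scoring per formation:

```python
def route(request, budget_remaining, local_max_context=32_000):
    """Toy cost-aware router: prefer local vLLM, escalate to the API
    only when the request exceeds local capability or genuinely
    benefits from external knowledge, and only while budget remains."""
    if request["tokens"] > local_max_context:
        return "api"          # local context window exhausted
    if request.get("needs_external_knowledge") and budget_remaining > 0:
        return "api"          # e.g. researcher role querying recent docs
    if request["complexity"] > 0.9 and budget_remaining > 0:
        return "api"          # hardest requests justify API spend
    return "local"            # default: free local inference
```

Because most requests fall through to `"local"`, API spend concentrates on the small fraction of requests that actually need it — which is how a high headline cost saving like 98.9% becomes achievable.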
Every code artifact passes through war-room-gh200/formations/review_gate.py before acceptance. The Auditor agent evaluates against a structured rubric (correctness, style, security, test coverage). Code that fails goes back to the Sapper — not to the user.
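In spirit, the gate is a weighted rubric with hard floors on critical dimensions. The weights, threshold, and floor below are invented for this sketch, not Hekaton's actual values:

```python
# Illustrative rubric weights — not the production values.
RUBRIC = {"correctness": 0.4, "security": 0.3, "style": 0.15, "test_coverage": 0.15}

def review(scores, threshold=0.8, critical_floor=0.3):
    """Weighted rubric gate: the artifact passes only if the weighted
    score clears the threshold AND no critical dimension is too low."""
    weighted = sum(RUBRIC[k] * scores[k] for k in RUBRIC)
    critical_fail = any(scores[k] < critical_floor
                        for k in ("correctness", "security"))
    if weighted >= threshold and not critical_fail:
        return ("PASS", weighted)
    return ("LOOP_BACK", weighted)   # artifact goes back to the Sapper

verdict, score = review(
    {"correctness": 0.9, "security": 0.85, "style": 0.8, "test_coverage": 0.7})
```

The hard floor matters: without it, strong style and coverage scores could mask an insecure or incorrect artifact.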
war-room-gh200/self_improve/ implements OPRO-based prompt evolution. After each mission, trajectories are scored and used to generate improved prompt candidates. The best candidates are staged and promoted after validation. Validated at Gate G10: +20% PASS rate over static prompts.
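One OPRO-style iteration looks roughly like the sketch below: rank the current prompts by measured performance, generate candidates conditioned on the best performers, and keep whatever scores highest. Here a toy scoring function stands in for mission PASS rate, and the mutation is a trivial stand-in for an LLM-generated rewrite:

```python
import random

def opro_step(prompts, score_fn, mutate_fn, n_candidates=4):
    """One OPRO-style iteration: rank prompts by score, generate
    candidates from the best performer, return the new best."""
    ranked = sorted(prompts, key=score_fn, reverse=True)
    candidates = [mutate_fn(ranked[0]) for _ in range(n_candidates)]
    return max(ranked + candidates, key=score_fn)

# Toy stand-ins: score = prompt length; mutation appends an instruction.
random.seed(1)
score = len
mutate = lambda p: p + random.choice(
    [" Be precise.", " Cite sources.", " Think step by step."])
best = opro_step(["You are a coder."], score, mutate)
```

In the real loop, `score_fn` comes from scored mission trajectories and promotion happens only after a validation run, so a lucky candidate cannot regress the production prompts.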
Quality is enforced through 16 sequential gates (G0–G15). A gate must PASS before work begins on the next.
| Gate | Name | Key Metric |
|---|---|---|
| G0 | Mock passing | Local unit tests green |
| G1 | Single-model smoke | One model end-to-end |
| G2 | Formation bring-up | All roles connected |
| G3 | Rubric review | Adversarial gate live |
| G4 | Checkpointing | LangGraph PostgreSQL |
| G5 | 1hr endurance | 50% PASS, 18 missions |
| G6 | SGLang benchmark | vLLM vs SGLang comparison |
| G7 | 6-tier memory | +14% PASS rate |
| G8 | Hybrid router | 98.9% API cost savings |
| G9 | Formation swap | Hot-swap in production |
| G10 | Self-improvement | +20% PASS rate (OPRO) |
| G11 | TurboQuant | 5.33x compression, 7GB freed |
| G12 | Plug-and-play | 100% PASS, 12 instances |
| G13-G15 | Endurance + scaling | Not yet started |
Full gate history and criteria: docs/gates.md
1929 tests across unit, integration, and end-to-end:
```bash
pytest tests/                        # Full suite
pytest tests/unit/                   # Unit tests (fast, no GPU)
pytest tests/integration/            # Integration (requires Docker)
pytest tests/e2e/                    # End-to-end
pytest tests/ -k "turboquant"        # TurboQuant compression tests
pytest tests/ -k "memory"            # Memory system tests
pytest tests/ --cov=war-room-gh200   # With coverage
```

Key test files:

- `tests/e2e/test_turboquant_*.py` — 10 TurboQuant test modules, 30 core assertions
- `tests/integration/test_memory_integration.py` — 6-tier memory with live PostgreSQL
- `tests/unit/test_formation_runner.py` — Formation topology and routing
- `tests/unit/test_litellm_bridge.py` — Hybrid router calibration
Real-time WebSocket dashboard at http://localhost:8475 (or https://hekaton.herakles.dev):
```bash
source .env && python3 war-room-gh200/dashboard/api.py
```

Shows: active debate rounds, per-agent metrics, mission queue, memory tier utilization, router cost tracking.
All configuration is via environment variables. Copy .env.example and fill in:
| Variable | Required | Description |
|---|---|---|
| `HEKATON_DB_PASSWORD` | Yes | PostgreSQL password |
| `GEMINI_API_KEY` | Yes (API formations) | Google AI Studio API key |
| `LAMBDA_API_KEY` | GH200 deploy only | Lambda Labs cloud key |
| `HEKATON_FORMATION` | No | Formation to use (default: `heavy-hitter-tq`) |
| `HEKATON_LOG_LEVEL` | No | Log verbosity (default: `INFO`) |
| `HEKATON_DB_HOST` | No | PostgreSQL host (default: `localhost`) |
| `HEKATON_DB_PORT` | No | PostgreSQL port (default: `5470`) |
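A minimal `.env` for local development might look like this (placeholder values — copy from `.env.example`):

```bash
HEKATON_DB_PASSWORD=change-me
GEMINI_API_KEY=your-google-ai-studio-key
HEKATON_FORMATION=heavy-hitter-tq    # default
HEKATON_DB_HOST=localhost
HEKATON_DB_PORT=5470
# LAMBDA_API_KEY only needed for GH200 cloud deploy
```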
Formation-level configuration lives in config/formations/*.yaml. Role prompts are in config/prompts/{role}/{small,medium,large}.md — 31 prompt files covering 9 roles across 3 model size tiers.
This project includes a CLAUDE.md that gives Claude Code full context about the architecture, commands, and design decisions. Clone the repo, open it with Claude Code, and it will understand the codebase immediately.
```bash
git clone https://github.com/herakles-ai/hekaton-core.git
cd hekaton-core
claude   # Claude Code reads CLAUDE.md automatically
```

See CONTRIBUTING.md for development setup, test requirements, and PR process.
Areas where contributions are most valuable:
- New formation topologies (`config/formations/`)
- Additional LLM router complexity scorers
- TurboQuant quantization improvements
- Benchmark missions (`config/benchmarks/`)
- SGLang integration (currently vLLM only)
Apache License 2.0 — see LICENSE.
Copyright 2026 hekaton-core contributors.