
Hekaton Core

Autonomous 10-LLM development harness for NVIDIA GH200. ZeroMQ debate orchestration, 6-tier memory, hybrid routing, self-improving prompts.


Live Demo: https://hekaton.herakles.dev — Real-time WebSocket dashboard showing debate rounds, agent metrics, and mission progress.


What is Hekaton?

Hekaton is an autonomous coding harness that coordinates 10 large language models through structured multi-agent debate to solve complex software engineering tasks. It is not a chatbot, not a wrapper around a single API, and not a prompt-chaining library.

You give Hekaton a goal. A Planning Swarm of agents first researches the problem and debates an approach. An Architect then generates a plan. Sapper agents build the implementation across parallel debate rounds while an Auditor reviews each output against a structured rubric. The cycle loops until the code passes or a kill switch triggers. Every artifact that leaves the system has been adversarially reviewed.

The system runs 6 local open-source models (Qwen2.5, DeepSeek-Coder, Phi-4) via vLLM on GH200 unified memory, with 4 Gemini API roles (researcher, reasoner, worker, executor) for tasks that benefit from external knowledge. A hybrid LiteLLM router decides at request time whether to route to local inference or the API, achieving 98.9% API cost savings in production.

Every design decision in Hekaton is empirically validated. The A/B discipline is deliberate: in Sprint 30 we ran four candidate features on real GH200 hardware, and only one (Planning Swarm) survived. The other three were cut. That rigor is what makes the results reliable.


Architecture

```
                        ┌─────────────────────────────┐
  Mission Goal ────────>│       Planning Swarm        │
                        │ DISCOVER → DEBATE → SCAFFOLD│
                        └────────────┬────────────────┘
                                     │ Plan
                                     v
                        ┌────────────────────────────┐
                        │         Architect          │
                        │   (system design + tasks)  │
                        └────────────┬───────────────┘
                                     │ Task list
                    ┌────────────────┼────────────────┐
                    v                v                v
             ┌──────────┐    ┌──────────┐    ┌──────────┐
             │  Sapper  │    │  Sapper  │    │  Sapper  │  (parallel)
             │ (builder)│    │ (builder)│    │ (builder)│
             └────┬─────┘    └────┬─────┘    └────┬─────┘
                  └───────────────┼────────────────┘
                                  │ Code artifacts
                                  v
                        ┌────────────────────────────┐
                        │          Auditor           │
                        │  (structured rubric review)│
                        └────────────┬───────────────┘
                                     │ PASS / loop back
                                     v
                        ┌────────────────────────────┐
                        │      SITREP + Memory       │
                        │  (pgvector, 6-tier store)  │
                        └────────────────────────────┘

  Transport: ZeroMQ ipc://  |  Router: LiteLLM  |  Memory: PostgreSQL + pgvector
```

Recent Breakthroughs

Sprint 30 — A/B Validation on Real GH200 Hardware

We ran four candidate features on a live GH200 instance with statistical significance testing. Only one of four features survived:

| Feature | Result | Notes |
|---------|--------|-------|
| Planning Swarm | PROVEN | Statistically significant improvement |
| Dynamic Context Allocation | CUT | No measurable benefit at task scale |
| Speculative Parallel Draft | CUT | Latency gains outweighed by coordination overhead |
| Cross-Round Attention Sharing | CUT | Memory pressure exceeded gains |

The three cut features are gone. Not "parked for later" — removed. This is what scientific rigor looks like in LLM systems research.

Sprint 30b — TurboQuant KV Cache Compression (ICLR 2026)

TurboQuant implements the ICLR 2026 KV cache compression paper as a production vLLM hook:

  • 5.33x compression ratio on KV cache
  • 7.03 GB HBM3e freed on GH200
  • 32K context unlocked for Formation E-TQ (was limited to 12K without compression)
  • 30/30 tests PASS — zero regressions across all memory benchmarks
  • Lloyd-Max quantizer + adaptive scalar quantization on attention heads

Quick Start

```bash
git clone https://github.com/herakles-ai/hekaton-core.git
cd hekaton-core
./setup.sh

# Edit .env — set HEKATON_DB_PASSWORD, GEMINI_API_KEY
# (LAMBDA_API_KEY required only for GH200 cloud deploy)

docker compose up -d          # Start PostgreSQL (required for memory)
pytest tests/                 # Verify: 1929 tests should pass

# Run a benchmark mission (local mock, no GPU required)
python3 scripts/run_benchmark.py --level 1
```

For GH200 deployment:

```bash
HEKATON_FORMATION=heavy-hitter-tq ./scripts/gh200-deploy.sh YOUR_GH200_IP
```

Prerequisites

| Requirement | Notes |
|-------------|-------|
| Python 3.10+ | 3.12 recommended |
| Docker + Compose | For PostgreSQL (pgvector image) |
| NVIDIA GH200 | 96 GB HBM3e; required for full formation runs |
| Gemini API key | Required for API-backed roles |
| Lambda Labs key | Required for cloud GH200 provisioning scripts |

Local development (without GH200) works for: unit tests, mock benchmarks, memory system, router calibration, dashboard.


Formations

A Formation is a named topology of LLM agents with defined roles, VRAM budgets, and routing rules. Formations are defined in config/formations/*.yaml.

| Formation | Models | VRAM | Context | Use Case |
|-----------|--------|------|---------|----------|
| heavy-hitter-tq | 6 local + 4 API | ~73 GB | 32K | Full production (recommended) |
| heavy-hitter | 6 local + 4 API | ~78 GB | 12K | Production, no TurboQuant |
| precision-swarm | 10 local | ~87 GB | 8K | Full local, no API cost |
| precision-swarm-tq | 10 local | ~80 GB | 32K | Full local with TurboQuant |
| precision-strike | 3 local | ~25 GB | 8K | Lightweight, fast iteration |
| red-blue-team | 4 local | ~30 GB | 8K | Adversarial pair review |
| swarm-debate | 6 local | ~55 GB | 8K | Max debate rounds |
| pipeline | 3 local | ~20 GB | 8K | Sequential pipeline, low VRAM |
| parallel-review | 4 local | ~35 GB | 8K | Parallel rubric review |
| the-hive | 8 local | ~70 GB | 8K | High-concurrency swarm |
| war-room | 6 local + 2 API | ~55 GB | 12K | Balanced cost/quality |

Formation E-TQ (heavy-hitter-tq) is the recommended formation for serious work. It uses a 32B Sapper (the builder agent) for highest code quality, TurboQuant to expand context to 32K, and four Gemini API roles for research and reasoning tasks where external knowledge helps.
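For illustration only, a formation file might look roughly like the sketch below. The field names, model IDs, and values here are hypothetical, not the project's actual schema; consult config/formations/*.yaml for the real format.

```yaml
# Hypothetical formation sketch — field names are illustrative only.
name: heavy-hitter-tq
context_window: 32768
turboquant: true
roles:
  - role: sapper          # builder agent, largest local model
    backend: local-vllm
    model: qwen2.5-coder-32b
    vram_gb: 22
  - role: researcher      # external-knowledge tasks go to the API
    backend: gemini-api
routing:
  default: local
  escalate_to_api_when: needs_external_knowledge
```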


Key Innovations

1. ZeroMQ Debate Bus

All inter-agent communication runs over ipc:// ZeroMQ sockets. Agents are Python asyncio actors: no shared memory, no global state, pure message passing. The broker in war-room-gh200/zmq_broker/ manages routing, round arbitration, and kill-switch enforcement.
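The actor pattern can be sketched with the standard library alone. Here asyncio.Queue stands in for the ZeroMQ ipc:// sockets the real broker uses, and the agent and message field names are invented for the sketch:

```python
import asyncio

async def sapper(name, inbox, outbox):
    # Each agent is an asyncio actor: it owns no shared state and
    # interacts with the rest of the system only via messages.
    task = await inbox.get()
    await outbox.put({"agent": name, "artifact": f"code for {task}"})

async def broker(tasks):
    # The broker fans tasks out to parallel builders and collects artifacts.
    inbox, outbox = asyncio.Queue(), asyncio.Queue()
    agents = [asyncio.create_task(sapper(f"sapper-{i}", inbox, outbox))
              for i in range(len(tasks))]
    for t in tasks:
        await inbox.put(t)
    artifacts = [await outbox.get() for _ in tasks]
    await asyncio.gather(*agents)
    return artifacts

artifacts = asyncio.run(broker(["parse config", "write tests"]))
```

Swapping the queues for ZeroMQ sockets changes the transport, not the shape of the code: the actors still see only messages.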

2. Planning Swarm (Phase 0)

Before any code is written, a dedicated planning phase runs: DISCOVER (research the problem space), DEBATE (multi-agent argument over approaches), SCAFFOLD (generate structured plan). A/B validated on GH200 — Planning Swarm measurably improves final PASS rates.

3. TurboQuant KV Cache Compression

research/turboquant/ implements an ICLR 2026 paper as a drop-in vLLM hook. The Lloyd-Max quantizer runs on Grace CPU, compresses attention head KV caches at inference time, and is transparent to the rest of the system.
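For intuition only, here is a generic Lloyd-Max scalar quantizer sketch on toy scalar data. The production hook operates on attention-head KV tensors and is far more involved; this merely shows the alternating assign/update iteration the method is named for:

```python
def lloyd_max(values, levels=4, iters=50):
    """Generic Lloyd-Max scalar quantizer: alternate nearest-codeword
    assignment and centroid update until the codebook stabilizes."""
    lo, hi = min(values), max(values)
    # Initialize codewords uniformly over the value range.
    codebook = [lo + (hi - lo) * (i + 0.5) / levels for i in range(levels)]
    for _ in range(iters):
        # Assignment step: bucket each value with its nearest codeword.
        buckets = [[] for _ in range(levels)]
        for v in values:
            idx = min(range(levels), key=lambda i: abs(v - codebook[i]))
            buckets[idx].append(v)
        # Update step: move each codeword to the mean of its bucket.
        codebook = [sum(b) / len(b) if b else c
                    for b, c in zip(buckets, codebook)]
    return codebook

vals = [0.1, 0.12, 0.5, 0.52, 0.9, 0.91, 0.3, 0.33]
cb = lloyd_max(vals, levels=3)
# Quantize: replace each value with its nearest codeword.
quantized = [min(cb, key=lambda c: abs(v - c)) for v in vals]
```

Storing a small codeword index instead of each full-precision value is where the compression ratio comes from.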

4. 6-Tier Memory

PostgreSQL + pgvector stores six memory tiers: episodic, semantic, procedural, resource, knowledge, and core. The consolidator (war-room-gh200/memory/consolidator.py) periodically promotes high-value episodes to semantic memory. Retrieval uses hybrid BM25 + cosine similarity search. Validated at Gate G7: +14% PASS rate over memoryless baseline.
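The retrieval idea can be sketched as a score fusion: a deliberately simplified BM25 plus cosine similarity over toy two-dimensional "embeddings". The real system uses PostgreSQL and pgvector; the function names and the alpha weighting below are invented for the sketch:

```python
import math
from collections import Counter

def bm25(query_terms, doc, corpus, k1=1.5, b=0.75):
    # Simplified BM25: lexical relevance of one doc to the query terms.
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        score += idf * tf[t] * (k1 + 1) / (
            tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def hybrid_rank(query_terms, query_vec, docs, alpha=0.5):
    # Fuse lexical and vector scores, then rank memories by the blend.
    corpus = [d["tokens"] for d in docs]
    scored = [(alpha * bm25(query_terms, d["tokens"], corpus)
               + (1 - alpha) * cosine(query_vec, d["vec"]), d["id"])
              for d in docs]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

docs = [
    {"id": "ep1", "tokens": ["fix", "router", "bug"], "vec": [1.0, 0.0]},
    {"id": "ep2", "tokens": ["write", "unit", "tests"], "vec": [0.0, 1.0]},
]
ranking = hybrid_rank(["router", "bug"], [1.0, 0.0], docs)
```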

5. Hybrid LLM Router

war-room-gh200/router/ scores each request by complexity, budget, and capability requirements, then routes to local vLLM or the Gemini API. Cost tracking is per-mission. In production: 98.9% API cost reduction vs. routing everything to the API.
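Schematically, the routing decision looks like the following. The thresholds, request field names, and weights are invented for illustration; the actual scorers live in war-room-gh200/router/ and are driven by LiteLLM:

```python
def route(request, budget_remaining):
    """Toy complexity scorer: prefer local vLLM unless the request
    clearly benefits from the API and budget allows it."""
    score = 0
    score += 2 if request.get("needs_web_knowledge") else 0
    score += 1 if request.get("reasoning_depth", 0) > 2 else 0
    score += 1 if len(request.get("prompt", "")) > 8000 else 0
    if score >= 3 and budget_remaining > 0.10:
        return "gemini-api"
    return "local-vllm"

decision = route({"needs_web_knowledge": True, "reasoning_depth": 3},
                 budget_remaining=5.0)
```

Because the default branch is local inference, API spend only occurs when the scorer makes a positive case for it, which is how the large cost savings arise.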

6. Adversarial Review Gate

Every code artifact passes through war-room-gh200/formations/review_gate.py before acceptance. The Auditor agent evaluates against a structured rubric (correctness, style, security, test coverage). Code that fails goes back to the Sapper — not to the user.
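The gate reduces to a threshold check per rubric dimension. This sketch uses the four dimensions named above with invented thresholds and return values; the real gate is in war-room-gh200/formations/review_gate.py:

```python
# Hypothetical thresholds — the rubric dimensions come from the text above.
RUBRIC = {"correctness": 0.9, "style": 0.7, "security": 0.9, "test_coverage": 0.8}

def review(scores):
    """Return PASS only if every dimension clears its threshold;
    otherwise name the failing dimensions so the builder can retry."""
    failures = [dim for dim, threshold in RUBRIC.items()
                if scores.get(dim, 0.0) < threshold]
    return ("PASS", []) if not failures else ("LOOP_BACK", failures)

status, failures = review(
    {"correctness": 0.95, "style": 0.8, "security": 0.92, "test_coverage": 0.85})
```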

7. Self-Improving Prompts (OPRO)

war-room-gh200/self_improve/ implements OPRO-based prompt evolution. After each mission, trajectories are scored and used to generate improved prompt candidates. The best candidates are staged and promoted after validation. Validated at Gate G10: +20% PASS rate over static prompts.
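The promotion rule can be sketched as a propose-evaluate-promote loop, where the stand-in callables below replace the LLM calls that generate and score candidate prompts:

```python
def evolve(incumbent, generate_candidates, evaluate, margin=0.0):
    """Keep the incumbent prompt unless a candidate beats it on the
    validation metric by at least `margin`."""
    best_prompt, best_score = incumbent, evaluate(incumbent)
    for candidate in generate_candidates(incumbent):
        score = evaluate(candidate)
        if score > best_score + margin:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score

# Toy stand-ins: fixed candidate list and a lookup table of PASS rates.
pass_rates = {"v1": 0.50, "v1-a": 0.48, "v1-b": 0.62}
best, score = evolve("v1",
                     generate_candidates=lambda p: ["v1-a", "v1-b"],
                     evaluate=lambda p: pass_rates[p])
```

The staging-then-promotion step described above corresponds to the margin check: a candidate is only promoted after it demonstrably beats the incumbent.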


Gate Protocol

Quality is enforced through 16 sequential gates (G0–G15). A gate must PASS before work begins on the next.

| Gate | Name | Key Metric |
|------|------|------------|
| G0 | Mock passing | Local unit tests green |
| G1 | Single-model smoke | One model end-to-end |
| G2 | Formation bring-up | All roles connected |
| G3 | Rubric review | Adversarial gate live |
| G4 | Checkpointing | LangGraph PostgreSQL |
| G5 | 1 hr endurance | 50% PASS, 18 missions |
| G6 | SGLang benchmark | vLLM vs SGLang comparison |
| G7 | 6-tier memory | +14% PASS rate |
| G8 | Hybrid router | 98.9% API cost savings |
| G9 | Formation swap | Hot-swap in production |
| G10 | Self-improvement | +20% PASS rate (OPRO) |
| G11 | TurboQuant | 5.33x compression, 7 GB freed |
| G12 | Plug-and-play | 100% PASS, 12 instances |
| G13–G15 | Endurance + scaling | Not yet started |

Full gate history and criteria: docs/gates.md


Test Suite

1929 tests across unit, integration, and end-to-end:

```bash
pytest tests/                          # Full suite
pytest tests/unit/                     # Unit tests (fast, no GPU)
pytest tests/integration/              # Integration (requires Docker)
pytest tests/e2e/                      # End-to-end
pytest tests/ -k "turboquant"          # TurboQuant compression tests
pytest tests/ -k "memory"              # Memory system tests
pytest tests/ --cov=war-room-gh200     # With coverage
```

Key test files:

  • tests/e2e/test_turboquant_*.py — 10 TurboQuant test modules, 30 core assertions
  • tests/integration/test_memory_integration.py — 6-tier memory with live PostgreSQL
  • tests/unit/test_formation_runner.py — Formation topology and routing
  • tests/unit/test_litellm_bridge.py — Hybrid router calibration

Dashboard

Real-time WebSocket dashboard at http://localhost:8475 (or https://hekaton.herakles.dev):

```bash
source .env && python3 war-room-gh200/dashboard/api.py
```

Shows: active debate rounds, per-agent metrics, mission queue, memory tier utilization, router cost tracking.


Configuration

All configuration is via environment variables. Copy .env.example and fill in:

| Variable | Required | Description |
|----------|----------|-------------|
| HEKATON_DB_PASSWORD | Yes | PostgreSQL password |
| GEMINI_API_KEY | Yes (API formations) | Google AI Studio API key |
| LAMBDA_API_KEY | GH200 deploy only | Lambda Labs cloud key |
| HEKATON_FORMATION | No | Formation to use (default: heavy-hitter-tq) |
| HEKATON_LOG_LEVEL | No | Log verbosity (default: INFO) |
| HEKATON_DB_HOST | No | PostgreSQL host (default: localhost) |
| HEKATON_DB_PORT | No | PostgreSQL port (default: 5470) |

Formation-level configuration lives in config/formations/*.yaml. Role prompts are in config/prompts/{role}/{small,medium,large}.md — 31 prompt files covering 9 roles across 3 model size tiers.
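A minimal .env for local development might look like this (placeholder values only; substitute your own secrets):

```shell
# Placeholder values — do not commit real secrets.
HEKATON_DB_PASSWORD=change-me
GEMINI_API_KEY=your-gemini-key
HEKATON_FORMATION=heavy-hitter-tq
HEKATON_LOG_LEVEL=INFO
HEKATON_DB_HOST=localhost
HEKATON_DB_PORT=5470
```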


Using with Claude Code

This project includes a CLAUDE.md that gives Claude Code full context about the architecture, commands, and design decisions. Clone the repo, open it with Claude Code, and it will understand the codebase immediately.

```bash
git clone https://github.com/herakles-ai/hekaton-core.git
cd hekaton-core
claude    # Claude Code reads CLAUDE.md automatically
```

Contributing

See CONTRIBUTING.md for development setup, test requirements, and PR process.

Areas where contributions are most valuable:

  • New formation topologies (config/formations/)
  • Additional LLM router complexity scorers
  • TurboQuant quantization improvements
  • Benchmark missions (config/benchmarks/)
  • SGLang integration (currently vLLM only)

License

Apache License 2.0 — see LICENSE.

Copyright 2026 hekaton-core contributors.
