FormicOS Runbook

Operational guide for deploying and running FormicOS.

Hardware Requirements

GPU (recommended)

Component	VRAM	Notes
Qwen3-30B-A3B Q4_K_M (LLM)	~21.1 GB	MoE, 3.3B active params/token
Qwen3-Embedding-0.6B Q8_0	~0.7 GB	1024-dim dense embeddings
Total	~21.8 GB	Fits RTX 3090/4090/5090 with headroom

Minimum GPU: 24 GB VRAM (RTX 3090, RTX 4090, A5000)
Recommended GPU: 32 GB VRAM (RTX 5090) for comfortable headroom
CPU: 8+ cores recommended (llama.cpp uses --threads 8 --threads-batch 16)
RAM: 32 GB minimum, 64 GB recommended
Disk: ~20 GB for model files + Docker images
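
The VRAM budget in the table above is plain addition. A minimal sketch of that arithmetic, using the table's approximate figures; the function name is illustrative, and the result ignores anything else resident on the GPU (desktop compositor, other processes):

```python
# Approximate VRAM figures from the table above, in GB.
COMPONENT_VRAM_GB = {
    "Qwen3-30B-A3B Q4_K_M (LLM)": 21.1,
    "Qwen3-Embedding-0.6B Q8_0": 0.7,
}


def vram_headroom_gb(gpu_vram_gb: float) -> float:
    """Return the VRAM left after loading both models, rounded to 0.1 GB."""
    return round(gpu_vram_gb - sum(COMPONENT_VRAM_GB.values()), 1)


# Headroom on the minimum (24 GB) and recommended (32 GB) cards.
for label, vram in [("24 GB card", 24.0), ("32 GB card", 32.0)]:
    print(f"{label}: {vram_headroom_gb(vram)} GB headroom")
```

On a 24 GB card this leaves roughly 2 GB, which is why the 32 GB card is the comfortable choice.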

Cloud-only (no GPU)

Set ANTHROPIC_API_KEY and/or GEMINI_API_KEY in .env. The tier system routes all castes to cloud providers. No local GPU needed. Requires internet access.

Model Downloads

Download models before first docker compose up:

mkdir -p .models && cd .models

# LLM (required for local inference)
huggingface-cli download Qwen/Qwen3-30B-A3B-Instruct-2507-GGUF \
  Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf --local-dir .

# Embedding sidecar (required for vector search)
huggingface-cli download Qwen/Qwen3-Embedding-0.6B-GGUF \
  Qwen3-Embedding-0.6B-Q8_0.gguf --local-dir .

Install huggingface-cli if needed: pip install "huggingface_hub[cli]" (the quotes stop shells such as zsh from expanding the square brackets)
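
Before the first docker compose up, it can help to confirm both files landed where the containers expect them. A small sketch, assuming the default .models directory and the two filenames from the download commands above; the helper name is hypothetical:

```python
from pathlib import Path

# Filenames taken from the download commands above.
REQUIRED_MODELS = (
    "Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf",
    "Qwen3-Embedding-0.6B-Q8_0.gguf",
)


def missing_models(model_dir: str = ".models") -> list[str]:
    """Return the required GGUF filenames that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_MODELS if not (root / name).is_file()]


missing = missing_models()
print("all models present" if not missing else f"missing: {missing}")
```

Run it from the repository root. If LLM_MODEL_DIR or LLM_MODEL_FILE is overridden in .env, adjust the arguments to match.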

Docker Setup

1. Configure environment

cp .env.example .env
# Edit .env to add API keys (optional) or override defaults

Multi-GPU pinning

On multi-GPU systems, pin inference to a specific GPU by setting CUDA_DEVICE in .env:

CUDA_DEVICE=0   # GPU index for LLM + embedding (default: 0)

This sets CUDA_VISIBLE_DEVICES inside the llm and formicos-embed containers. The compose file also sets device_ids in the deploy block, but Docker Desktop on WSL2 ignores device_ids and passes through all GPUs. CUDA_VISIBLE_DEVICES is the effective control on WSL2.

Without pinning, llama.cpp may split model layers across GPUs, causing segfaults on the secondary GPU or severe throughput degradation (e.g., 1 tok/s instead of 240 tok/s).
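
To double-check which GPU index a given .env pins, the value can be read with a few lines of Python. This is a deliberately simplified sketch and not Docker Compose's actual parser: it ignores quoting rules and only strips inline comments that are preceded by a space. The function name is hypothetical.

```python
def read_env_value(env_text: str, key: str, default: str) -> str:
    """Return the value of key from .env-style text, or default if unset.

    Simplified: skips blank lines and full-line comments, strips inline
    ' # ...' comments, and does not interpret quoted values.
    """
    value = default
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, rest = line.partition("=")
        if name.strip() == key:
            # Later assignments override earlier ones; empty means default.
            value = rest.split(" #", 1)[0].strip() or default
    return value


sample = "CUDA_DEVICE=1   # GPU index for LLM + embedding\n"
print(read_env_value(sample, "CUDA_DEVICE", default="0"))
```

The default of "0" mirrors the documented default for CUDA_DEVICE.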

2. Start all services

docker compose up -d

This starts five containers:

Container	Port	Purpose
`formicos-llm`	8008 (host) → 8080 (container)	llama.cpp LLM inference
`formicos-embed`	8200	Qwen3-Embedding sidecar
`formicos-qdrant`	6333, 6334	Qdrant vector store (hybrid search)
`formicos-docker-proxy`	— (internal 2375)	Docker socket proxy for sandbox spawning
`formicos-colony`	8080	FormicOS application

3. Verify health

docker compose ps          # all containers should show "healthy"
curl http://localhost:8080/health

The app health payload now includes replay/bootstrap counts:

{
  "status": "ok",
  "last_seq": 0,
  "bootstrapped": true,
  "workspaces": 1,
  "threads": 1,
  "colonies": 0
}
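
For scripted checks, the same payload can be fetched and interpreted programmatically. A minimal sketch using only the standard library; treating "status is ok and bootstrapped is true" as the readiness condition is an assumption based on the sample payload above, and the function names are illustrative:

```python
import json
import urllib.request


def is_ready(payload: dict) -> bool:
    """True when the app reports status 'ok' and has finished bootstrapping."""
    return payload.get("status") == "ok" and payload.get("bootstrapped") is True


def fetch_health(url: str = "http://localhost:8080/health",
                 timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Usage, with the stack running:
#   health = fetch_health()
#   print("ready" if is_ready(health) else "not ready", health)
```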

4. Open the UI

Navigate to http://localhost:8080. On first boot you'll see the Queen tab after a short startup panel. Once the workspace tree appears, the Queen posts the welcome message and getting-started instructions.

First Colony Walkthrough

1. Open http://localhost:8080 and wait for the startup panel to clear
2. Read the welcome message from the Queen
3. Click the + button or type a task in the input area
   Example: Write a Python function that validates email addresses with tests
4. The Queen suggests a team composition with castes and tiers
5. Adjust the team if desired, then click Spawn
6. Watch the colony run -- chat messages appear for each round
7. Colony completes with a cost summary, artifacts, and knowledge extraction

API Keys

Cloud model keys are optional. Set them in .env:

ANTHROPIC_API_KEY=sk-ant-...    # Enables Anthropic models (Claude)
GEMINI_API_KEY=AI...            # Enables Google Gemini models

The tier system (heavy / standard / light) routes castes to appropriate providers. Cloud keys enable fallback when local inference is slow or unavailable.

Configuration Reference

All settings in .env (see .env.example for defaults). For a comprehensive reference, see DEPLOYMENT.md.

Variable	Default	Purpose
`ANTHROPIC_API_KEY`	(none)	Anthropic API key for cloud models
`GEMINI_API_KEY`	(none)	Gemini API key for cloud models
`LLM_IMAGE`	`local/llama.cpp:server-cuda-blackwell`	Docker image for LLM inference
`LLM_MODEL_FILE`	`Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf`	GGUF model filename
`LLM_CONTEXT_SIZE`	`80000`	Context window size for local LLM
`LLM_SLOTS`	`2`	Concurrent inference slots
`LLM_PORT`	`8008`	Host-side port for LLM container
`LLM_MODEL_DIR`	`./.models`	Shared model directory (LLM + embedding)
`EMBED_GPU_LAYERS`	`99`	GPU layers for embedding (0 = CPU-only)
`CUDA_DEVICE`	`0`	GPU index for LLM + embedding
`FORMICOS_DATA_DIR`	`./data` (dev) / `/data` (Docker)	Persistent data directory
`SANDBOX_ENABLED`	`true`	Enable/disable Docker sandbox

Colony Templates

FormicOS ships with 7 built-in templates in config/templates/:

Template	Castes	Budget	Rounds	Use case
`full-stack`	2x coder, reviewer, archivist	$5	12	General implementation
`code-review`	coder, reviewer	$3	10	Code review and fixes
`debugging`	coder, reviewer	$4	10	Bug diagnosis and repair
`research-heavy`	2x researcher, archivist	$5	15	Deep research tasks
`documentation`	researcher, archivist	$3	8	Documentation generation
`rapid-prototype`	1x coder (heavy tier)	$5	8	Fast prototyping with cloud model
`minimal`	1x coder	$2	6	Quick single-agent tasks

Templates with an archivist caste extract knowledge graph entities.
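
The table above can be queried as data when picking a template. The sketch below transcribes the table into a dictionary; the structure and field names are illustrative and do not reflect the actual file format in config/templates/:

```python
# Built-in templates, transcribed from the table above.
# castes maps caste name -> agent count; budget is in USD.
TEMPLATES = {
    "full-stack": {"castes": {"coder": 2, "reviewer": 1, "archivist": 1}, "budget": 5, "rounds": 12},
    "code-review": {"castes": {"coder": 1, "reviewer": 1}, "budget": 3, "rounds": 10},
    "debugging": {"castes": {"coder": 1, "reviewer": 1}, "budget": 4, "rounds": 10},
    "research-heavy": {"castes": {"researcher": 2, "archivist": 1}, "budget": 5, "rounds": 15},
    "documentation": {"castes": {"researcher": 1, "archivist": 1}, "budget": 3, "rounds": 8},
    "rapid-prototype": {"castes": {"coder": 1}, "budget": 5, "rounds": 8},
    "minimal": {"castes": {"coder": 1}, "budget": 2, "rounds": 6},
}


def templates_with(caste: str) -> list[str]:
    """Return the names of templates that include the given caste, sorted."""
    return sorted(name for name, t in TEMPLATES.items() if caste in t["castes"])


print(templates_with("archivist"))
```

Filtering on archivist yields the three templates that exercise knowledge graph extraction.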

Troubleshooting

LLM container won't start

Model file missing: Ensure the GGUF file exists in the .models/ directory
VRAM insufficient: Check nvidia-smi. Try a smaller quant (Q3_K_M) or reduce LLM_CONTEXT_SIZE
NVIDIA driver: Ensure the NVIDIA Container Toolkit is installed, then register it with Docker: sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker

Embedding sidecar unhealthy / segfaults

Check logs: docker logs formicos-embed
Verify model exists: ls .models/Qwen3-Embedding-0.6B-Q8_0.gguf
The sidecar needs ~700 MB VRAM -- check GPU memory pressure
Multi-GPU crash: If the sidecar splits layers across GPUs and segfaults, set CUDA_DEVICE=0 in .env to pin it to a single GPU (see Multi-GPU pinning above)

Qdrant won't connect

Check status: curl http://localhost:6333/healthz
Check logs: docker logs formicos-qdrant
Qdrant v1.16.2 is required for server-side BM25 support

"No embedding function" warnings

The embedding sidecar must be healthy before FormicOS starts
Check docker compose ps -- formicos-embed should show healthy
FormicOS depends on the sidecar via depends_on with health condition

Colony stuck / no progress

Check LLM health: curl http://localhost:8008/health
Check structured logs: docker logs formicos-colony 2>&1 | grep -i error (2>&1 includes anything the container wrote to stderr)
Verify API keys if using cloud models: invalid keys cause silent failures (check logs for fallback_triggered or provider_cooldown)
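
Because the application logs are structlog JSON, the two event names above can be filtered more precisely than with grep. A sketch, assuming each log line is one JSON object and that the event name sits under structlog's default "event" key; the helper name is hypothetical:

```python
import json

# Event names mentioned above.
WATCHED_EVENTS = frozenset({"fallback_triggered", "provider_cooldown"})


def find_events(log_lines, watched=WATCHED_EVENTS):
    """Return parsed JSON log records whose 'event' field is in watched."""
    hits = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup banners
        if isinstance(record, dict) and record.get("event") in watched:
            hits.append(record)
    return hits


# Usage: save logs with `docker logs formicos-colony > colony.log 2>&1`, then
#   with open("colony.log") as fh:
#       for rec in find_events(fh):
#           print(rec)
```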

First boot shows no welcome message

The welcome message only appears when projections.last_seq == 0 (fresh database)
If you've run before, the database already has events -- delete the volume to reset: docker compose down -v && docker compose up -d

Persistence Rules

SQLite (event store)

Use named Docker volumes (the default formicos-data volume).
Never bind-mount the SQLite database on macOS or Windows Docker Desktop. Docker Desktop's filesystem translation layer does not support the POSIX shared-memory semantics that SQLite WAL mode requires. Bind-mounting will cause silent corruption or locking failures under load.
Keep .db, .db-wal, and .db-shm on the same filesystem. Moving just the .db without its WAL companions loses any committed transactions that have not yet been checkpointed into the main database file.
FormicOS is a single-writer system. Do not run multiple instances against the same SQLite file.
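
When a copy of the database is needed, SQLite's online backup API avoids the WAL pitfall above because it reads through SQLite and so includes data still sitting in the -wal file. A minimal sketch using Python's standard sqlite3 module; the paths are placeholders, since the event store's filename inside the volume is not specified here:

```python
import sqlite3


def snapshot_sqlite(src_path: str, dest_path: str) -> None:
    """Copy a SQLite database, including un-checkpointed WAL data."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup produces a consistent copy of the whole database.
        src.backup(dest)
    finally:
        dest.close()
        src.close()


# Usage (placeholder paths):
#   snapshot_sqlite("/data/<event-store>.db", "/backup/event-store-copy.db")
```

Run it somewhere the database is on a native filesystem (for example inside a container that mounts the named volume), not through a Docker Desktop bind mount.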

Qdrant (vector store)

Uses named Docker volume qdrant-data.
Qdrant data is reconstructable from the event store — if you lose the volume, restart FormicOS and embeddings re-index from events on replay.
Qdrant v1.16.2 is required for server-side BM25 hybrid search support.

Maintenance

Backup

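The event store is a single SQLite file. Back up the formicos-data volume. Confirm the volume's full name with docker volume ls first, because Compose prefixes volume names with the project name unless the compose file sets an explicit name. For a consistent snapshot, stop the formicos-colony container before copying so that no writes land mid-archive.

docker run --rm -v formicos-data:/data -v "$(pwd)":/backup \
  alpine tar czf "/backup/formicos-backup-$(date +%Y%m%d).tar.gz" -C / data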

Reset

To start fresh (destroys all data):

docker compose down -v
docker compose up -d

Logs

docker logs formicos-colony          # Application logs (structlog JSON)
docker logs formicos-llm             # LLM inference logs
docker logs formicos-embed           # Embedding sidecar logs
docker logs formicos-qdrant          # Vector store logs

Updating

git pull
docker compose build formicos
docker compose up -d

Stream C Smoke Findings (2026-03-15)

Frontend bundle staleness after rebuild

After modifying frontend source and running cd frontend && npm run build, the Docker container still serves the old bundle until you rebuild the Docker image:

docker compose build formicos
docker compose up -d formicos

Symptom: The served JS filename (visible in curl http://localhost:8080/) doesn't match the local frontend/dist/assets/index-*.js filename.
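
The symptom check can be automated by comparing the bundle name in the served HTML against the local build output. A sketch under two assumptions: the page references the entry bundle as assets/index-<hash>.js, and the local build lives in frontend/dist. The function names are illustrative.

```python
import re
from pathlib import Path

# Matches the hashed entry bundle, e.g. "assets/index-abc123.js".
BUNDLE_RE = re.compile(r"assets/(index-[\w.-]+\.js)")


def served_bundle(html: str):
    """Return the index-*.js filename referenced by the served HTML, or None."""
    match = BUNDLE_RE.search(html)
    return match.group(1) if match else None


def local_bundles(dist_dir: str = "frontend/dist") -> set:
    """Return the index-*.js filenames present in the local build output."""
    return {p.name for p in Path(dist_dir, "assets").glob("index-*.js")}


def is_stale(html: str, dist_dir: str = "frontend/dist") -> bool:
    """True when the served bundle is not among the local build's bundles."""
    return served_bundle(html) not in local_bundles(dist_dir)


# Usage: fetch http://localhost:8080/ (for example with urllib.request),
# decode the body to text, and pass it to is_stale().
```

Note that is_stale also returns True when no bundle reference is found in the HTML or when the local dist directory is empty, so treat a positive result as a prompt to look, not as proof.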

Provider fallback chain

When a cloud provider returns 401 (invalid API key), the LLM router:

1. Tries the configured model (e.g., anthropic/claude-sonnet-4.6)
2. Falls back to the secondary provider (e.g., gemini/gemini-2.5-flash)
3. Falls back to local (llama-cpp/gpt-4)
4. Activates a cooldown (120s) for the failing provider

Check logs: docker logs formicos-colony 2>&1 | grep fallback

The colony continues running on fallback models. Quality may vary with local models (e.g., convergence stalls, repeated outputs).
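
The fallback behavior described in this section can be pictured as an ordered chain with per-provider cooldowns. The class below is a toy model for building intuition and is not FormicOS's router; the class and method names are hypothetical, and only the 120-second cooldown comes from the text above.

```python
import time

COOLDOWN_SECONDS = 120  # cooldown duration stated above


class FallbackRouter:
    """Toy model of an ordered provider chain with per-provider cooldowns."""

    def __init__(self, chain, clock=time.monotonic):
        self.chain = list(chain)      # preferred provider first
        self.clock = clock            # injectable so tests can control time
        self.cooldown_until = {}      # provider -> time it becomes usable again

    def mark_failed(self, provider):
        """Put a provider on cooldown after an error such as HTTP 401."""
        self.cooldown_until[provider] = self.clock() + COOLDOWN_SECONDS

    def pick(self):
        """Return the first provider in the chain that is not cooling down."""
        now = self.clock()
        for provider in self.chain:
            if self.cooldown_until.get(provider, 0.0) <= now:
                return provider
        return None  # every provider is currently on cooldown


router = FallbackRouter(["anthropic", "gemini", "llama-cpp"])
router.mark_failed("anthropic")
print(router.pick())
```

In this model a provider with an invalid key is retried once its cooldown expires, which would explain recurring fallback entries in the logs every couple of minutes until the key is fixed.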

CasteSlot event format migration

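If upgrading from a pre-Wave 14 event store, note that the ColonySpawned event format changed from caste_names: list[str] to castes: list[CasteSlot]. Old events fail Pydantic validation on replay. Reset the volume to fix this, which discards the existing event history. The volume name below carries the Compose project prefix, so confirm yours with docker volume ls: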

docker compose down
docker volume rm formicosa_formicos-data
docker compose up -d

WebSocket subscription required for event fan-out

The WS client must send a subscribe action before receiving events:

{"action": "subscribe", "workspaceId": "default", "payload": {}}

Without subscribing, the client receives the initial state snapshot but no subsequent events. The frontend handles this automatically on connect.
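
For custom clients, the subscribe frame can be built with a small helper. This sketch only constructs the message shown above; the WebSocket endpoint path is not given in this runbook, so connecting is left to whichever client library is in use:

```python
import json


def subscribe_message(workspace_id: str = "default") -> str:
    """Serialize the subscribe action for the given workspace."""
    return json.dumps(
        {"action": "subscribe", "workspaceId": workspace_id, "payload": {}}
    )


print(subscribe_message())
```

Send the returned string as the first text frame after the connection opens.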

Knowledge extraction is archivist-shaped

The strongest structured knowledge extraction path uses the archivist model and archivist-oriented colonies. If your first few colonies do not include strong archivist participation, the Knowledge view may stay sparse. Use templates like full-stack, research-heavy, or documentation to exercise the full knowledge pipeline.

Colony Output Quality — Tuning Debt (Wave 24)

Low-quality colony output under the local stack is model and sandbox tuning debt. It is not evidence that A2A, AG-UI, or any Wave 24 protocol/UX work has regressed. The protocol surfaces (submit, poll, attach, result, transcript) are working correctly; what varies is the quality of work the colony produces given a particular model, task, and sandbox configuration.

Likely causes

Local-model reasoning quality. Smaller quantized models (Q4_K_M) may struggle with complex multi-step tasks, tool-call formatting, or code generation that a cloud model handles easily.
Sandbox/runtime quirks. Code execution tasks may hit __main__ packaging assumptions, missing imports in the sandbox, or filesystem restrictions that cause agent code to fail silently.
Task/model mismatch. Code-heavy or multi-file tasks assigned to a single-agent minimal template on a local model will underperform compared to a full-stack template with cloud-tier models.

What to try

Inspect the transcript and failure_context first. The transcript (GET /api/v1/colonies/{id}/transcript or GET /a2a/tasks/{id}/result) now includes structured failure_context for failed and killed colonies. Start here before assuming a protocol problem.
Compare local vs cloud model assignment. Run the same task with a cloud model (rapid-prototype template or set ANTHROPIC_API_KEY) and compare output quality. If the cloud run succeeds, the issue is model capability, not infrastructure.
Reduce task ambiguity. Vague descriptions produce vague output. Tighten the task description or choose a more specific template.
Validate sandbox assumptions. Run smaller code_execute-style tasks to isolate whether the sandbox itself is the bottleneck.
Treat this as tuning/benchmarking debt. File it as a model-selection or template-tuning issue. Do not open architecture or protocol changes to address output quality.
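
The first step in the list above, inspecting failure_context, can be scripted. A sketch using the transcript endpoint named earlier; it assumes failure_context is a top-level key of the JSON response, which this runbook does not confirm, and the function names are illustrative:

```python
import json
import urllib.request
from urllib.parse import quote


def transcript_url(colony_id: str, base: str = "http://localhost:8080") -> str:
    """Build the transcript endpoint URL for a colony."""
    return f"{base}/api/v1/colonies/{quote(colony_id, safe='')}/transcript"


def failure_context(transcript: dict):
    """Return the transcript's failure_context, or None if absent.

    Assumes failure_context is a top-level key; adjust if it is nested.
    """
    return transcript.get("failure_context")


def fetch_transcript(colony_id: str, timeout: float = 10.0) -> dict:
    """GET the transcript for a colony and decode its JSON body."""
    with urllib.request.urlopen(transcript_url(colony_id), timeout=timeout) as resp:
        return json.load(resp)


# Usage, with the stack running and a real colony id:
#   print(failure_context(fetch_transcript("<colony-id>")))
```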
