Skip to content

Hal0ai/hal0

hal0

Open-source home AI inference platform

hal0.dev · Install · Docs · Roadmap


hal0 turns a Linux box — ideally a Ryzen AI Max+ 395 with 128 GB of unified memory — into a polished, OpenAI-compatible inference appliance. Hardware-aware slots, prewired chat UI, signed self-update. One command installs the lot.

Every inference workload runs as its own podman container under a hal0-slot@<name>.service unit. hal0-api on :8080 is the sole control plane — it owns slot state machines, dispatches OpenAI-compatible /v1/* requests to the right slot port, and serves the dashboard. No shared inference daemon; no extra process to babysit.

curl -fsSL https://hal0.dev/install.sh | bash

Status: v0.5.0-alpha.1 — container-runtime era. Each slot (chat, embed, rerank, stt, tts, img, NPU trio) runs as a dedicated podman container (hal0-slot@<name>.service). Slot definitions live in /etc/hal0/slots/<name>.toml; backend profiles in /etc/hal0/profiles.toml; the model catalog is registry.toml — the single source of truth for every HuggingFace coordinate and SHA-256 digest. The one-liner seeds the recommended Main slot non-interactively; run hal0 setup anytime to configure models, extensions, and NPU interactively. See PLAN.md §1 for what ships now and the path to v1.0.

Why hal0

Strix Halo native, but not Strix-Halo-only. The probe is UMA-aware on Strix Halo and falls back to portable parsers (/proc/cpuinfo, /proc/meminfo, lspci) on every other host. The platform field on /api/hardware resolves to one of strix-halo, wsl2, lxc, proxmox-kvm, kvm, bare-metal-{nvidia,amd,intel}-gpu, or bare-metal-cpu-only, and the dashboard only labels memory as "unified" when it actually is. The slot-fit warnings size against the real unified pool, not a BAR carve-out. The XDNA NPU is a first-class device — opt-in even at Pro/Max tier, packing three model roles into one process via the FLM trio (see below). 128 GB Ryzen AI Max+ 395 is the reference deployment — not a hopeful port.

One control plane, six modalities, dedicated containers. Each slot runs in isolation — chat, embed, rerank, STT, TTS, and image generation each get their own podman container with its own image, flags, port, and lifecycle. Concurrent workloads don't share a process pool; they share only the GPU through the arbiter. Slot state is persisted to /var/lib/hal0/slots/<name>/state.json and streamed via SSE. Chat + embed alone sustains ~258 tok/s on Strix Halo at the reference ROCm FP4 loadout.

NPU multi-role via the FLM trio. On Strix Halo with FastFlowLM installed, a single FLM process inside the hal0-toolbox-flm container hosts chat + transcription + embedding concurrently on the one AMDXDNA hardware context — ~2 GB NPU memory, gemma3:1b at 40 tok/s + Whisper-V3-Turbo + Embedding-Gemma all coresident. hal0 exposes this as three slots (agent, stt-npu, embed-npu); the npu.toml [npu] table toggles ASR and embed. The host-side FLM .deb is installed for device-sanity probes only — inference runs in the container. See ADR-0009 for why.

Dispatcher with single-flight + OmniRouter tool-calling. Registry-aware routing across local slots and external upstreams (OpenRouter, Anthropic, OpenAI, custom). The client-side OmniRouter loop provides 8 OpenAI tool-calling tools (image generation/editing, TTS, transcription, vision, embed, rerank, route_to_chat) dispatched through hal0 from any chat slot whose model carries the tool-calling label. The set is filtered per request — a tool only appears in the prompt if its target slot is enabled and its model carries the required labels.

Reliability bar. Atomic env file writes. Schema-validated TOML. Structured error envelopes ({"error":{"code":"slot.not_ready",...}}). Cosign-verified self-update with one-flag rollback. Per-type LRU concurrency with active-inference protection — a serving slot cannot be evicted out from under a streaming request.

What's in the box

  • OpenAI-compatible /v1/* API — chat, completions, embeddings, rerank, audio transcriptions, audio speech, image generations, image edits, models. Drop-in for any OpenAI SDK; point your client at http://localhost:8080/v1 and go.
  • Slots — each named target in capabilities.toml carries a type (llm | embedding | reranking | transcription | tts | image), a device (gpu-rocm | gpu-vulkan | cpu | npu | img), a model, plus enabled and optional default. Six seeded slots (chat, embed, rerank, stt, tts, img) plus three NPU slots (npu, stt-npu, embed-npu) when FastFlowLM is installed. User-added slots via hal0 slot create NAME --type TYPE --model MODEL. Slots refer to a profile in /etc/hal0/profiles.toml that pins the container image + flag bundle for that backend.
  • Container runtime — each slot runs as its own podman container under hal0-slot@<name>.service. hal0-api on :8080 is the control plane; slot containers bind loopback ports (8081–8099 + fixed seeds). No shared inference daemon. See docs/operate/container-runtime.md.
  • hal0 setup TUI — after install (or anytime), hal0 setup walks you through storage, Extensions (Apps / Agents), Main and Agent model selection, and NPU opt-in, then downloads models with live progress. Non-interactive: hal0 setup --auto applies recommended defaults without prompting. Other flags: --storage-dir PATH, --no-pull (seed slots without downloading models), --no-extensions. Set HAL0_SKIP_SETUP=1 to skip first-run setup entirely during install.
  • Hardware-aware probe — detects GPU / NPU / unified memory, writes /etc/hal0/hardware.json, surfaces VRAM/RAM fit warnings inline in the slot form and during hal0 setup.
  • Dispatcher — registry-aware routing, cold-cache prefetch, upstream fallback (OpenRouter, Anthropic, OpenAI, custom OpenAI-shaped endpoints). Mix local + remote per-model in one config.
  • Dashboard — React 18 + Vite UI for slot/model management, hardware-aware configuration, live logs, system health, and a built-in chat page (with popout window + reasoning toggle). SSE-backed status + log tail. Journal panel streams per-slot container logs via journald. Dark by default. The slots page splits into Inference | Image Gen tabs: the Image-Gen tab operates the ComfyUI container (live GTT/RAM gauges, queue depth, model inventory) with a gated inference ⇄ generation iGPU switchover behind a blast-radius confirm.
  • Extensions — selectable Apps (Open WebUI) and Agents (Hermes, Pi) that hal0 setup auto-wires into the platform. A future Stacks feature will add runtime-switchable model/slot layouts as a replacement for install-time bundle tiers.
  • OpenWebUI prewired — chat at :3001, zero config. The installer writes openwebui.env pointing at the local hal0 API.
  • OmniRouter (8 tools)generate_image, edit_image, text_to_speech, transcribe_audio, analyze_image (vision), embed_text, rerank_documents, route_to_chat. Dispatched client-side from chat slots; dynamically filtered per request.
  • Image generation, day onePOST /v1/images/generations via ComfyUI in the img slot container (ROCm). hal0 setup lets you select image-gen as part of the initial configuration.
  • Atomic self-update with rollbackhal0 update --channel stable|nightly. Cosign-verified tarballs swap a /usr/lib/hal0/current symlink; --rollback reverts.
  • One-line installcurl -fsSL https://hal0.dev/install.sh | bash seeds the recommended Main slot and writes the first-run sentinel automatically (no synchronous model download, no interaction needed). Run hal0 setup afterward to customize models, apps, and agents. (--models-dir=PATH or HAL0_MODELS_DIR=PATH redirects model pulls off /var/lib/hal0/models). The bootstrap fetches the release manifest, sha256-verifies the tarball, cosign-verifies the signature against the workflow OIDC identity, then hands off to installer/install.sh.
  • One-line Proxmox VE install — on a Proxmox host, bash -c "$(curl -fsSL https://raw.githubusercontent.com/Hal0ai/hal0/main/scripts/proxmox-ve/hal0.sh)" creates an unprivileged Debian 13 LXC and runs the standard bootstrap inside it. --advanced opens whiptail prompts; every parameter has an env-var override (CTID, RAM_MB, STORAGE, …). Hardware-agnostic — Strix Halo passthrough still requires the privileged-LXC recipe. See scripts/proxmox-ve/README.md.

Bundled agents

hal0 ships two MCP servers and one bundled agent app. The MCP servers (/mcp/admin for slot / model / capability / config / hardware / log admin and /mcp/memory for Cognee-backed long-term memory) are reachable by any MCP-speaking client — Claude Code, future RAG services, external scripts. The bundled agent is single-pick at install: pi-coder (CLI shape, installed from Hal0ai/pi-mono fork via @earendil-works/pi-coding-agent on npm) or Hermes-Agent (service shape, installed via the hal0-owned hal0-hermes wrapper; connects to hal0-api via HAL0_INFERENCE_BASE=http://127.0.0.1:8080). Select one during hal0 setup (Extensions step) or any time via hal0 agent install <name>; swap atomically with --switch. Capital-D destructive MCP calls (model_pull, slot_delete, config_write, etc.) gate through a header bell + inbox modal in the dashboard, with CLI parity via hal0 agent approvals {list,approve,deny}. See docs/mcp/overview.md and docs/agents/overview.md.

Backends

Each capability runs in its own container, supervised by hal0-slot@<name>.service. The profile in /etc/hal0/profiles.toml pins the container image and flag bundle for each backend.

Capability Profile / image Device Notes
chat + embed + rerank rocm / rocm-mtp / vulkan ROCm / Vulkan / CPU ROCm FP4 fork baked into the image; MTP via --spec-type draft-mtp
chat + STT + embed (NPU) flm (hal0-toolbox-flm) AMD XDNA (opt-in) FLM trio: one container, [npu] asr/embed toggles, ~2 GB NPU mem
transcription whisper.cpp in toolbox image Vulkan / CPU stt slot
TTS tts (hal0-toolbox-kokoro) CPU [CPU] chip + tooltip in dashboard
image comfyui (hal0-toolbox-comfyui) ROCm Exclusive GPU via arbiter; SD Turbo / Flux-2-Klein-9B

The NPU path is opt-in: the installer places a FastFlowLM .deb on the host for device-sanity probes (flm validate); inference runs inside the hal0-toolbox-flm container image. With FLM present, hal0 setup surfaces the NPU trio opt-in during setup.

For the container-runtime operator reference — service layout, slot TOML fields, profiles, GPU arbiter, and day-2 commands — see docs/operate/container-runtime.md.

Hardware

Linux + systemd is the only hard requirement (installer/install.sh:86). macOS and Windows are not in scope for v1.

Tier Hardware Status
First-class AMD Ryzen AI Max+ 395 ("Strix Halo") with iGPU + XDNA NPU + 128 GB unified Reference deployment. All published perf numbers come from this box.
First-class AMD Ryzen AI Max 385 / 390 with 64 GB unified Same path; small + mid tiers fit, 70B Q4 with shorter context.
Supported NVIDIA RTX 30/40/50 (10–32 GB) CUDA-backed llama-server in the vulkan container profile. Same slot lifecycle, dedicated VRAM instead of UMA.
Supported AMD Radeon RX 7000 / discrete (16–24 GB) ROCm or Vulkan container profiles; same hal0-slot@<name> lifecycle.
Fallback CPU-only x86_64 vulkan profile, CPU path. Usable for tiny models / smoke tests, not the headline experience.

Project layout

hal0/
├── src/hal0/         # Python package (FastAPI API + capability layer + ContainerProvider + CLI)
│   ├── providers/    # ContainerProvider, FLMProvider, ComfyUI, etc.
│   ├── slots/        # slot manager, state machine, GpuArbiter
│   └── omni_router/  # client-side tool-calling loop + tool definitions
├── ui/               # React 18 + TypeScript + Vite + Tailwind 4 dashboard (v3)
├── ui-vue.bak/       # v0.2.1 Vue 3 dashboard preserved verbatim for reference
├── installer/        # install.sh (writes /etc/hal0/, systemd units, hal0-api.service)
│   ├── etc-hal0/     # seed slot TOMLs + profiles.toml
│   └── systemd/      # hal0-agent@ template units
├── tests/            # pytest suite (α unit, β integration, γ release-gate)
├── docs/             # user docs (mirror of hal0.dev/docs); dev docs under docs/internal/
└── PLAN.md           # roadmap + history

The model catalog lives at /var/lib/hal0/registry/registry.toml — the single source of truth for HuggingFace coordinates, SHA-256 digests, and curated filenames. Per-leaf symlinks under the models directory redirect to /mnt/ai-models/ when a separate storage volume is in use.

Quick start (development)

# backend
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
hal0 serve --reload

# frontend
cd ui
npm install
npm run dev

The API lives at http://127.0.0.1:8080, the dashboard at the Vite dev server URL (usually http://127.0.0.1:5173).

Run hal0 doctor any time to re-check pre-flight (systemd / python / disk / ports / NPU probe). hal0 model pull <ref> streams models from Hugging Face into registry.toml under the user.* namespace, and hal0 uninstall [--keep-data] tears down a running install (thin wrapper over installer/uninstall.sh).

Useful day-2 commands:

# service status
systemctl status hal0-api
systemctl list-units 'hal0-slot@*'

# logs
journalctl -fu hal0-api
journalctl -fu 'hal0-slot@*'

# restart a wedged slot container
systemctl restart hal0-slot@chat

Auth posture

Per ADR-0012 (supersedes ADR-0001), hal0 ships with no built-in auth and no bundled TLS. hal0-api binds 0.0.0.0:8080 and treats every request as authenticated by virtue of network reachability — the right posture for a homelab appliance on a trusted LAN.

If hal0 is reachable from anything you don't physically control, front it with an upstream reverse proxy that owns auth + TLS — Traefik, nginx, Cloudflare Tunnel, or your own preferred edge. See docs/operate/auth.mdx for example configs of each.

Proxmox integration (optional)

If hal0 runs inside a Proxmox LXC, the container only sees its own cgroup slice of memory — other tenants, ZFS ARC, and the host kernel draw from the same physical DIMMs as GPU GTT but are invisible from inside. To surface that, drop a read-only PVEAuditor API token into the dashboard's Settings → "Proxmox integration" panel. Once saved, the unified-memory bar swaps to the physical host's DIMM total and adds a muted "Proxmox host" segment for other-tenant + kernel pressure. Token is sensitive and stored 0600 at /etc/hal0/proxmox.json; the API never echoes it back. Bare-metal and VM installs leave the panel off and the dashboard stays quiet.

Roadmap

No dates — items are direction. The closer to the left, the closer to running on your box. Full version at hal0.dev/roadmap.

Shipped (v0.3 / container runtime)

  • Per-slot podman containers — every inference workload runs in its own hal0-slot@<name>.service container; ContainerProvider + profiles.toml replace the old single-daemon model (epic #652, extraction #687, PRs #688–#726)
  • GpuArbiter exclusive image mode — ComfyUI gets the iGPU to itself; LLM GPU slots pause and resume automatically
  • NPU trio via hal0-toolbox-flm container — FLM trio now runs containerised; host FLM .deb is probe-only
  • registry.toml as sole model catalog — no secondary runtime catalog to sync; hal0 model commands and the dashboard are the only safe edit paths
  • Slot-state Prometheus + EventBus journal — per-slot observability via journald (hal0-slot@<name>) and the dashboard Journal panel
  • FLM trio NPU packing — chat + ASR + embed coresident on one AMDXDNA hardware context; toggle via [npu] in the slot TOML
  • OmniRouter client-side tool-calling — 8 tools, dynamic per-request filtering, route_to_chat cross-slot delegation
  • hal0 setup TUI — replaces the web FirstRun picker; seeds the recommended Main slot on install (--auto --no-pull); interactive post-install flow covers storage, Extensions, models, and NPU opt-in
  • hal0 registry import — one-shot v0.1.x → v0.3 registry recovery from a backup tarball
  • Carried forward: OpenAI-compatible /v1/*, portable hardware probe, capability slots + orchestrator, dispatcher with single-flight + cold-cache prefetch, bundled OpenWebUI on :3001, cosign-keyless self-update with rollback, bundled agents (pi-coder / Hermes-Agent) + MCP admin + Cognee memory MCP

Soon

  • GPU-accelerated TTS — kokoro-vulkan or successor; closes the [CPU] chip on the voice slot card
  • Advanced memory — Cognee graph + Memify pipeline enabled, MCP client side of hal0 (agents reach external MCP servers), federated memory across local + remote sources
  • Benchmarks & presets UI — in-dashboard tok/s + latency runs, plus curated loadout presets you can flash onto a fresh install
  • AUR PKGBUILD & Ubuntu PPA — native distro packages on top of the install script; pacman and apt as first-class install paths
  • hal0.local mDNS auto-discovery polish
  • Light mode toggle

Exploring (v1.x +)

  • Multi-host federation — a slot mesh across LAN boxes — primary on the Strix Halo, embed on the workstation, all behind one /v1/* surface
  • Fine-tune & LoRA hot-swap — attach and rotate LoRAs against a warm base model without unloading the underlying weights
  • Per-model rate limits & budgets — cost-style accounting for local inference — cap a chatty agent without taking the whole box down
  • Voice mode end-to-end — agent loop stitched into a hands-free streaming conversation
  • ChatOps adapters — Slack and Matrix bridges as extensions — talk to hal0 from the rooms you already live in

License

Apache 2.0. See LICENSE.

Contributing

The contribution model is still being decided (PLAN.md §16). File issues for discussion; PRs aren't being accepted from outside contributors yet. See CONTRIBUTING.md for the test tiers and the eventual flow.

This project adheres to the Contributor Covenant code of conduct. To report a security vulnerability, see SECURITY.md.

About

Open-source self-hosted home AI inference platform for AMD Strix Halo — multi-backend slots, OpenAI-compatible gateway, Vue 3 + FastAPI + systemd.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors