Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
85 changes: 48 additions & 37 deletions CONTENT_BRIEF.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,12 @@

hal0 is a homelab AI inference platform: the Linux box you already
have in the rack, running real OpenAI-compatible inference. It manages
model **slots** as systemd units with a typed lifecycle state machine,
exposes an **OpenAI-compatible `/v1/*` API**, and ships with a Vue
**dashboard** plus a **prewired OpenWebUI** chat tab. One command
installs on any modern Linux box — Strix Halo iGPU, AMD discrete,
NVIDIA, or CPU — and it's happy in a privileged Proxmox LXC with
GPU/NPU passthrough, behind your Traefik.
model **slots** as podman containers under per-slot systemd units with
a typed lifecycle state machine, exposes an **OpenAI-compatible
`/v1/*` API**, and ships with a React **dashboard** plus a **prewired
OpenWebUI** chat tab. One command installs on any modern Linux box —
Strix Halo iGPU, AMD discrete, NVIDIA, or CPU — and it's happy in a
privileged Proxmox LXC with GPU/NPU passthrough, behind your Traefik.

(Synthesised from `hal0/README.md` lines 1–24 and `hal0/PLAN.md` §1.)

Expand Down Expand Up @@ -62,15 +62,15 @@ Overrides (env vars, from `installer/install.sh`):
`HAL0_NO_PROBE`, per-backend `HAL0_TOOLBOX_IMAGE_*`.

Status caveat: the installer is real and produces a running
`hal0-api`. As of v0.1.1 (2026-05-22) **all six toolbox images**
`hal0-api`. All six toolbox images —
`vulkan`, `rocm`, `flm`, `moonshine`, `kokoro`, `comfyui` — are
published to `ghcr.io/hal0ai/` and pinned by sha256 digest in
`hal0/manifest.json` (`toolbox_images.*.digest`). FLM chat + embed
are surfaced in the picker when XDNA hardware and the local toolbox
image are both present; STT slice deferred. v0.1.1 (2026-05-22) is
the first install that completes end-to-end on non-Strix-Halo hosts:
WSL 2 (with systemd), Proxmox VMs, and bare-metal Linux with discrete
GPUs all probe + wizard cleanly. APIs may shift before v1.0.
`hal0/manifest.json` (`toolbox_images.*.digest`). Slots run as
**podman containers** under per-slot `hal0-slot@<name>.service`
systemd units, orchestrated by hal0-api; container images and tuned
flags come from slot profiles. FLM chat + embed are surfaced in the
picker when XDNA hardware and the local toolbox image are both present.
APIs may shift before v1.0.

### Installer overhaul (2026-05-15)

Expand Down Expand Up @@ -177,9 +177,13 @@ and curated models.
warming → ready → serving ↔ idle → unloading; error sideband).
Persisted to `state.json`, streamed over SSE. (PLAN §5 Tier 3,
`src/hal0/slots/state.py`)
3. **systemd-managed containers** — each slot is an instance of the
`hal0-slot@.service` template unit. The API process never holds a
model in its own memory. (`ARCHITECTURE.md` §Process model)
3. **systemd-managed podman containers** — each slot runs as a podman
container under a `hal0-slot@<name>.service` unit; container images
and tuned flags come from slot profiles. The API process never holds
a model in its own memory. An exclusive GPU arbiter swaps the iGPU
between LLM serving and ComfyUI image generation. The NPU runs a
single FastFlowLM container serving chat + ASR + embeddings.
(`ARCHITECTURE.md` §Process model)
4. **Hardware-aware probe** — detects GPU / NPU / unified memory (UMA
pool on Strix Halo), writes `/etc/hal0/hardware.json`, surfaces
VRAM/RAM fit warnings inline in the slot form. (PLAN §6,
Expand All @@ -191,9 +195,9 @@ and curated models.
6. **Bundled prewired OpenWebUI** — chat UI on `:3001`, zero config:
the installer writes `openwebui.env` pointing at the local hal0 API.
(PLAN §8)
7. **Vue 3 + Tailwind 4 dashboard** — 9 views: Dashboard, Slots,
Models, Hardware, Logs, Settings, Providers, FirstRun, plus error
shell. Dark mode default; SSE for status + log tail. (PLAN §6)
7. **React 18 + TypeScript + Vite dashboard** — 9 views: Dashboard,
Slots, Models, Hardware, Logs, Settings, Providers, FirstRun, plus
error shell. Dark mode default; SSE for status + log tail. (PLAN §6)
8. **Atomic self-update with rollback** — `hal0 update --channel
stable|nightly`; cosign-verified tarballs swap a
`/usr/lib/hal0/current` symlink; `--rollback` reverts. (PLAN §9,
Expand All @@ -211,12 +215,12 @@ and curated models.
The dashboard operates ComfyUI as a containerized generation
engine with a gated inference ⇄ generation iGPU switchover
(hal0 PRs #686/#690, 2026-06-11 — see "Image generation" below).
12. **Optional Caddy reverse proxy with basic_auth + Bearer token POC**
— `install.sh --auth=basic` provisions Caddy, writes a hashed
`basic_auth`, mints a Bearer token, and round-trips a self-test
against `https://${HAL0_HOSTNAME}/api/health` before exiting
(commits `ba79427`, `f62902c`; install.sh lines 294+ and 645+).
Trusted-LAN posture remains the default (`--auth=off`).
12. **Trusted-LAN default — no built-in auth** — hal0-api binds
`0.0.0.0:8080` with no network gate (ADR-0012; Caddy / `--auth=basic`
removed in v0.3). Deploy behind your own Traefik/nginx/Cloudflare
Tunnel when you need authentication or TLS. The Origin allowlist
and HMAC session cookie still gate the chat-proxy WebSocket path.
(`src/hal0/api/routes/agents/_auth.py`)
13. **Proxmox host-pressure segment (LXC deployments)** — drop a
read-only `PVEAuditor` API token into Settings → "Proxmox
integration" and the dashboard's unified-memory bar shows the
Expand All @@ -238,8 +242,8 @@ The five always-present slots (`BUILTIN_SLOTS` in

| Slot | What it does | Default backend |
|---|---|---|
| `primary` | Chat / general LLM (`/v1/chat/completions`, `/v1/completions`) | llama.cpp (Vulkan) |
| `embed` | Embeddings (`/v1/embeddings`) and rerank (`/v1/rerankings`) | llama.cpp (Vulkan) |
| `primary` | Chat / general LLM (`/v1/chat/completions`, `/v1/completions`) | llama-server (Vulkan) |
| `embed` | Embeddings (`/v1/embeddings`) and rerank (`/v1/rerankings`) | llama-server (Vulkan) |
| `stt` | Speech-to-text (`/v1/audio/transcriptions`) | Moonshine |
| `tts` | Text-to-speech (`/v1/audio/speech`) | Kokoro |
| `img` | Image generation (`/v1/images/generations`) | ComfyUI (ROCm) |
Expand Down Expand Up @@ -316,16 +320,22 @@ From `hal0/PLAN.md` §1 + `src/hal0/providers/`:

| Provider | Hardware | What it serves |
|---|---|---|
| **llama.cpp** | Vulkan (default) / ROCm (opt-in) | chat, embed, rerank, vision |
| **FLM** | AMD XDNA NPU (opt-in) | chat / embed / ASR multiplex (chat + embed surfaced to picker today; STT slice deferred) |
| **Moonshine** | CPU / Vulkan | STT (`/v1/audio/transcriptions`) |
| **llama.cpp** (llama-server) | Vulkan (default) / ROCm (opt-in) | chat, embed, rerank, vision |
| **FLM** | AMD XDNA NPU (opt-in) | chat / embed / ASR multiplex — one FastFlowLM container serves all three |
| **Moonshine** | CPU only | STT (`/v1/audio/transcriptions`) |
| **Kokoro** | CPU / Vulkan | TTS (`/v1/audio/speech`) |
| **ComfyUI** | ROCm (Strix Halo iGPU class) | Image gen (`/v1/images/generations`) |
| **ComfyUI** | ROCm (Strix Halo iGPU class) | Image gen (`/v1/images/generations`) — exclusive GPU arbiter swaps iGPU between inference and generation |

All five are first-class in v1. Each provider is a class with
All five are first-class in v1. Every slot runs as a **podman container**
under a `hal0-slot@<name>.service` unit; container images + tuned flags
come from profiles. Each provider is a class with
`build_env() / start_cmd() / health() / infer()` — stateless, swappable
(`ARCHITECTURE.md` §Key boundaries).

**Provider name in TOML**: use `llama-server` (not `llama.cpp`) in slot
TOML files — `_VALID_PROVIDERS` = `{llama-server, flm, moonshine, kokoro}`
(`src/hal0/config/schema.py:89`).

### FLM NPU (AMD XDNA) deep-dive

FLM is live as the NPU provider — opt-in, surfaced in the picker only
Expand Down Expand Up @@ -652,7 +662,7 @@ only `models_dir` raises (see comments at
(`src/hal0/api/routes/models.py:148`,
`src/hal0/registry/detect.py:140`)
- Per-slot live metrics endpoint — `GET /api/slots/metrics` reads
docker container cgroup memory + `ActiveEnterTimestampMonotonic` +
podman container cgroup memory + `ActiveEnterTimestampMonotonic` +
scraped llama-server `/metrics`; falls back to the systemd unit's
own `MemoryCurrent` for native-host slots
(`src/hal0/api/routes/slots.py:376–467`)
Expand Down Expand Up @@ -927,10 +937,11 @@ a stale placeholder.
versioned dir), `/etc/hal0/` (config, preserved across updates),
`/var/lib/hal0/` (models, registry, OpenWebUI state). `HAL0_HOME`
overrides all of the above for dev installs. (PLAN §2)
- **systemd template units** — slots are `hal0-slot@<name>.service`
instances (`packaging/systemd/hal0-slot@.service`), not per-slot
hand-written units. One template, N instances, all rendered from
config.
- **systemd template units + podman containers** — slots are
`hal0-slot@<name>.service` instances that each launch a **podman**
container; not per-slot hand-written units and not Docker Compose.
One template, N instances, all rendered from config. Container images
and flags come from slot profiles managed by hal0-api.
- **Atomic TOML config** — every config write is
`NamedTemporaryFile(delete=False) + os.replace()`; a failed write
leaves the prior file intact (PLAN §5 Tier 1).
Expand Down
8 changes: 4 additions & 4 deletions docs/SITE-FIXES.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,8 @@
- [ ] **Dashboard framework wrong** — "Vue 3" → **React 18 + TypeScript + Vite** · `src/pages/index.astro:64`
- [ ] **Version eyebrow 1–2 minors stale** → `0.3.2-alpha.1` (sweep ALL version strings; ideally template from release manifest) · `src/pages/index.astro:13`, `src/content/docs/docs/index.mdx:16`, `src/content/docs/docs/operate/updates.mdx:13`
- [ ] **Caddy / `--auth=basic` / auto-HTTPS = broken install** — entire auth/TLS subsystem retired (ADR-0012; API binds `0.0.0.0:8080` open). Delete/rewrite to the honest `no-auth-default` story · `src/pages/index.astro:361`, `src/content/docs/docs/operate/auth.mdx:228,288`
- [ ] **Per-slot toolbox containers / `hal0-slot@.service` / 6 published images** — superseded by the single `lemond` runtime; slots are logical name→model mappings. Rewrite slot + provider-matrix docs · `getting-started/install.mdx:154`, `reference/provider-matrix.mdx:31`, `slots/what-is-a-slot.mdx:53`, `custom-slots.mdx:65`
- [ ] **Provider matrix = pre-Lemonade set** (Moonshine/ComfyUI/Kokoro first-class) — rewrite around Lemonade-unified runtime · `reference/provider-matrix.mdx:12,39,62,76,90,99,110`, `api/audio.mdx:15`
- [ ] **Per-slot container runtime story** — slots run as **podman** containers under `hal0-slot@<name>.service` units (not Docker, not lemond); container images + flags from profiles; GPU arbiter for iGPU; NPU = single FastFlowLM container. Rewrite slot + provider-matrix + install docs · `getting-started/install.mdx:66,103,138,156`, `reference/provider-matrix.mdx:31`, `slots/what-is-a-slot.mdx:48-52`, `custom-slots.mdx:65`
- [ ] **Provider matrix uses wrong provider names** — `llama.cpp` → `llama-server` in TOML examples and matrix; `_VALID_PROVIDERS = {llama-server, flm, moonshine, kokoro}` · `reference/provider-matrix.mdx:20`, `hardware/amd-discrete.mdx:69`, `hardware/nvidia.mdx:160`, `reference/config-schema.mdx:50`
- [ ] **Hermes-Agent labelled "(roadmap: soon)" but SHIPPED** — promote to shipped; document install + dashboard chat surface · `src/pages/index.astro:346`

## ⛔ BLOCKED — backend must reconcile code↔docs first (do NOT edit site yet)
Expand Down Expand Up @@ -51,7 +51,7 @@ _(Most marketing-worthy first. Each is real + user-facing.)_
- [ ] **Bundle-tier first-run picker** — Lite/Default/Pro/Max/LMX-Omni, RAM-gated · `api/routes/bundles.py:46-82`
- [ ] **NPU FLM trio** — agent + stt-npu + embed-npu from one `flm serve` · `CONTEXT.md:115-136`
- [ ] **Bundled pi-coder agent** (single-pick with Hermes) · `agents/pi_coder.py`
- [ ] **Lemonade admin panel** — `GET/POST /api/lemonade/config` · `api/routes/lemonade_admin.py:297`
- [ ] **Slot container admin** — expose container-runtime management story (podman slot containers, GPU arbiter, profile-driven image selection) in the docs; `api/routes/lemonade_admin.py` is removed in the lemonade-removal epic
- [ ] **Prometheus metrics endpoint** · `api/routes/health.py:112`
- [ ] **Merged journal SSE** — `/api/journal` + `/stream` + Lemonade log proxy · `api/routes/journal.py:203`
- [ ] **HMAC session cookie** for agents chat proxy (HttpOnly, SameSite=Lax, 8h) · `api/agents/_auth.py`
Expand All @@ -67,7 +67,7 @@ _New items beyond the drift report. Full detail + verification notes in the back

### 🔴 HIGH — OSS / broken copy-paste
- [ ] **B2 — Scrub private IPs/domains from the website repo** · `astro.config.mjs:93` (`allowedHosts ['.thinmint.dev']` → localhost) · `operate/openwebui.mdx:79,81` (`hal0.thinmint.dev` → `hal0.local`) · `operate/auth.mdx:31,331,337` (`10.0.1.230` — goes away with B1)
- [ ] **B3 — Fix invalid `--provider` / `provider =` examples (they error on validation)** · valid set `{lemonade,llama-server,flm,moonshine,kokoro}` · `hardware/amd-discrete.mdx:69`, `hardware/nvidia.mdx:160`, `reference/config-schema.mdx:50`
- [ ] **B3 — Fix invalid `--provider` / `provider =` examples (they error on validation)** · valid set `{llama-server,flm,moonshine,kokoro}` (lemonade removed in phase-F epic) · `hardware/amd-discrete.mdx:69`, `hardware/nvidia.mdx:160`, `reference/config-schema.mdx:50`

### 🟡 MEDIUM
- [ ] **B6 — Remove `hal0-slot@.service` from install step 5** (template removed PR-9) · `getting-started/install.mdx:137-148`
Expand Down
47 changes: 21 additions & 26 deletions src/content/docs/docs/getting-started/install.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -60,10 +60,10 @@ below).
## Deployment shapes

hal0 doesn't care which slice of your homelab it runs on, as long as
the kernel speaks systemd and Docker. Three shapes worth naming:
the kernel speaks systemd and podman. Three shapes worth naming:

- **Bare-metal Linux.** The simplest case. `hal0-api`, `hal0-openwebui`,
and the per-slot toolbox containers all live on the host.
and the per-slot podman containers all live on the host.
- **VM.** Works if you give it enough RAM and pass the GPU through.
Eats more host memory than an LXC for the same workload.
- **Privileged LXC on Proxmox (recommended for homelabs).** GPU and
Expand Down Expand Up @@ -99,8 +99,9 @@ loads one.

2. **x86_64.** ARM is not currently supported.

3. **Docker reachable.** The slot toolboxes run as containers. The
installer checks `docker ps` before doing anything destructive.
3. **Podman reachable.** Slot inference containers run under podman.
The installer checks podman availability before doing anything
destructive.

4. **At least 20 GB free under `/var/lib`.** Models, registry, and
OpenWebUI state land there. Override via `--models-dir=/path` if
Expand All @@ -113,7 +114,7 @@ loads one.
## What the installer does

<Steps>
1. **Pre-flight checks.** Verifies systemd, x86\_64, Docker, free
1. **Pre-flight checks.** Verifies systemd, x86\_64, podman, free
space, and free ports. Bails before touching anything if a check
fails. Re-runnable on its own as `hal0 doctor`.

Expand All @@ -134,10 +135,11 @@ loads one.
pull a model and flip it on yourself. Existing files are never
overwritten on re-run.

5. **systemd units.** Drops `hal0-api.service`,
`hal0-openwebui.service`, and the `hal0-slot@.service` template into
`/etc/systemd/system/`. Reloads the daemon, enables and starts the
API plus OpenWebUI.
5. **systemd units.** Drops `hal0-api.service` and
`hal0-openwebui.service` into `/etc/systemd/system/`. The
`hal0-slot@.service` template is written separately for per-slot
podman containers. Reloads the daemon, enables and starts the API
plus OpenWebUI.

6. **`hal0` on PATH.** Symlinks `/usr/local/bin/hal0` →
`${VENV_DIR}/bin/hal0` so the CLI works without sourcing anything.
Expand All @@ -153,7 +155,7 @@ The installer needs `sudo` to drop systemd unit files in
`/etc/systemd/system/`, create the `/usr/lib/hal0/` tree, and
optionally write the `/usr/local/bin/hal0` symlink. The `hal0-api`
service itself runs as the unprivileged `hal0` user, as do the
per-slot toolbox containers (each declares `User=hal0` in the
per-slot podman containers (each declares `User=hal0` in the
template unit).
</Aside>

Expand Down Expand Up @@ -209,22 +211,15 @@ then `/var/lib/hal0/models`. The path is persisted as
`[models].pull_root` and auto-included in `[models].roots` so a fresh
install scans the existing tree on first boot.

## Authentication & HTTPS (optional)
## Authentication & HTTPS

The default install has no auth in front. Fine on a trusted home LAN
behind your Traefik or Caddy. For exposed deployments add
`--auth=basic`:

```sh frame="terminal"
sudo bash installer/install.sh --auth=basic
```

The installer prompts for an admin user and password, installs Caddy,
generates a TLS cert (self-signed for `.local` hostnames, Let's
Encrypt for real domains), mints a Bearer token, and round-trips
`https://${HAL0_HOSTNAME}/api/health` as a self-test before exiting.
See [Authentication & HTTPS](/docs/operate/auth/) for the full flow
including client-side cert trust and ACME setup.
hal0-api binds `0.0.0.0:8080` with no built-in auth or TLS. On a
trusted home LAN behind your Traefik this is the correct default.
For exposed deployments, put a reverse proxy (Traefik, nginx,
Cloudflare Tunnel) in front and handle auth and TLS there.
See [Authentication & Security](/docs/operate/auth/) for recommended
patterns including WebSocket passthrough and the MCP transport
allowlists.

## From a clone

Expand Down Expand Up @@ -258,7 +253,7 @@ hal0 doctor
```

Re-runs the pre-flight pack against the live host. Handy after a
kernel upgrade, a Docker daemon swap, or whenever something feels off.
kernel upgrade, a podman update, or whenever something feels off.

## Next steps

Expand Down
Loading