From bafcec19cf2aaa64be068b93d5d2348f18283f48 Mon Sep 17 00:00:00 2001
From: "Alexander (via Claude)"
Date: Fri, 12 Jun 2026 02:30:46 -0400
Subject: [PATCH] =?UTF-8?q?content:=20lemonade=20=E2=86=92=20container-run?=
 =?UTF-8?q?time=20sweep=20(hal0=20Phase=20F)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- CONTENT_BRIEF: podman per-slot containers, llama-server toolbox images, GPU arbiter + FLM trio, honest no-auth posture, React 18
- install/provider-matrix/what-is-a-slot/index: podman + container claims; Caddy/basic-auth copy replaced with reverse-proxy-at-edge
- SITE-FIXES tracker items updated to the container story

Co-Authored-By: Claude Fable 5
---
 CONTENT_BRIEF.md | 85 +++++++++++--------
 docs/SITE-FIXES.md | 8 +-
 .../docs/docs/getting-started/install.mdx | 47 +++++-----
 .../docs/docs/reference/provider-matrix.mdx | 58 +++++++------
 .../docs/docs/slots/what-is-a-slot.mdx | 33 +++----
 src/pages/index.astro | 10 +--
 6 files changed, 126 insertions(+), 115 deletions(-)

diff --git a/CONTENT_BRIEF.md b/CONTENT_BRIEF.md
index 27177fb..b1153a2 100644
--- a/CONTENT_BRIEF.md
+++ b/CONTENT_BRIEF.md
@@ -12,12 +12,12 @@

 hal0 is a homelab AI inference platform: the Linux box you already
 have in the rack, running real OpenAI-compatible inference. It manages
-model **slots** as systemd units with a typed lifecycle state machine,
-exposes an **OpenAI-compatible `/v1/*` API**, and ships with a Vue
-**dashboard** plus a **prewired OpenWebUI** chat tab. One command
-installs on any modern Linux box — Strix Halo iGPU, AMD discrete,
-NVIDIA, or CPU — and it's happy in a privileged Proxmox LXC with
-GPU/NPU passthrough, behind your Traefik.
+model **slots** as podman containers under per-slot systemd units with
+a typed lifecycle state machine, exposes an **OpenAI-compatible
+`/v1/*` API**, and ships with a React **dashboard** plus a **prewired
+OpenWebUI** chat tab. One command installs on any modern Linux box —
+Strix Halo iGPU, AMD discrete, NVIDIA, or CPU — and it's happy in a
+privileged Proxmox LXC with GPU/NPU passthrough, behind your Traefik.

 (Synthesised from `hal0/README.md` lines 1–24 and `hal0/PLAN.md` §1.)

@@ -62,15 +62,15 @@ Overrides (env vars, from `installer/install.sh`):
 `HAL0_NO_PROBE`, per-backend `HAL0_TOOLBOX_IMAGE_*`.

 Status caveat: the installer is real and produces a running
-`hal0-api`. As of v0.1.1 (2026-05-22) **all six toolbox images** —
+`hal0-api`. All six toolbox images —
 `vulkan`, `rocm`, `flm`, `moonshine`, `kokoro`, `comfyui` — are
 published to `ghcr.io/hal0ai/` and pinned by sha256 digest in
-`hal0/manifest.json` (`toolbox_images.*.digest`). FLM chat + embed
-are surfaced in the picker when XDNA hardware and the local toolbox
-image are both present; STT slice deferred. v0.1.1 (2026-05-22) is
-the first install that completes end-to-end on non-Strix-Halo hosts:
-WSL 2 (with systemd), Proxmox VMs, and bare-metal Linux with discrete
-GPUs all probe + wizard cleanly. APIs may shift before v1.0.
+`hal0/manifest.json` (`toolbox_images.*.digest`). Slots run as
+**podman containers** under per-slot `hal0-slot@.service`
+systemd units, orchestrated by hal0-api; container images and tuned
+flags come from slot profiles. FLM chat + embed are surfaced in the
+picker when XDNA hardware and the local toolbox image are both present.
+APIs may shift before v1.0.

 ### Installer overhaul (2026-05-15)

@@ -177,9 +177,13 @@ and curated models.
    warming → ready → serving ↔ idle → unloading; error sideband).
    Persisted to `state.json`, streamed over SSE. (PLAN §5 Tier 3,
    `src/hal0/slots/state.py`)
-3. **systemd-managed containers** — each slot is an instance of the
-   `hal0-slot@.service` template unit. The API process never holds a
-   model in its own memory. (`ARCHITECTURE.md` §Process model)
+3. **systemd-managed podman containers** — each slot runs as a podman
+   container under a `hal0-slot@.service` unit; container images
+   and tuned flags come from slot profiles. The API process never holds
+   a model in its own memory. An exclusive GPU arbiter swaps the iGPU
+   between LLM serving and ComfyUI image generation. The NPU runs a
+   single FastFlowLM container serving chat + ASR + embeddings.
+   (`ARCHITECTURE.md` §Process model)
 4. **Hardware-aware probe** — detects GPU / NPU / unified memory (UMA
    pool on Strix Halo), writes `/etc/hal0/hardware.json`, surfaces
    VRAM/RAM fit warnings inline in the slot form. (PLAN §6,
@@ -191,9 +195,9 @@ and curated models.
 6. **Bundled prewired OpenWebUI** — chat UI on `:3001`, zero config:
    the installer writes `openwebui.env` pointing at the local hal0 API.
    (PLAN §8)
-7. **Vue 3 + Tailwind 4 dashboard** — 9 views: Dashboard, Slots,
-   Models, Hardware, Logs, Settings, Providers, FirstRun, plus error
-   shell. Dark mode default; SSE for status + log tail. (PLAN §6)
+7. **React 18 + TypeScript + Vite dashboard** — 9 views: Dashboard,
+   Slots, Models, Hardware, Logs, Settings, Providers, FirstRun, plus
+   error shell. Dark mode default; SSE for status + log tail. (PLAN §6)
 8. **Atomic self-update with rollback** — `hal0 update --channel
    stable|nightly`; cosign-verified tarballs swap a
    `/usr/lib/hal0/current` symlink; `--rollback` reverts. (PLAN §9,
@@ -211,12 +215,12 @@ and curated models.
     The dashboard operates ComfyUI as a containerized generation engine
     with a gated inference ⇄ generation iGPU switchover (hal0 PRs
     #686/#690, 2026-06-11 — see "Image generation" below).
-12. **Optional Caddy reverse proxy with basic_auth + Bearer token POC**
-    — `install.sh --auth=basic` provisions Caddy, writes a hashed
-    `basic_auth`, mints a Bearer token, and round-trips a self-test
-    against `https://${HAL0_HOSTNAME}/api/health` before exiting
-    (commits `ba79427`, `f62902c`; install.sh lines 294+ and 645+).
-    Trusted-LAN posture remains the default (`--auth=off`).
+12. **Trusted-LAN default — no built-in auth** — hal0-api binds
+    `0.0.0.0:8080` with no network gate (ADR-0012; Caddy / `--auth=basic`
+    removed in v0.3). Deploy behind your own Traefik/nginx/Cloudflare
+    Tunnel when you need authentication or TLS. The Origin allowlist
+    and HMAC session cookie still gate the chat-proxy WebSocket path.
+    (`src/hal0/api/routes/agents/_auth.py`)
 13. **Proxmox host-pressure segment (LXC deployments)** — drop a
     read-only `PVEAuditor` API token into Settings → "Proxmox
     integration" and the dashboard's unified-memory bar shows the
@@ -238,8 +242,8 @@ The five always-present slots (`BUILTIN_SLOTS` in

 | Slot | What it does | Default backend |
 |---|---|---|
-| `primary` | Chat / general LLM (`/v1/chat/completions`, `/v1/completions`) | llama.cpp (Vulkan) |
-| `embed` | Embeddings (`/v1/embeddings`) and rerank (`/v1/rerankings`) | llama.cpp (Vulkan) |
+| `primary` | Chat / general LLM (`/v1/chat/completions`, `/v1/completions`) | llama-server (Vulkan) |
+| `embed` | Embeddings (`/v1/embeddings`) and rerank (`/v1/rerankings`) | llama-server (Vulkan) |
 | `stt` | Speech-to-text (`/v1/audio/transcriptions`) | Moonshine |
 | `tts` | Text-to-speech (`/v1/audio/speech`) | Kokoro |
 | `img` | Image generation (`/v1/images/generations`) | ComfyUI (ROCm) |
@@ -316,16 +320,22 @@ From `hal0/PLAN.md` §1 + `src/hal0/providers/`:

 | Provider | Hardware | What it serves |
 |---|---|---|
-| **llama.cpp** | Vulkan (default) / ROCm (opt-in) | chat, embed, rerank, vision |
-| **FLM** | AMD XDNA NPU (opt-in) | chat / embed / ASR multiplex (chat + embed surfaced to picker today; STT slice deferred) |
-| **Moonshine** | CPU / Vulkan | STT (`/v1/audio/transcriptions`) |
+| **llama.cpp** (llama-server) | Vulkan (default) / ROCm (opt-in) | chat, embed, rerank, vision |
+| **FLM** | AMD XDNA NPU (opt-in) | chat / embed / ASR multiplex — one FastFlowLM container serves all three |
+| **Moonshine** | CPU only | STT (`/v1/audio/transcriptions`) |
 | **Kokoro** | CPU / Vulkan | TTS (`/v1/audio/speech`) |
-| **ComfyUI** | ROCm (Strix Halo iGPU class) | Image gen (`/v1/images/generations`) |
+| **ComfyUI** | ROCm (Strix Halo iGPU class) | Image gen (`/v1/images/generations`) — exclusive GPU arbiter swaps iGPU between inference and generation |

-All five are first-class in v1. Each provider is a class with
+All five are first-class in v1. Every slot runs as a **podman container**
+under a `hal0-slot@.service` unit; container images + tuned flags
+come from profiles. Each provider is a class with
 `build_env() / start_cmd() / health() / infer()` — stateless, swappable
 (`ARCHITECTURE.md` §Key boundaries).

+**Provider name in TOML**: use `llama-server` (not `llama.cpp`) in slot
+TOML files — `_VALID_PROVIDERS` = `{llama-server, flm, moonshine, kokoro}`
+(`src/hal0/config/schema.py:89`).
+
 ### FLM NPU (AMD XDNA) deep-dive

 FLM is live as the NPU provider — opt-in, surfaced in the picker only
@@ -652,7 +662,7 @@ only `models_dir` raises (see comments at
   (`src/hal0/api/routes/models.py:148`,
   `src/hal0/registry/detect.py:140`)
 - Per-slot live metrics endpoint — `GET /api/slots/metrics` reads
-  docker container cgroup memory + `ActiveEnterTimestampMonotonic` +
+  podman container cgroup memory + `ActiveEnterTimestampMonotonic` +
   scraped llama-server `/metrics`; falls back to the systemd unit's own
   `MemoryCurrent` for native-host slots
   (`src/hal0/api/routes/slots.py:376–467`)
@@ -927,10 +937,11 @@ a stale placeholder.
   versioned dir), `/etc/hal0/` (config, preserved across updates),
   `/var/lib/hal0/` (models, registry, OpenWebUI state). `HAL0_HOME`
   overrides all of the above for dev installs. (PLAN §2)
-- **systemd template units** — slots are `hal0-slot@.service`
-  instances (`packaging/systemd/hal0-slot@.service`), not per-slot
-  hand-written units. One template, N instances, all rendered from
-  config.
+- **systemd template units + podman containers** — slots are
+  `hal0-slot@.service` instances that each launch a **podman**
+  container; not per-slot hand-written units and not Docker Compose.
+  One template, N instances, all rendered from config. Container images
+  and flags come from slot profiles managed by hal0-api.
 - **Atomic TOML config** — every config write is
   `NamedTemporaryFile(delete=False) + os.replace()`; a failed write
   leaves the prior file intact (PLAN §5 Tier 1).
diff --git a/docs/SITE-FIXES.md b/docs/SITE-FIXES.md
index 1d3675f..84b78a9 100644
--- a/docs/SITE-FIXES.md
+++ b/docs/SITE-FIXES.md
@@ -18,8 +18,8 @@
 - [ ] **Dashboard framework wrong** — "Vue 3" → **React 18 + TypeScript + Vite** · `src/pages/index.astro:64`
 - [ ] **Version eyebrow 1–2 minors stale** → `0.3.2-alpha.1` (sweep ALL version strings; ideally template from release manifest) · `src/pages/index.astro:13`, `src/content/docs/docs/index.mdx:16`, `src/content/docs/docs/operate/updates.mdx:13`
 - [ ] **Caddy / `--auth=basic` / auto-HTTPS = broken install** — entire auth/TLS subsystem retired (ADR-0012; API binds `0.0.0.0:8080` open). Delete/rewrite to the honest `no-auth-default` story · `src/pages/index.astro:361`, `src/content/docs/docs/operate/auth.mdx:228,288`
-- [ ] **Per-slot toolbox containers / `hal0-slot@.service` / 6 published images** — superseded by the single `lemond` runtime; slots are logical name→model mappings. Rewrite slot + provider-matrix docs · `getting-started/install.mdx:154`, `reference/provider-matrix.mdx:31`, `slots/what-is-a-slot.mdx:53`, `custom-slots.mdx:65`
-- [ ] **Provider matrix = pre-Lemonade set** (Moonshine/ComfyUI/Kokoro first-class) — rewrite around Lemonade-unified runtime · `reference/provider-matrix.mdx:12,39,62,76,90,99,110`, `api/audio.mdx:15`
+- [ ] **Per-slot container runtime story** — slots run as **podman** containers under `hal0-slot@.service` units (not Docker, not lemond); container images + flags from profiles; GPU arbiter for iGPU; NPU = single FastFlowLM container. Rewrite slot + provider-matrix + install docs · `getting-started/install.mdx:66,103,138,156`, `reference/provider-matrix.mdx:31`, `slots/what-is-a-slot.mdx:48-52`, `custom-slots.mdx:65`
+- [ ] **Provider matrix uses wrong provider names** — `llama.cpp` → `llama-server` in TOML examples and matrix; `_VALID_PROVIDERS = {llama-server, flm, moonshine, kokoro}` · `reference/provider-matrix.mdx:20`, `hardware/amd-discrete.mdx:69`, `hardware/nvidia.mdx:160`, `reference/config-schema.mdx:50`
 - [ ] **Hermes-Agent labelled "(roadmap: soon)" but SHIPPED** — promote to shipped; document install + dashboard chat surface · `src/pages/index.astro:346`

 ## ⛔ BLOCKED — backend must reconcile code↔docs first (do NOT edit site yet)
@@ -51,7 +51,7 @@ _(Most marketing-worthy first. Each is real + user-facing.)_
 - [ ] **Bundle-tier first-run picker** — Lite/Default/Pro/Max/LMX-Omni, RAM-gated · `api/routes/bundles.py:46-82`
 - [ ] **NPU FLM trio** — agent + stt-npu + embed-npu from one `flm serve` · `CONTEXT.md:115-136`
 - [ ] **Bundled pi-coder agent** (single-pick with Hermes) · `agents/pi_coder.py`
-- [ ] **Lemonade admin panel** — `GET/POST /api/lemonade/config` · `api/routes/lemonade_admin.py:297`
+- [ ] **Slot container admin** — expose container-runtime management story (podman slot containers, GPU arbiter, profile-driven image selection) in the docs; `api/routes/lemonade_admin.py` is removed in the lemonade-removal epic
 - [ ] **Prometheus metrics endpoint** · `api/routes/health.py:112`
 - [ ] **Merged journal SSE** — `/api/journal` + `/stream` + Lemonade log proxy · `api/routes/journal.py:203`
 - [ ] **HMAC session cookie** for agents chat proxy (HttpOnly, SameSite=Lax, 8h) · `api/agents/_auth.py`
@@ -67,7 +67,7 @@ _New items beyond the drift report. Full detail + verification notes in the back

 ### 🔴 HIGH — OSS / broken copy-paste
 - [ ] **B2 — Scrub private IPs/domains from the website repo** · `astro.config.mjs:93` (`allowedHosts ['.thinmint.dev']` → localhost) · `operate/openwebui.mdx:79,81` (`hal0.thinmint.dev` → `hal0.local`) · `operate/auth.mdx:31,331,337` (`10.0.1.230` — goes away with B1)
-- [ ] **B3 — Fix invalid `--provider` / `provider =` examples (they error on validation)** · valid set `{lemonade,llama-server,flm,moonshine,kokoro}` · `hardware/amd-discrete.mdx:69`, `hardware/nvidia.mdx:160`, `reference/config-schema.mdx:50`
+- [ ] **B3 — Fix invalid `--provider` / `provider =` examples (they error on validation)** · valid set `{llama-server,flm,moonshine,kokoro}` (lemonade removed in phase-F epic) · `hardware/amd-discrete.mdx:69`, `hardware/nvidia.mdx:160`, `reference/config-schema.mdx:50`

 ### 🟡 MEDIUM
 - [ ] **B6 — Remove `hal0-slot@.service` from install step 5** (template removed PR-9) · `getting-started/install.mdx:137-148`
diff --git a/src/content/docs/docs/getting-started/install.mdx b/src/content/docs/docs/getting-started/install.mdx
index ccb6a23..f80bc6a 100644
--- a/src/content/docs/docs/getting-started/install.mdx
+++ b/src/content/docs/docs/getting-started/install.mdx
@@ -60,10 +60,10 @@ below).
 ## Deployment shapes

 hal0 doesn't care which slice of your homelab it runs on, as long as
-the kernel speaks systemd and Docker. Three shapes worth naming:
+the kernel speaks systemd and podman. Three shapes worth naming:

 - **Bare-metal Linux.** The simplest case. `hal0-api`, `hal0-openwebui`,
-  and the per-slot toolbox containers all live on the host.
+  and the per-slot podman containers all live on the host.
 - **VM.** Works if you give it enough RAM and pass the GPU through. Eats
   more host memory than an LXC for the same workload.
 - **Privileged LXC on Proxmox (recommended for homelabs).** GPU and
@@ -99,8 +99,9 @@ loads one.

 2. **x86_64.** ARM is not currently supported.

-3. **Docker reachable.** The slot toolboxes run as containers. The
-   installer checks `docker ps` before doing anything destructive.
+3. **Podman reachable.** Slot inference containers run under podman.
+   The installer checks podman availability before doing anything
+   destructive.

 4. **At least 20 GB free under `/var/lib`.** Models, registry, and
    OpenWebUI state land there. Override via `--models-dir=/path` if
@@ -113,7 +114,7 @@ loads one.

 ## What the installer does

-1. **Pre-flight checks.** Verifies systemd, x86\_64, Docker, free
+1. **Pre-flight checks.** Verifies systemd, x86\_64, podman, free
    space, and free ports. Bails before touching anything if a check
    fails. Re-runnable on its own as `hal0 doctor`.

@@ -134,10 +135,11 @@ loads one.
    pull a model and flip it on yourself. Existing files are never
    overwritten on re-run.

-5. **systemd units.** Drops `hal0-api.service`,
-   `hal0-openwebui.service`, and the `hal0-slot@.service` template into
-   `/etc/systemd/system/`. Reloads the daemon, enables and starts the
-   API plus OpenWebUI.
+5. **systemd units.** Drops `hal0-api.service` and
+   `hal0-openwebui.service` into `/etc/systemd/system/`. The
+   `hal0-slot@.service` template is written separately for per-slot
+   podman containers. Reloads the daemon, enables and starts the API
+   plus OpenWebUI.

 6. **`hal0` on PATH.** Symlinks `/usr/local/bin/hal0` →
    `${VENV_DIR}/bin/hal0` so the CLI works without sourcing anything.
@@ -153,7 +155,7 @@ The installer needs `sudo` to drop systemd unit files in
 `/etc/systemd/system/`, create the `/usr/lib/hal0/` tree, and
 optionally write the `/usr/local/bin/hal0` symlink. The `hal0-api`
 service itself runs as the unprivileged `hal0` user, as do the
-per-slot toolbox containers (each declares `User=hal0` in the
+per-slot podman containers (each declares `User=hal0` in the
 template unit).

@@ -209,22 +211,15 @@ then `/var/lib/hal0/models`. The path is persisted as
 `[models].pull_root` and auto-included in `[models].roots` so a fresh
 install scans the existing tree on first boot.

-## Authentication & HTTPS (optional)
+## Authentication & HTTPS

-The default install has no auth in front. Fine on a trusted home LAN
-behind your Traefik or Caddy. For exposed deployments add
-`--auth=basic`:
-
-```sh frame="terminal"
-sudo bash installer/install.sh --auth=basic
-```
-
-The installer prompts for an admin user and password, installs Caddy,
-generates a TLS cert (self-signed for `.local` hostnames, Let's
-Encrypt for real domains), mints a Bearer token, and round-trips
-`https://${HAL0_HOSTNAME}/api/health` as a self-test before exiting.
-See [Authentication & HTTPS](/docs/operate/auth/) for the full flow
-including client-side cert trust and ACME setup.
+hal0-api binds `0.0.0.0:8080` with no built-in auth or TLS. On a
+trusted home LAN behind your Traefik this is the correct default.
+For exposed deployments, put a reverse proxy (Traefik, nginx,
+Cloudflare Tunnel) in front and handle auth and TLS there.
+See [Authentication & Security](/docs/operate/auth/) for recommended
+patterns including WebSocket passthrough and the MCP transport
+allowlists.

 ## From a clone

@@ -258,7 +253,7 @@ hal0 doctor
 ```

 Re-runs the pre-flight pack against the live host. Handy after a
-kernel upgrade, a Docker daemon swap, or whenever something feels off.
+kernel upgrade, a podman update, or whenever something feels off.

 ## Next steps

diff --git a/src/content/docs/docs/reference/provider-matrix.mdx b/src/content/docs/docs/reference/provider-matrix.mdx
index 91faea3..b27d554 100644
--- a/src/content/docs/docs/reference/provider-matrix.mdx
+++ b/src/content/docs/docs/reference/provider-matrix.mdx
@@ -1,40 +1,42 @@
 ---
 title: Provider matrix
-description: llama.cpp, FLM, Moonshine, Kokoro, ComfyUI — what each handles, on what hardware.
+description: llama-server, FLM, Moonshine, Kokoro, ComfyUI — what each handles, on what hardware.
 sidebar:
   order: 3
 ---

 import { Aside } from '@astrojs/starlight/components';

-Five providers ship first-class in v0.1.0-alpha. Each is a class with
-a small contract (`build_env() / start_cmd() / health() / infer()`)
-that makes them stateless and swappable. The slot lifecycle is
-provider-agnostic; what changes between providers is the workload
+Five providers ship first-class. Each runs as a **podman container**
+under a `hal0-slot@.service` unit; container images and tuned
+flags come from slot profiles managed by hal0-api. Each provider is a
+class with a small contract (`build_env() / start_cmd() / health() /
+infer()`) that makes them stateless and swappable. The slot lifecycle
+is provider-agnostic; what changes between providers is the workload
 they serve and the hardware they target.

 ## The matrix

-| Provider | Hardware | What it serves |
-|--------------|---------------------------------------|-----------------------------------------------|
-| **llama.cpp**| Vulkan (default) / ROCm (opt-in) | chat, embed, rerank, vision |
-| **FLM** | AMD XDNA NPU (opt-in) | chat / embed / ASR multiplex |
-| **Moonshine**| CPU only | STT (`/v1/audio/transcriptions`) |
-| **Kokoro** | CPU / Vulkan | TTS (`/v1/audio/speech`) |
-| **ComfyUI** | ROCm (Strix Halo iGPU class) | Image gen (`/v1/images/generations`) |
+| Provider | TOML name | Hardware | What it serves |
+|-------------------|----------------|----------------------------------|-----------------------------------------|
+| **llama-server** | `llama-server` | Vulkan (default) / ROCm (opt-in) | chat, embed, rerank, vision |
+| **FLM** | `flm` | AMD XDNA NPU (opt-in) | chat / embed / ASR multiplex |
+| **Moonshine** | `moonshine` | CPU only | STT (`/v1/audio/transcriptions`) |
+| **Kokoro** | `kokoro` | CPU / Vulkan | TTS (`/v1/audio/speech`) |
+| **ComfyUI** | `comfyui` | ROCm (Strix Halo iGPU class) | Image gen (`/v1/images/generations`) |

-All five are first-class as of v0.1.0-alpha.1. FLM is opt-in because
-XDNA NPU support depends on AMD's driver stack being present (kernel
->= 6.11 with the `amdxdna` driver on the host) and a local FLM toolbox
-image; the picker only advertises NPU when both are detected.
+All five are first-class. FLM is opt-in because XDNA NPU support
+depends on AMD's driver stack being present (kernel >= 6.11 with the
+`amdxdna` driver on the host) and a local FLM toolbox image; the
+picker only advertises NPU when both are detected.

 All six toolbox images are published to `ghcr.io/hal0ai/` with pinned
 sha256 digests in `manifest.json`: `hal0-toolbox-vulkan`,
 `hal0-toolbox-rocm`, `hal0-toolbox-flm`, `hal0-toolbox-moonshine`,
 `hal0-toolbox-kokoro`, `hal0-toolbox-comfyui`. (Six images, five
-providers — llama.cpp ships both a Vulkan and a ROCm toolbox.)
+providers — llama-server ships both a Vulkan and a ROCm toolbox.)

-## llama.cpp
+## llama-server

 The default for `primary` and `embed`. Handles:

@@ -55,7 +57,7 @@ Backend modes:
   where Vulkan leaves performance on the table.

 The CUDA path on NVIDIA uses CUDA-backed llama.cpp through the same
-provider.
+provider. Use `provider = "llama-server"` in slot TOML for all three.

 ## FLM

@@ -110,12 +112,12 @@ alongside a primary chat model.

 Every provider implements:

-| Method | What it does |
-|---------------|-------------------------------------------------------------|
-| `build_env()` | Compute the env file the systemd unit will consume. |
-| `start_cmd()` | The argv to run inside the toolbox image. |
-| `health()` | Cheap probe to decide `warming → ready`. |
-| `infer()` | The request path the dispatcher proxies to. |
+| Method | What it does |
+|---------------|-----------------------------------------------------------------|
+| `build_env()` | Compute the env file the systemd unit and container will use. |
+| `start_cmd()` | The argv passed to podman to launch the container. |
+| `health()` | Cheap probe to decide `warming → ready`. |
+| `infer()` | The request path the dispatcher proxies to. |

 The slot lifecycle (`offline → pulling → starting → warming → ready →
 serving ↔ idle → unloading`) is identical across providers. Adding
@@ -124,7 +126,7 @@ changes required.
diff --git a/src/content/docs/docs/slots/what-is-a-slot.mdx b/src/content/docs/docs/slots/what-is-a-slot.mdx
index 33da6d5..7f461cc 100644
--- a/src/content/docs/docs/slots/what-is-a-slot.mdx
+++ b/src/content/docs/docs/slots/what-is-a-slot.mdx
@@ -15,9 +15,11 @@ slot that owns the model, and the slot answers.

 Concretely, each slot is a real systemd unit (e.g.
 `hal0-slot@primary.service`, an instance of the `hal0-slot@.service`
-template) running inside your LXC. `systemctl status hal0-slot@primary`
-works the way you expect it to. The slot shares the LXC's unified
-memory pool with any other Proxmox tenants on the same node.
+template) that launches a **podman container** on the host.
+`systemctl status hal0-slot@primary` works the way you expect it to.
+Container images and tuned flags come from slot profiles managed by
+hal0-api. The slot shares the LXC's unified memory pool with any
+other Proxmox tenants on the same node.

 ## Why slots exist

@@ -45,11 +47,12 @@ Each slot has:
 - A **name** (`primary`, `embed`, `stt`, `tts`, `img`, or a user-defined
   name).
 - A **model assignment** (a registry ref like `qwen2.5-0.5b-instruct-q4_k_m`).
-- A **provider** (`llama.cpp`, `flm`, `moonshine`, `kokoro`, `comfyui`)
-  that knows how to build the env, start the process, and run a health
-  probe.
+- A **provider** (`llama-server`, `flm`, `moonshine`, `kokoro`, or
+  `comfyui`) that knows how to build the env, start the container, and
+  run a health probe.
 - A **systemd unit**, an instance of the `hal0-slot@.service`
-  template (e.g. `hal0-slot@primary.service`).
+  template (e.g. `hal0-slot@primary.service`), which launches a
+  podman container.
 - A **port** in the range `8081`–`8099`, bound to `127.0.0.1` only.
 - A **state file** at `/var/lib/hal0/slots//state.json`,
   updated atomically on every transition and streamed to clients
@@ -103,14 +106,14 @@ to that slot's local port.

 ## What a slot is not

-- **Not a container manager.** Slots use plain systemd template
-  units, not Docker Compose or Kubernetes. Containerised backends
-  (toolbox images for FLM, ROCm, ComfyUI, etc.) are an implementation
-  detail of each provider.
-- **Not a model cache.** Weights live under `/mnt/ai-models/local`
-  with the index at `/var/lib/hal0/registry/registry.toml` (see the
-  [model registry](/docs/slots/model-registry/)); slots only reference
-  registry entries.
+- **Not a container orchestrator.** Slots use `hal0-slot@.service`
+  systemd template units that each launch one podman container — not
+  Docker Compose or Kubernetes. Container images and flags are an
+  implementation detail managed by each provider profile.
+- **Not a model cache.** Weights live under `/var/lib/hal0/models/`
+  (default) with the index at `/var/lib/hal0/registry/registry.toml`
+  (see the [model registry](/docs/slots/model-registry/)); slots only
+  reference registry entries.
 - **Not multi-tenant inside hal0.** Slot names are global to the
   install. There's no per-user partitioning in the v0.1 alpha line;
   agent / multi-tenant work is on the v0.2 roadmap. Multi-tenancy
diff --git a/src/pages/index.astro b/src/pages/index.astro
index 0764579..5d254f5 100644
--- a/src/pages/index.astro
+++ b/src/pages/index.astro
@@ -61,14 +61,14 @@ const features = [
   },
   {
     tag: 'dashboard',
-    title: 'Vue 3 + Tailwind 4 admin UI',
+    title: 'React 18 + Tailwind admin UI',
     body: 'Dark-by-default operator console. SSE-backed status and log tail. Capability cards group flat slots into embed / voice / image picks for one-click ops.',
   },
 ];

 const providers = [
   {
-    prov: 'llama.cpp',
+    prov: 'llama-server',
     ver: 'b9279',
     hw: 'Vulkan (default) · ROCm · CUDA',
     uses: 'chat · embed · rerank · vision',
@@ -276,9 +276,9 @@ const themes: RoadmapTheme[] = [
     blurb: 'The /v1/* surface and the engines behind it.',
     shipped: [
       { t: 'OpenAI-compatible /v1/* API', d: 'Chat, completions, embeddings, rerank, transcriptions, speech, images. Every OpenAI SDK works unchanged against the local box.' },
-      { t: 'Five-provider stack', d: 'llama.cpp (Vulkan / ROCm / CUDA) for chat and embed, FLM for the XDNA NPU, Moonshine for STT, Kokoro for TTS, ComfyUI for image generation.' },
+      { t: 'Five-provider stack', d: 'llama-server (Vulkan / ROCm / CUDA) for chat and embed, FLM for the XDNA NPU, Moonshine for STT, Kokoro for TTS, ComfyUI for image generation. Each runs as a podman container under per-slot systemd units.' },
       { t: 'Image generation', d: 'POST /v1/images/generations served by a ComfyUI provider on ROCm. Curated SDXL Turbo / SD 1.5 / Flux Schnell with license badges.' },
-      { t: 'FLM NPU provider live', d: 'Self-contained ghcr.io/hal0ai/hal0-toolbox-flm:v1, pinned by sha256. Chat + embed surfaced in the picker only when XDNA hardware is present.' },
+      { t: 'FLM NPU provider live', d: 'Self-contained ghcr.io/hal0ai/hal0-toolbox-flm:v1, pinned by sha256. Single FastFlowLM container serves chat + ASR + embeddings on the NPU; surfaced in the picker only when XDNA hardware is present.' },
     ],
     soon: [],
     later: [
@@ -294,7 +294,7 @@ const themes: RoadmapTheme[] = [
       { t: 'Slot lifecycle state machine', d: 'Atomic transitions (offline → pulling → starting → warming → ready → serving), persisted and SSE-streamed.' },
       { t: 'Capability slots overlay', d: 'Embed / Voice / Img capability cards plus an NPU backend rollup, backed by /etc/hal0/capabilities.toml.' },
       { t: 'embed-rerank built-in slot', d: 'Auto-created on first enable: bge-reranker-v2-m3-q4_k_m on :8086 with --reranking. /v1/rerankings stays separate from chat.' },
-      { t: 'Per-slot live metrics', d: 'GET /api/slots/metrics reads docker cgroup memory + ActiveEnterTimestampMonotonic + scraped /metrics.' },
+      { t: 'Per-slot live metrics', d: 'GET /api/slots/metrics reads podman container cgroup memory + ActiveEnterTimestampMonotonic + scraped /metrics.' },
       { t: 'Orchestrator drift reconcile', d: 'apply() reconciles capabilities.toml ↔ slots/*.toml on every call, not only on selection diff.' },
     ],
     soon: [