LocalBox
Put a local LLM behind the Claude Code / Unshackled harness
A PowerShell-driven launcher that runs Claude Code (or the Unshackled fork) against a local model — Ollama or llama.cpp — with the right Modelfile / chat template, KV-cache type, sampling, system prompt, and tool allowlist for each model family.
Windows / PowerShell only. Does not work in WSL/bash. The launcher reaches into the Ollama and llama-server lifecycle, drives
Start-Process, manages $PROFILE, and reads nvidia-smi. None of that travels cleanly across shells.
- LocalBox is this launcher: it runs local Ollama and llama.cpp models through Claude Code or Unshackled.
- BenchPilot is the companion optimizer: it benchmarks local models and exports recommended launcher profiles.
- Unshackled is the free Claude Code fork that the launcher can target with -Unshackled.
The vendored Anthropic models (Opus, Sonnet, Haiku) are good. They're also paid, rate-limited, hosted, and out of your control. A local model running through the same agent harness gets you the Claude Code editing loop, tool-calling discipline, and CLI ergonomics — but pointed at weights you actually own.
That sounds simple. In practice it isn't:
- Each model family wants a different chat template, sampler, and stop set. Qwen3-Coder needs the qwen3-coder parser; Qwen 3.6 wants qwen36; Devstral self-templates and you must pass Parser: none or it fights the GGUF.
- Anthropic's wire format carries thinking/reasoning blocks that Ollama's /v1/messages endpoint can't ingest. The launcher routes traffic through a small Python proxy (no-think-proxy.py) that strips them on the way in. For llama.cpp strip-mode launches it also passes --reasoning off and --reasoning-budget 0 so hidden thinking tokens are not generated in the first place. Thinking-trained models (ThinkingPolicy: keep) bypass the proxy.
- VRAM math is non-trivial. Q8 KV at 256 k tokens OOMs a 4090. Q4_K_M weights leave room for KV but lose precision on coding. The launcher tags every quant with [fits] / [tight] / [over] against your actual card and refuses combinations that will OOM, telling you what to drop.
- One alias per (model, context, quant), built lazily. Ollama bakes num_ctx into the Modelfile at create time, so qcoder30 at 32 k and at 256 k are physically different aliases. The launcher generates and tracks them; a sidecar version stamp catches drift when parsers or contexts change.
- Agent launches are single-session by default. llama.cpp can serve multiple slots, but Claude/Unshackled side requests compete with the main turn when auto-parallelism is left on. LocalBox launches agent sessions with --parallel 1 and prompt-cache reuse by default so repeated large prompts stay local to one slot. Both values are configurable in settings.json.
- Two harnesses, one dispatch path. Whether you launch Claude Code or Unshackled, the same env stack and proxy are set up through the -Unshackled switch on every model function.
The end result: one PowerShell function per model, flag-based, with the fiddly bits (process bouncing, env restoration, cache types, KV ceilings, tool allowlists, system prompts, parser stamps) hidden behind it.
qcoder -Ctx 32k -Unshackled # Qwen3-Coder @ 32k -> Unshackled
q36p -Ctx 128k # Qwen 3.6 Plus @ 128k → Claude Code
qcoder -Ctx 256 -Quant iq4xs # 256k coder context (4090 ceiling)
llmdefault # whatever the catalog / settings / .llm-default says
llm # interactive wizard (Spectre when available)
llmc # native selectable wizard
llms # Spectre wizard, explicit
info # dashboard: VRAM fit, parser freshness, defaults
info -Commands # full LocalBox + BenchPilot command list

A harness is the agent loop wrapping the model: the thing that turns raw
generation into "read this file, run that command, edit this code, then ask the
user". Claude Code is one such harness. Unshackled is a fork of it. ollama run
is not a harness; it's a chat REPL.
This launcher's whole job is to make a local model usable inside a real harness. Three modes are supported:
qcoder -Ctx 32k # qcoder is the per-model function name

What happens:
- The launcher snapshots and clears any ANTHROPIC_* env vars in the current shell.
- Starts the no-think proxy on 127.0.0.1:11435 (Python; ~300 ms cold).
- Bounces Ollama with the right OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE.
- Builds (or reuses) the Ollama alias for (model, context) with the correct parser-derived Modelfile.
- Sets ANTHROPIC_BASE_URL=http://localhost:11435, points ANTHROPIC_DEFAULT_*_MODEL at the alias, disables thinking + prompt caching, bumps API_TIMEOUT_MS to 30 min (local prefill is slow on big prompts).
- Launches claude --model <alias> --dangerously-skip-permissions [--tools <allowlist>] --append-system-prompt <local-tool-rules>.
- On exit, restores the original env and stops the proxy.
The model believes it's Claude. Claude Code believes it's talking to Anthropic. The proxy quietly strips Anthropic-only fields the local backend can't parse.
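As a rough illustration of the stripping step, here is a minimal Python sketch. It is not the contents of no-think-proxy.py: the function name `strip_thinking` and the set of block types it removes are assumptions, based only on the Anthropic Messages format where thinking content arrives as `thinking` / `redacted_thinking` content blocks and a top-level `thinking` request parameter.

```python
import copy

# Block types treated as "thinking" in this sketch (assumption; the real
# proxy may strip more or fewer fields).
THINKING_BLOCK_TYPES = {"thinking", "redacted_thinking"}


def strip_thinking(payload: dict) -> dict:
    """Return a copy of a /v1/messages request body without thinking content."""
    cleaned = copy.deepcopy(payload)
    # Drop the top-level parameter that requests extended thinking.
    cleaned.pop("thinking", None)
    for message in cleaned.get("messages", []):
        content = message.get("content")
        # Content may be a plain string; only block lists need filtering.
        if isinstance(content, list):
            message["content"] = [
                block
                for block in content
                if not (
                    isinstance(block, dict)
                    and block.get("type") in THINKING_BLOCK_TYPES
                )
            ]
    return cleaned
```

The function works on a deep copy, so the caller's original request is left untouched and can still be logged or retried as sent.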
Same flow, except the launch shells into bun src/entrypoints/cli.tsx against
an Unshackled checkout instead
of claude. If the configured UnshackledRoot doesn't exist, the first launch
offers to git clone from UnshackledRepoUrl (default
https://github.com/David-c0degeek/unshackled). Decline to abort.
qcoder -Ctx 32k -Unshackled

Use -Unshackled when you want the local model behind the Unshackled harness
instead of Claude Code. The launcher treats it as a peer to Claude Code: same
env stack, same proxy, same tool restrictions.
Choose Remote in llm, or use llmremote, to serve a model from this
machine to normal Unshackled clients on another machine. The server starts the
selected local backend and exposes only the LocalBox no-think gateway; Ollama
and llama.cpp stay bound to localhost.
$env:LOCAL_LLM_REMOTE_PASS = "chosenpass"
llmremote -Key qcoder30 -ContextKey 32k -Backend ollama

After startup, LocalBox opens a remote monitor with the gateway status and live
request log. Press Q to return to the menu while leaving the server running,
or S to stop the gateway and backend. Use llmremote -NoMonitor for scripted
or detached starts.
On the client, no LocalBox helper is required. Set the Anthropic-compatible environment variables and start the regular Unshackled CLI:
export ANTHROPIC_BASE_URL="http://192.168.178.61:11435"
export ANTHROPIC_AUTH_TOKEN="chosenpass"
export ANTHROPIC_API_KEY="chosenpass"
unshackled

Password-only HTTP is convenient for LAN testing. Over a public IP it is not encrypted: the password and prompts can be observed in transit unless you put a VPN or HTTPS reverse proxy in front of it.
Some models in the catalog have Strict: true. For those, init builds a
second alias — <root>-strict — that derives FROM <root>:latest and overlays
two things:
- Tighter sampling: temperature 0.2, top_p 0.8, top_k 20, min_p 0.05, repeat_penalty 1.15, repeat_last_n 4096.
- A non-negotiable engineering system prompt with rules the model is held to before every turn:

Do not create mocks, stubs, fake data, dummy implementations, placeholder services, TODO implementations, temporary bypasses, hardcoded sample responses, or NotImplementedException. Do not invent new architecture, schema fields, configuration properties, or abstractions unless they fit existing patterns. Do not make tests pass by weakening, bypassing, deleting, or faking real behavior. Reuse existing architecture and production code paths. If the real implementation is missing, blocked, or ambiguous: stop and explain what is missing instead of inventing a substitute.
The strict alias is parser-agnostic — inheritance via FROM carries the
RENDERER/PARSER/template forward, and the overlay only overrides
SYSTEM + sampling. Any model family works without per-parser branching.
The same overlay is wired into the llama.cpp path: pass -Strict and
Build-LlamaServerArgs injects the strict sampler flags. The strict system
prompt still comes from the Claude/Unshackled launch harness, not from a
llama-server-specific system-prompt flag.
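To make the llama.cpp side of the overlay concrete, the sketch below maps the strict sampling values listed above onto llama-server's sampling flags. It is an illustration in Python, not the actual Build-LlamaServerArgs implementation; `strict_sampler_args` and the two dictionaries are hypothetical names, and the real launcher may order or format the flags differently.

```python
# Strict sampling values as listed for the -strict overlay.
STRICT_SAMPLING = {
    "temperature": 0.2,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.05,
    "repeat_penalty": 1.15,
    "repeat_last_n": 4096,
}

# llama-server sampling flags corresponding to each setting.
FLAG_FOR = {
    "temperature": "--temp",
    "top_p": "--top-p",
    "top_k": "--top-k",
    "min_p": "--min-p",
    "repeat_penalty": "--repeat-penalty",
    "repeat_last_n": "--repeat-last-n",
}


def strict_sampler_args(sampling=STRICT_SAMPLING):
    """Build a flat argv fragment such as ['--temp', '0.2', ...]."""
    args = []
    for name, value in sampling.items():
        args += [FLAG_FOR[name], str(value)]
    return args
```

Keeping this as a pure function (values in, argv out) makes it easy to unit-test without starting a server.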
When to use it. Strict overlay is for actual engineering work where the model's lazy paths (mock, stub, "// TODO", placeholder JSON) cost real time. Skip it for chat, brainstorming, RAG-style Q&A.
q36p -Chat # plain `ollama run`, no Claude Code, no proxy

Useful for ad-hoc prompts, GGUF smoke tests, and bench comparisons. No tool
calls, no agent loop, no env-var dance. Llama.cpp doesn't have a built-in chat
REPL; for that backend, run launch-claude and point Claude Code or the
llama-server web UI at the running server.
The launcher dispatches to one of two backends per launch.
The original target. Ollama manages the model server, GGUF storage, alias
namespace, and KV cache. The launcher writes Modelfiles, runs ollama create,
sets OLLAMA_FLASH_ATTENTION=1 plus an optional OLLAMA_KV_CACHE_TYPE, and
expects Ollama ≥ the catalog's MinOllamaVersion. Models with
SourceType: remote pull from ollama.com; SourceType: gguf materializes
a local GGUF (resolved via HuggingFace or reused from an existing Ollama copy)
and creates the alias with FROM <gguf-path>.
For models the catalog marks SourceType: gguf and not LlamaCppCompatible: false, the wizard offers a llama.cpp path:
- native: upstream llama-server.exe. Mainline KV types only (q8_0, f16, q5_1, q5_0, q4_1, q4_0, iq4_nl, bf16, f32).
- turboquant: TheTom's llama.cpp turboquant fork, which ships turbo3 and turbo4 KV cache types (more aggressive than q4_0 but with a quality cliff that's a function of context length). Only available through the fork binary.
Both modes start a native llama-server process, pin to a free port from
LlamaCppPort (default 8080), wait for /v1/models to come up, then point
Claude Code at http://localhost:<port>. The Ollama daemon is shut down
during the launch unless LlamaCppCoexistOllama is set (single-GPU systems
don't have the VRAM headroom to run both).
# Wizard route — pick backend interactively
llm
# Direct (catalog must list a gguf model with no LlamaCppCompatible: false)
Invoke-Backend -Action launch-claude -Backend llamacpp `
-Key qcoder30 -ContextKey 256 `
-LlamaCppMode turboquant -KvCacheK turbo4 -KvCacheV turbo4 -Strict
lps # show running llama-server (port, pid, gguf path)
lstop # stop it

Use llm-stop (or llmstop) to stop both Ollama and llama.cpp without
restarting either backend.
llama.cpp is the path you want when:
- You need a KV cache type Ollama doesn't expose (turbo4, iq4_nl).
- You want explicit control over --n-cpu-moe, --mlock, --no-mmap.
- You're running a quant Ollama refuses to load.
- You want the strict overlay applied at launch time rather than baked into a Modelfile.
Otherwise Ollama is the simpler default — alias namespace, ollama ps,
ollama show, the bench history all assume Ollama.
The repo ships in two folders that map to two deployed locations:
repo deployed
local-llm/ ─── install ──→ %USERPROFILE%\.local-llm\
ollama-proxy/ ─── install ──→ %USERPROFILE%\.ollama-proxy\
local-llm/
LocalLLMProfile.ps1 minimal entry point — dot-sourced by $PROFILE
llm-models.json model catalog (committed, sharable)
lib/
00-settings.ps1 config loader, settings.json overlay, env names
10-helpers.ps1 pwsh utility primitives (Section/Pause/Convert paths)
20-models.ps1 model-def access, alias naming, strict-sibling helpers
25-vram.ps1 nvidia-smi auto-detect, fit-class arithmetic
30-ollama.ps1 ollama lifecycle (start/stop/wait/env/version probe)
32-llamacpp.ps1 llama-server lifecycle (port pick, health, session)
33-llamacpp-install.ps1 resolve native + turboquant llama-server binaries
35-backend.ps1 Invoke-Backend dispatcher (ollama vs llamacpp)
40-parsers.ps1 per-family chat template / sampler / strict overlay
41-llamacpp-args.ps1 pure argv builder for llama-server
42-llamacpp-templates.ps1 parser → llama-server flag mapping, strict file
45-profile-version.ps1 Modelfile content hash for staleness detection
50-modelfile.ps1 Ollama alias creation + lifecycle (incl. strict siblings)
55-huggingface.ps1 HF repo discovery, GGUF download, quant code recognition
60-catalog.ps1 catalog editor (addllm/updatellm/removellm/setllm)
65-claude-launch.ps1 Claude/Unshackled launcher; env save/restore, proxy
70-bench.ps1 ospeed → bench-history.jsonl, Show-LLMBenchHistory
75-display.ps1 info dashboard (Spectre + plain-text fallbacks)
80-init.ps1 init/initmodel/purge/ostop/qkill/ops
85-shortcuts.ps1 per-model function generator, default-key resolution
90-wizard.ps1 native selectable + Spectre interactive wizards
99-entrypoints.ps1 llm/llmmenu/llmc/llms/reloadllm/lps/lstop
ollama-proxy/
no-think-proxy.py strips Anthropic thinking/reasoning blocks
enforcer-claude.ps1 wrapper that re-enters the local backend on Claude → Claude calls
LocalLLMProfile.ps1 dot-sources every lib/*.ps1 in numeric prefix order,
loads llm-models.json overlaid with ~/.local-llm/settings.json, and
registers per-model shortcut functions. Everything else hangs off that.
From the repo root:
. .\install.ps1 # copy files to deployed locations + add to $PROFILE
. .\install.ps1 -Symlink # symlink instead of copy (admin / dev mode)
. .\install.ps1 -SetupProfile # only ensure $PROFILE dot-sources the deployed file
. .\install.ps1 -InstallBenchPilot # also clone BenchPilot if missing
. .\install.ps1 -InstallUnshackled # also clone Unshackled if missing
. .\install.ps1 -DryRun # preview without changing anything

After install, open a fresh PowerShell. Two things to do:
init # build aliases for the recommended-tier models
info # verify: VRAM, default model, parser freshness

The install step offers to clone missing BenchPilot and Unshackled checkouts
into ~/.local-llm/tools/. Use -SkipToolPrompts for unattended installs.
Show-Diagnostics also reports on ollama, python, bun (only needed for
Unshackled), PwshSpectreConsole, BenchPilot, and Unshackled.
Installs also record LocalBoxRoot in settings.json, which lets llm-update
pull this repo and redeploy the profile files later.
One function per model. Flag-based:
qcoder -Ctx 32k -Unshackled Code agent (Qwen3-Coder, 32k, Unshackled)
qcoder -Ctx 32k -Codex Code agent (Qwen3-Coder, 32k, Codex)
q36p -Ctx 32k -Unshackled General Qwen 3.6 agent (32k, Unshackled)
dev -Ctx 32k Smaller / faster (Devstral 24B, 32k)
q36p -Ctx 128k -Unshackled Big context (Qwen 3.6 Plus, 128k)
qcoder -Ctx 256 -Quant iq4xs 256k coder context (4090 ceiling — no -Q8)
q36p -Chat Raw ollama chat, no Claude Code
q36p -Q8 Use q8 KV cache for higher quality
q36p -Quant q6kp Switch the GGUF quant (rebuilds aliases)
llmdefault Launch the configured default recipe/model
llmdefaultunshackled Same, via Unshackled
llmdefaultcodex Same, via Codex
llmdefaultchat Same, plain chat
llm Guided wizard (Spectre when available)
llmc Native selectable wizard, explicit alias
llms Spectre wizard, explicit alias
info Dashboard
info -Commands Full LocalBox + BenchPilot command list
llmdocs Quick reference
llm-update Update LocalBox + installed companion checkouts
| Flag | Effect |
|---|---|
| -Ctx <name> | One of the model's context keys (32k, 64k, 128k, 256k). Omit for default. |
| -Unshackled | Use Unshackled instead of Claude Code. |
| -Codex | Use OpenAI Codex instead of Claude Code. |
| -Chat | Run plain ollama run, skip Claude Code entirely. |
| -Q8 | Set OLLAMA_KV_CACHE_TYPE=q8_0 for this launch. Refused above the VRAM-derived Q8KvMaxContext ceiling: q8 KV at long context will OOM. |
| -Quant <name> | Switch the model's selected GGUF quant. No launch; rebuilds the alias. |
The combination of Qwen3-Coder-30B-A3B Heretic (4 KV heads, 48 layers) at the IQ4_XS quant with q4_0 KV cache is the only setup that fits a full 256k context on a single 4090:
qcoder -Ctx 256 -Quant iq4xs # Claude Code @ 256k
qcoder -Ctx 256 -Quant iq4xs -Unshackled # Unshackled @ 256k

Weights ~16.5 GB; q4_0 KV @ 256k ~6 GB; total ~23.6 GB. The launcher will
refuse -Q8 at this context because q8 KV would push KV cache to ~12 GB
and OOM the card. Run llmdocs for the full quick reference, info for the
dashboard, or info -Commands for the generated model commands plus LocalBox
and BenchPilot management commands.
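The ~6 GB and ~12 GB KV figures can be reproduced with back-of-envelope arithmetic. The Python sketch below uses the 48 layers and 4 KV heads quoted above; the head dimension of 128 is an assumption (it is not stated here), and `kv_cache_gib` is a hypothetical helper, not launcher code. Results are in GiB (2^30 bytes).

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_element):
    """Estimate KV-cache size: one K and one V vector per layer per token."""
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim  # K + V
    total_bytes = n_ctx * elements_per_token * bits_per_element / 8
    return total_bytes / 2**30


CTX_256K = 256 * 1024  # 262144 tokens

# Nominal 4-bit and 8-bit storage (ignores per-block scale overhead).
nominal_q4 = kv_cache_gib(CTX_256K, 48, 4, 128, 4.0)  # 6.0
nominal_q8 = kv_cache_gib(CTX_256K, 48, 4, 128, 8.0)  # 12.0

# q4_0 and q8_0 store one fp16 scale per 32-element block,
# i.e. 4.5 and 8.5 bits per element, so the real cache is a bit larger.
blocked_q4 = kv_cache_gib(CTX_256K, 48, 4, 128, 4.5)  # 6.75
blocked_q8 = kv_cache_gib(CTX_256K, 48, 4, 128, 8.5)  # 12.75
```

Under these assumptions, doubling the KV precision adds roughly 6 GiB at 256k, which is more than the headroom left next to ~16.5 GB of weights on a 24 GB card.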
addllm <hf-url-or-repo> -Key <key> [-Quants Q4_K_P,IQ4_XS] [-DefaultQuant Q4_K_P] [-Tier recommended]
initmodel <key>

addllm registers every recognized GGUF quant the HF repo publishes by
default (the imatrix.gguf calibration file is excluded). Pass -Quants only
when you want to filter the catalog entry to a subset.
Backfilling missing quants on an existing entry (rerunning HF discovery
without overwriting your manual QuantNotes / ContextNotes):
updatellm <key> # adds any HF quants missing from the entry
updatellm <key> -DryRun # preview without writing

Removing a model:
removellm <key> # confirms first
removellm <key> -Force # skip confirmation
removellm <key> -KeepFiles # keep the GGUF blobs on disk

The launcher reads your GPU's VRAM and uses it to:
- Tag every quant with [fits]/[tight]/[over] in info and the llm wizard, so you can see at a glance which builds will load fully on your card.
- Set the Q8KvMaxContext ceiling: the largest num_ctx that pairs safely with -Q8 (q8_0 KV cache). Roughly +16k tokens of headroom per GB above 16 GB; floors at 64 k. The guard refuses launches that would exceed this and tells you what to drop.
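One way to read the "+16k per GB above 16 GB, floor 64k" rule is the linear formula sketched below in Python. The exact formula the launcher uses is not given here, so treat `q8_kv_max_context` and its parameters as assumptions; under this reading a 24 GB card gets a ceiling of 196608 tokens, which is below 256k.

```python
def q8_kv_max_context(vram_gb, floor_tokens=64 * 1024,
                      tokens_per_gb=16 * 1024, base_gb=16):
    """Assumed rule: 64k floor, plus 16k tokens per GB of VRAM above 16 GB."""
    extra_gb = max(0, vram_gb - base_gb)
    return floor_tokens + int(extra_gb * tokens_per_gb)


def q8_allowed(num_ctx, vram_gb):
    """True when a -Q8 launch at num_ctx stays within the ceiling."""
    return num_ctx <= q8_kv_max_context(vram_gb)
```

For example, `q8_allowed(256 * 1024, 24)` is false under this sketch, matching the refusal of -Q8 at 256k on a 24 GB card.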
VRAM resolves in this order:
- VRAMGB set in settings.json or llm-models.json (top-level).
- nvidia-smi --query-gpu=memory.total auto-detect (largest GPU on a multi-GPU box).
- Fallback to 24.
The info dashboard shows the resolved value and source
(auto / configured / fallback).
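The resolution order can be sketched as a small Python function. This is an illustration, not the code in 25-vram.ps1: the function names are hypothetical, the `--format=csv,noheader,nounits` output shape (one MiB value per GPU per line) is an assumption about how the query is run, and rounding to the nearest whole GB is also assumed.

```python
def parse_nvidia_smi_mib(output):
    """Parse one integer MiB value per line, as printed by
    nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits."""
    values = []
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():
            values.append(int(line))
    return values


def resolve_vram_gb(configured_gb=None, smi_output=None, fallback_gb=24):
    """Return (vram_gb, source) following configured -> auto -> fallback."""
    if configured_gb is not None:
        return configured_gb, "configured"
    gpus = parse_nvidia_smi_mib(smi_output or "")
    if gpus:
        # Largest GPU wins on a multi-GPU box.
        return round(max(gpus) / 1024), "auto"
    return fallback_gb, "fallback"
```

Passing the nvidia-smi text in as a string keeps the resolver testable on machines without an NVIDIA GPU.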
Set-LocalLLMSetting VRAMGB 32 # 5090
Set-LocalLLMSetting VRAMGB 48 # RTX 6000 Ada / dual-card aggregate
Set-LocalLLMSetting VRAMGB $null # remove override, fall back to auto-detect
Set-LocalLLMSetting Q8KvMaxContext 196608 # pin the q8 ceiling explicitly

Per-quant tradeoffs come from two optional catalog fields:
- QuantSizesGB: file size per quant in GB (drives the fit badge).
- QuantNotes: human-readable note per quant (quality/use-case context). Shown verbatim.
Per-context guidance comes from ContextNotes in the same shape. Backfill
these on any model you add — they show up inline in info and the wizard.
llm-models.json is the model catalog — committed, sharable. Per-machine
paths and preferences belong in a sibling settings.json at
~/.local-llm/settings.json (gitignored). It overlays top-level scalars from
the catalog at load time, so you don't have to hand-edit llm-models.json to
fix paths on a fresh machine.
Use the helper instead of editing JSON:
Set-LocalLLMSetting UnshackledRoot '<path-to-unshackled>' # usually auto-set by install.ps1
Set-LocalLLMSetting BenchPilotRoot '<path-to-benchpilot>' # usually auto-set by install.ps1
Set-LocalLLMSetting LocalBoxRoot '<path-to-LocalBox>' # auto-set by install.ps1
Set-LocalLLMSetting Default q36plus
Set-LocalLLMSetting KeepAlive '5m'
Set-LocalLLMSetting VRAMGB 32 # override auto-detect
Set-LocalLLMSetting Q8KvMaxContext 196608 # pin the -Q8 ceiling
Set-LocalLLMSetting LlamaCppDefaultMode native # or 'turboquant'
Set-LocalLLMSetting LlamaCppCoexistOllama $true # rare: allow both backends concurrently
Set-LocalLLMSetting LlamaCppNCpuMoe 35 # MoE expert CPU offload (default 35; 0 to disable)
Set-LocalLLMSetting LlamaCppMlock $false # disable RAM locking (default $true)
Set-LocalLLMSetting LlamaCppNoMmap $false # disable no-mmap (default $true)
Set-LocalLLMSetting LlamaCppAgentParallel 1 # agent slots (default 1; 0 = llama.cpp auto)
Set-LocalLLMSetting LlamaCppAgentCacheReuse 256 # prompt-cache reuse chunk size (default 256; 0 = llama.cpp default)
Set-LocalLLMSetting LocalModelMaxOutputTokens 4096 # cap local Claude/Unshackled completions (0 = tool default)
Set-LocalLLMSetting UnshackledRoot $null # remove an entry

The Models and CommandAliases keys are catalog-only and rejected by
Set-LocalLLMSetting. Everything else is fair game.
Drop a .llm-default file in any directory containing a single line — a
model key, ShortName, or Root. llmdefault (and the enforcer wrapper)
walks up from $PWD and uses the nearest match. Falls back to settings →
catalog Default.
echo q36p > .llm-default # this workspace prefers Qwen 3.6 Plus
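The walk-up lookup amounts to checking the current directory and then each parent in turn. A minimal Python sketch follows; `find_llm_default` is a hypothetical name, the file is assumed to be UTF-8 text, and the fallback to settings and the catalog Default is left to the caller.

```python
from pathlib import Path


def find_llm_default(start):
    """Return the first line of the nearest .llm-default at or above start,
    or None when no such file exists."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        candidate = directory / ".llm-default"
        if candidate.is_file():
            # utf-8-sig tolerates a leading BOM (assumes UTF-8 content).
            lines = candidate.read_text(encoding="utf-8-sig").splitlines()
            first = lines[0].strip() if lines else ""
            if first:
                return first
    return None
```

Because the nearest file wins, a nested project can override the default set in a parent workspace.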
Claude Code's MCP servers expose tools with names like mcp__<server>__<tool>.
They reach the local model through the same launch path:
- Models with "LimitTools": false (e.g. dev) get every MCP tool automatically; the --tools flag isn't passed.
- Models with "LimitTools": true (default) only see tools in the allowlist. Add the MCP tool names you want to either the global LocalModelTools field in llm-models.json or a per-model Tools override.
Example per-model override:
"q36plus": {
...,
"Tools": "Bash,Read,Write,Edit,Glob,Grep,mcp__filesystem__read_file,mcp__filesystem__write_file"
}

info shows a Tools : ... line for any model that overrides the global list.
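The allowlist rules reduce to a short decision: no --tools flag when LimitTools is false, otherwise the per-model Tools value if present, else the global LocalModelTools. A hedged Python sketch (the function `tools_args` is hypothetical; the field names come from the catalog description above):

```python
def tools_args(model, global_tools):
    """Return the argv fragment for the tool allowlist, or [] for no limit.

    model is one catalog entry (a dict); global_tools is the catalog's
    LocalModelTools string.
    """
    # LimitTools defaults to true; only an explicit false lifts the limit.
    if model.get("LimitTools", True) is False:
        return []
    allowlist = model.get("Tools") or global_tools
    return ["--tools", allowlist]
```

A per-model Tools value replaces the global list rather than extending it in this sketch, so MCP tool names have to be listed alongside the built-in tools.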
ospeed <model> appends one JSONL line per run to
~/.local-llm/bench-history.jsonl. View with:
obench # last 20 entries, all models
obench -Model q36plus -Last 50 # filter by model
Trim-LLMBenchHistory -OlderThanDays 90 -DryRun # preview pruning
Trim-LLMBenchHistory -OlderThanDays 90 # apply pruning

findbest is a LocalBox compatibility command that delegates tuning to
BenchPilot. LocalBox no longer
contains a legacy benchmark/search implementation; BenchPilot is the single
owner of benchmark execution, scoring, reports, and AutoBest export.
BenchPilot writes a LocalBox-compatible result to
~/.local-llm/tuner/best-<key>.json, and Start-ClaudeWithLlamaCppModel -AutoBest replays that saved profile.
Standard catalog context aliases are 32k, 64k, 128k, and 256k unless a
model explicitly lacks support. AutoBest profiles are context-aware: the saved
entry records both contextKey and the resolved contextTokens, and launcher
selection still requires the same context key.
# Tune q36plus at the 256k context preset, native llama.cpp, default budget.
# Default goal is coding-agent: long-prefill end-to-end latency.
findbest q36plus -ContextKey 256k
# Quick mode — only baseline + n-cpu-moe + batching (~10 trials)
findbest q36plus -ContextKey 256k -Quick
# Deep mode — normal phases, then finer local offload/batch/thread refinement
findbest q36plus -ContextKey 256k -Deep
# Default sampling is three runs per candidate; override when needed
findbest q36plus -ContextKey 256k -Runs 5
# Save both the fastest raw profile and a workstation-friendly balanced profile
findbest q36plus -ContextKey 256k -Profile both
# Force the expanded beam search and keep three survivors after each phase
findbest q36plus -ContextKey 256k -SearchStrategy beam -BeamWidth 3
# Optimize for prompt-eval (prefill) or generation explicitly
findbest q36plus -ContextKey 256k -Optimize prompt
findbest q36plus -ContextKey 256k -Optimize gen
# Allow KV cache variation. Native mode defaults to the model's current type;
# turboquant mode always also tests turbo3/turbo4 KV cache encodings.
findbest q36plus -ContextKey 256k -AllowedKvTypes q8_0,f16
# Try mismatched K/V pairs too, and allow an explicit quality trade if wanted
findbest q36plus -ContextKey 256k -AllowedKvTypes q8_0,q4_0 -AggressiveKv
# Power-user: tune separate short- and long-prefill profiles
findbest q36plus -ContextKey 256k -PromptLengths short,long
# Inspect every trial run for a model
Show-LlamaCppTunerHistory -Key q36plus -Last 50

BenchPilot may use fast llama-bench probes where supported, but turboquant
mode uses llama-server probes so turbo3 / turbo4 are measured through the
same binary LocalBox will actually launch. Upstream llama-bench has KV-cache
flags (-ctk / -ctv), but TurboQuant cache types only work in a fork/build
that registers them; LocalBox's turboquant path uses TheTom's fork.
-Quant selects the GGUF model file and stays fixed during a tuner run. KvK
and KvV are only runtime KV-cache encodings. In native mode BenchPilot
defaults KV search to the model's current cache type unless you pass
-AllowedKvTypes. In turboquant mode BenchPilot always includes turbo3 and
turbo4 as KV-cache candidates and probes them early, because otherwise a
turboquant "best" run may never test the backend's defining cache encodings.
During a run, BenchPilot prints one row per candidate:
Waiting for llama-server...
[ 4] OK pp= 645.95 tg= 58.39 score= 559.36 phase=moe NCpuMoe=20 KvK=q8_0 KvV=q8_0
Live output legend:
- [ 4]: trial number within this tuning run.
- OK / OOM / FAIL: candidate status. OOM means the trial exhausted memory; FAIL means startup or probing failed for another reason.
- pp: prompt-processing / prefill throughput, in tokens/sec. This measures how fast llama.cpp consumes the input prompt before generation starts. By default this is the average across three samples for that candidate.
- tg: text-generation throughput, in tokens/sec after generation starts. This is also averaged across the candidate's samples.
- score: the value BenchPilot is optimizing. With the default -Optimize coding-agent, this is effective end-to-end tokens/sec for a large coding-agent prompt plus a moderate generated reply, so higher is better. With -Optimize gen, it follows tg; with -Optimize prompt, it follows pp; with -Optimize both, it uses a balanced prompt/generation score. With -Profile balanced, BenchPilot stores the selected balanced score and also preserves the raw pure score for comparison.
- phase: which search step produced the candidate. Common values are baseline, moe, ngl, batching, flash, mmap, threads, kv, verify, and, with -Deep, deep_moe, deep_ngl, deep_batching, deep_flash, and deep_threads.
Extra Name=value pairs at the end are the settings changed for that trial:
- NCpuMoe: llama.cpp --n-cpu-moe, the number of MoE expert layers kept on CPU. Lower usually moves more expert work to GPU, which is faster but uses more VRAM.
- NGpuLayers: llama.cpp -ngl / --n-gpu-layers, the dense-model layer offload count. Higher generally uses more GPU and more VRAM.
- UbatchSize: llama.cpp --ubatch-size, the physical microbatch size used during prompt processing.
- BatchSize: llama.cpp --batch-size, the logical prompt batch size.
- Threads: llama.cpp --threads, CPU threads for generation/eval work.
- ThreadsBatch: llama.cpp --threads-batch, CPU threads for prompt batch processing.
- FlashAttn: flash-attention on/off.
- Mlock: llama.cpp --mlock, requests locking model/cache pages in memory.
- NoMmap: llama.cpp --no-mmap, loads model weights without memory mapping.
- KvK: KV-cache type for attention keys, passed as -ctk.
- KvV: KV-cache type for attention values, passed as -ctv.
KV-cache values are llama.cpp cache encodings such as f16, q8_0, q4_0,
or, in TheTom turboquant builds, turbo3 / turbo4. These are not GGUF model
quants like Q4_K_M, Q8_0, or APEX variants; they only change how the
runtime stores attention keys/values. KV changes can affect quality, so
BenchPilot runs a perplexity sanity check before saving a faster changed KV
pair unless -AllowKvQualityRegression is passed.
Skip/pruning messages use the same ideas with shorter llama.cpp-style labels:
ngl means NGpuLayers, ub means UbatchSize, b means BatchSize,
flash means FlashAttn, threads means Threads / ThreadsBatch, and
kv K=... V=... means the KvK / KvV pair being tested.
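If you want to post-process the live output, a candidate row like the sample shown earlier can be parsed with two regular expressions. This Python sketch is based only on that one sample line; `parse_trial_row` is a hypothetical helper, and the real column spacing may differ from what is shown here, so the patterns tolerate variable whitespace.

```python
import re

# "[ 4] OK ..." : trial number, status, then the rest of the row.
ROW = re.compile(r"^\[\s*(?P<trial>\d+)\]\s+(?P<status>OK|OOM|FAIL)\b(?P<rest>.*)$")
# "pp= 645.95" or "KvK=q8_0": name, '=', optional padding, value.
PAIR = re.compile(r"(\w+)=\s*(\S+)")


def parse_trial_row(line):
    """Return a dict for a candidate row, or None for any other line."""
    match = ROW.match(line.strip())
    if not match:
        return None
    row = {"trial": int(match["trial"]), "status": match["status"]}
    for name, value in PAIR.findall(match["rest"]):
        try:
            row[name] = float(value) if "." in value else int(value)
        except ValueError:
            row[name] = value  # non-numeric, e.g. phase=moe or KvK=q8_0
    return row
```

Lines that are not candidate rows, such as "Waiting for llama-server...", simply return None and can be skipped.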
What gets searched:
BenchPilot supports two search strategies. greedy keeps only the current best
candidate after each phase. beam keeps the top -BeamWidth candidates, so a
near-winner from the MoE/NGL phase can still combine with later batching,
flash-attention, mmap/mlock, and thread settings. Normal pure runs default to
greedy with width 1. -Deep or any balanced profile defaults to beam with
width 3 unless you pass -SearchStrategy / -BeamWidth explicitly.
- baseline: catalog defaults, one probe.
- moe_or_ngl: for MoE models, sweep --n-cpu-moe to find the smallest value that still fits VRAM (more layers on GPU = faster). For dense models with -ngl already at 999, this phase is a no-op unless baseline OOMed.
- batching: joint sweep of (--ubatch-size, --batch-size) over a small 2-D grid.
- flash: compares flash-attention on/off.
- mmap: compares --mlock --no-mmap against the default mapping mode (-Aggressive tries the cross-combinations too).
- threads: sweeps CPU thread counts when the winning config keeps MoE experts or dense layers on CPU.
- kv: runs when more than one KV-cache encoding is available. Native mode only widens this phase when -AllowedKvTypes is supplied. Turboquant mode always includes turbo3 and turbo4, and probes KV early enough that the normal MoE/batch grid cannot consume the whole default budget first. The tuner runs a small perplexity sanity check and refuses a >1% regression unless -AllowKvQualityRegression is passed.
- deep: optional (-Deep). Re-tests a finer local neighborhood around the winning offload value, expands the batch grid up to -ub 2048 / -b 4096, re-checks flash-attention after batch changes, and tries a wider CPU-thread set when CPU offload remains. If -Budget is omitted, deep mode raises the budget from 30 to 60.
- verify: re-runs the final winner through llama-server when a coarse bench phase was used.
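The difference between greedy and beam selection across phases can be shown with a toy Python sketch. This is not BenchPilot's implementation: `phased_search` is a hypothetical function, the two phases are simplified stand-ins, and the scores in `TOY_SCORES` are made-up numbers chosen only to show a case where the best single-phase winner is not part of the best combination.

```python
def phased_search(baseline, phases, score, beam_width=1):
    """Run phases in order, keeping the top beam_width configs after each.

    phases is a list of functions: config -> list of candidate configs.
    beam_width=1 is greedy; larger widths keep near-winners alive.
    """
    survivors = [baseline]
    for expand in phases:
        pool = list(survivors)  # current survivors stay in the running
        for config in survivors:
            pool.extend(expand(config))
        pool.sort(key=score, reverse=True)
        survivors = pool[:beam_width]
    return survivors[0]


# Made-up scores keyed by (n-cpu-moe, ubatch); not measurements.
TOY_SCORES = {
    (24, 512): 100, (18, 512): 120, (20, 512): 115,
    (24, 2048): 105, (18, 2048): 118, (20, 2048): 140,
}


def toy_score(config):
    return TOY_SCORES[(config["moe"], config["ub"])]


def moe_phase(config):
    return [{**config, "moe": m} for m in (18, 20)]


def batch_phase(config):
    return [{**config, "ub": u} for u in (512, 2048)]


BASELINE = {"moe": 24, "ub": 512}
greedy = phased_search(BASELINE, [moe_phase, batch_phase], toy_score, 1)
beam = phased_search(BASELINE, [moe_phase, batch_phase], toy_score, 2)
# greedy -> {"moe": 18, "ub": 512}; beam -> {"moe": 20, "ub": 2048}
```

With width 1 the search commits to moe=18 after the first phase and never sees the stronger moe=20 plus ub=2048 pairing; with width 2 the runner-up survives long enough to combine with the larger batch.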
The default -Optimize coding-agent score is effective end-to-end throughput
for a local coding-agent request: large prompt prefill plus a moderate generated
reply. That prevents decode-only winners with high generation TPS and poor
prompt-processing TPS from being saved as "best".
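One formula consistent with this description is total tokens divided by total time, where prefill and generation each run at their own rate. The Python sketch below is an assumption, not BenchPilot's documented scoring code, and the token counts are illustrative: a 64:1 prompt-to-reply ratio (for example 16384 prompt tokens and 256 reply tokens) happens to reproduce the score in the sample row shown earlier (pp 645.95, tg 58.39, score 559.36).

```python
def effective_tps(pp, tg, prompt_tokens, reply_tokens):
    """End-to-end tokens/sec: prefill at pp tok/s, then generation at tg tok/s."""
    seconds = prompt_tokens / pp + reply_tokens / tg
    return (prompt_tokens + reply_tokens) / seconds


# Sample row values; token counts are assumed for illustration.
balanced = effective_tps(645.95, 58.39, 16384, 256)      # about 559.36

# A hypothetical decode-only "winner": faster generation, slow prefill.
decode_heavy = effective_tps(200.0, 80.0, 16384, 256)    # far lower
```

Because the prompt dominates the token count, a candidate with faster generation but much slower prefill scores well below the balanced one, which is the behavior the default goal is meant to enforce.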
-Profile pure selects the highest measured LLM throughput. -Profile balanced starts from that pure score and applies visible CPU, RAM, VRAM,
variance, and stability factors so the saved launch leaves more workstation
headroom. -Profile both runs and saves both profile types for the same target.
Balanced reports include the pure score and score-factor breakdown when
telemetry is available.
OOM/failure is detected from process output; OOM-monotonicity prunes branches
that are guaranteed to fail. Saved entries include the GPU name and llama.cpp
build stamp, so -AutoBest can warn when a re-tune is advisable after a
hardware or llama.cpp upgrade. Saved entries also include profile,
searchStrategy, beamWidth, pureScore, optional telemetry, and optional
scoreBreakdown; older entries without profile are treated as pure. Tuner
version changes require a re-tune.
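OOM-monotonicity pruning can be illustrated for the --n-cpu-moe sweep: lower values keep fewer expert layers on CPU and therefore use more VRAM, so once a value runs out of memory, every lower value can be skipped. A minimal Python sketch under that assumption (`prune_after_oom` is a hypothetical helper, not BenchPilot code):

```python
def prune_after_oom(candidates, oom_value):
    """Drop n-cpu-moe candidates that need at least as much VRAM as one
    that already OOMed (i.e. values less than or equal to it)."""
    return [value for value in candidates if value > oom_value]


remaining = prune_after_oom([35, 30, 25, 20, 15, 10], oom_value=20)
# remaining -> [35, 30, 25]
```

For a parameter where higher means more VRAM, such as the GPU layer count, the comparison would be reversed.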
-PromptLengths short,long stores separate profile entries. -AutoBest
defaults to auto: it prefers balanced when present, falls back to pure,
and within each profile prefers long/coding-agent prompt profiles before short
profiles. Use -AutoBestProfile pure or -AutoBestProfile balanced to force
the selection profile. -AutoBestProfile short and -AutoBestProfile long
remain available as legacy prompt-length overrides and force a pure profile.
Replaying the saved best:
Start-ClaudeWithLlamaCppModel -Key q36plus -ContextKey 256k -Mode native -AutoBest
Start-ClaudeWithLlamaCppModel -Key q36plus -ContextKey 256k -Mode native -AutoBest -AutoBestProfile balanced

The launcher matches the saved entry on (key, contextKey, mode, profile, prompt_length, quant, vramGB ± 1) and a tuner-version stamp; contextTokens
is recorded as provenance for the actual num_ctx used by the run. On a miss
it warns and falls through to defaults. Caller-supplied -KvCacheK /
-KvCacheV / -ExtraArgs always win over the saved values.
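The matching and override rules can be sketched in Python as follows. The field names in `MATCH_FIELDS` come from the tuple above; `tunerVersion` as a field name, and both function names, are assumptions made for illustration rather than the saved file's actual schema.

```python
MATCH_FIELDS = ("key", "contextKey", "mode", "profile", "prompt_length", "quant")


def find_autobest(entries, wanted, tuner_version, vram_gb):
    """Return the first saved entry matching the launch, or None on a miss."""
    for entry in entries:
        if any(entry.get(f) != wanted.get(f) for f in MATCH_FIELDS):
            continue
        if entry.get("tunerVersion") != tuner_version:  # assumed field name
            continue
        if abs(entry.get("vramGB", 0) - vram_gb) > 1:   # vramGB within +/- 1
            continue
        return entry
    return None


def merge_launch_args(saved, caller):
    """Caller-supplied values win; None means 'not supplied'."""
    merged = dict(saved)
    merged.update({k: v for k, v in caller.items() if v is not None})
    return merged
```

On a miss the caller would warn and continue with catalog defaults, as described above.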
Before handing an AutoBest llama.cpp session to Claude or Unshackled, LocalBox
sends a tiny /v1/messages smoke request, including the same system prompt
used for the real launch, through the same Anthropic-compatible route. The
smoke must produce the requested visible answer; text hidden inside
<think>...</think> does not count. If the no-think proxy route fails,
LocalBox tries a direct llama-server route for that session. If both routes
fail, launch stops immediately instead of starting an unusable spinner-only
session.
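The "visible answer" rule can be approximated by removing think blocks before checking the reply. The Python sketch below is an assumption about how such a check could look, not LocalBox's implementation; the function names and the expected token `READY` used in the usage note are hypothetical.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
OPEN_THINK = re.compile(r"<think>.*", re.DOTALL | re.IGNORECASE)


def visible_text(reply):
    """Return the reply with <think>...</think> content removed."""
    without_closed = THINK_BLOCK.sub("", reply)
    # An unterminated <think> hides everything after it.
    return OPEN_THINK.sub("", without_closed).strip()


def smoke_passes(reply, expected):
    """True only when the expected answer appears outside think blocks."""
    return expected in visible_text(reply)
```

For instance, `smoke_passes("<think>say READY</think>READY", "READY")` is true, while a reply whose only `READY` sits inside a think block fails the check.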
In the wizard, choose the llama.cpp backend and then Find best settings to
run the same tuner interactively, with prompts for normal vs deep tuning,
pure vs balanced vs both selection profiles, KV variation, saving the winner,
and launching immediately with -AutoBest.
When both pure and balanced profiles are saved, the launch-settings step shows
separate Use balanced and Use pure choices, plus Use AutoBest for
the default balanced-then-pure preference. After a -Profile both tuning run,
the immediate-launch flow also asks which saved profile to use.
Choose Delete best settings from the same action menu to remove saved
AutoBest entries for the selected (model, quant, context, backend mode, VRAM)
before re-tuning.
After a matching best config has been saved, normal wizard launches for the
same (model, quant, context, backend mode, VRAM) automatically replay it and
skip the manual KV-cache picker. If no matching entry exists, the wizard keeps
the usual manual KV-cache selection.
llm launches the Spectre picker when PwshSpectreConsole is available. Use
llmc for the native selectable picker; it uses arrow keys + Enter, while
keeping number/letter shortcuts for fast selection.
It walks: model → quant → backend → context → action → q8/kvcache → launch.
Each step has a Back option (0/Escape in native, [[Back]] in Spectre); the
Spectre wizard wraps each prompt in Invoke-LLMWizardStep and logs the
full exception trace to ~/.local-llm/wizard-errors.log if anything throws,
so a Spectre live-display refresh can't scroll the trace off screen. Inspect
with llmlogerr [-Lines 80]; reset with llmlogerrclear. The launch debug
trace (vision, proxy, llama-server, Claude launches) is recorded in
~/.local-llm/launch.log and tailable with llmlog [-Lines 80].
After a model is selected, the Spectre wizard waits briefly before drawing the
next prompt and retries one fast-empty transition. Tune that guard with
LOCAL_LLM_SPECTRE_PROMPT_COOLDOWN_MS (default 500, max 5000).
llms launches the Spectre wizard explicitly. llmc remains an explicit
native-picker alias.
$env:LOCAL_LLM_SPECTRE_PROMPT_COOLDOWN_MS = '750'
$env:LOCAL_LLM_NO_SPECTRE = '1' # disable Spectre everywhere / make llm use native

The repo mixes three styles intentionally:
- kebab-case for folders (local-llm/, ollama-proxy/): matches their deployed path.
- PascalCase for the entry-point script (LocalLLMProfile.ps1): PowerShell convention.
- kebab-case for data files (llm-models.json).
These names are user-visible (the deployed paths). Renaming them would break setups, so they stay.
- init says Ollama version too old → winget upgrade Ollama.Ollama.
- Refusing -Q8 with -Ctx 256 ... → drop -Q8, lower -Ctx, or raise the ceiling: Set-LocalLLMSetting Q8KvMaxContext <tokens>.
- <model> does not advertise tool support → RequireAdvertisedTools is on by default; some uncensored Qwen variants don't advertise capabilities. Verify with ollama show <alias>. Bypass for one launch with Start-ClaudeWithOllamaModel -Model <alias> -SkipToolCheck, or globally via Set-LocalLLMSetting RequireAdvertisedTools $false.
- Stale aliases after editing a parser → init -Stale rebuilds only the aliases whose Modelfile content hash drifted.
- Spectre wizard crashed or stalls → llmlogerr for the full trace; use llmlog for launch/debug details (vision, proxy, llama-server, Claude); llmc for the native picker or set $env:LOCAL_LLM_NO_SPECTRE=1 to disable Spectre everywhere. If the next prompt appears too slowly after selecting a model, lower $env:LOCAL_LLM_SPECTRE_PROMPT_COOLDOWN_MS; if the next prompt is skipped or comes back empty, raise it.
- bun not on PATH → only required for Unshackled launches. Install via winget install Oven-sh.Bun.
- CHANGELOG.md: what shipped, when.