David-c0degeek/LocalBox

██╗      ██████╗  ██████╗ █████╗ ██╗     ██████╗  ██████╗ ██╗  ██╗
██║     ██╔═══██╗██╔════╝██╔══██╗██║     ██╔══██╗██╔═══██╗╚██╗██╔╝
██║     ██║   ██║██║     ███████║██║     ██████╔╝██║   ██║ ╚███╔╝ 
██║     ██║   ██║██║     ██╔══██║██║     ██╔══██╗██║   ██║ ██╔██╗ 
███████╗╚██████╔╝╚██████╗██║  ██║███████╗██████╔╝╚██████╔╝██╔╝ ██╗
╚══════╝ ╚═════╝  ╚═════╝╚═╝  ╚═╝╚══════╝╚═════╝  ╚═════╝ ╚═╝  ╚═╝
  Put a local LLM behind the Claude Code / Unshackled harness

LocalBox

A PowerShell-driven launcher that runs Claude Code (or the Unshackled fork) against a local model — Ollama or llama.cpp — with the right Modelfile / chat template, KV-cache type, sampling, system prompt, and tool allowlist for each model family.

Windows / PowerShell only. Does not work in WSL/bash. The launcher controls the Ollama and llama-server lifecycles, drives Start-Process, manages $PROFILE, and reads nvidia-smi. None of that travels cleanly across shells.


Related projects

  • LocalBox is this launcher: it runs local Ollama and llama.cpp models through Claude Code or Unshackled.
  • BenchPilot is the companion optimizer: it benchmarks local models and exports recommended launcher profiles.
  • Unshackled is the free Claude Code fork that the launcher can target with -Unshackled.

What this is

The vendored Anthropic models (Opus, Sonnet, Haiku) are good. They're also paid, rate-limited, hosted, and out of your control. A local model running through the same agent harness gets you the Claude Code editing loop, tool-calling discipline, and CLI ergonomics — but pointed at weights you actually own.

That sounds simple. In practice it isn't:

  • Each model family wants a different chat template, sampler, and stop set. Qwen3-Coder needs the qwen3-coder parser; Qwen 3.6 wants qwen36; Devstral self-templates and you must pass Parser: none or it fights the GGUF.
  • Anthropic's wire format carries thinking / reasoning blocks that Ollama's /v1/messages endpoint can't ingest. The launcher routes traffic through a small Python proxy (no-think-proxy.py) that strips them on the way in. For llama.cpp strip-mode launches it also passes --reasoning off and --reasoning-budget 0 so hidden thinking tokens are not generated in the first place. Thinking-trained models (ThinkingPolicy: keep) bypass the proxy.
  • VRAM math is non-trivial. Q8 KV at 256 k tokens OOMs a 4090. Q4_K_M weights leave room for KV but lose precision on coding. The launcher tags every quant with [fits] / [tight] / [over] against your actual card and refuses combinations that will OOM, telling you what to drop.
  • One alias per (model, context, quant) — built lazily. Ollama bakes num_ctx into the Modelfile at create time, so qcoder30 at 32 k and at 256 k are physically different aliases. The launcher generates and tracks them; a sidecar version stamp catches drift when parsers or contexts change.
  • Agent launches are single-session by default. llama.cpp can serve multiple slots, but Claude/Unshackled side requests compete with the main turn when auto-parallelism is left on. LocalBox launches agent sessions with --parallel 1 and prompt-cache reuse by default so repeated large prompts stay local to one slot. Both values are configurable in settings.json.
  • Two harnesses, one dispatch path. Whether you launch Claude Code or Unshackled, the same env stack and proxy are set up; the -Unshackled switch on every model function only selects which harness is started.

The end result: one PowerShell function per model, flag-based, with the fiddly bits (process bouncing, env restoration, cache types, KV ceilings, tool allowlists, system prompts, parser stamps) hidden behind it.

qcoder -Ctx 32k -Unshackled   # Qwen3-Coder @ 32k → Unshackled
q36p -Ctx 128k                # Qwen 3.6 Plus @ 128k → Claude Code
qcoder -Ctx 256 -Quant iq4xs  # 256k coder context (4090 ceiling)
llmdefault                    # whatever the catalog / settings / .llm-default says
llm                           # interactive wizard (Spectre when available)
llmc                          # native selectable wizard
llms                          # Spectre wizard, explicit
info                          # dashboard: VRAM fit, parser freshness, defaults
info -Commands                # full LocalBox + BenchPilot command list

Harness mode

A harness is the agent loop wrapping the model — the thing that turns raw generation into "read this file, run that command, edit this code, then ask the user". Claude Code is one such harness. Unshackled is a fork of it. ollama run is not a harness; it's a chat REPL.

This launcher's whole job is to make a local model usable inside a real harness. The supported modes are described below:

Claude Code harness (default)

qcoder -Ctx 32k               # qcoder is the per-model function name

What happens:

  1. The launcher snapshots and clears any ANTHROPIC_* env vars in the current shell.
  2. Starts the no-think proxy on 127.0.0.1:11435 (Python; ~300 ms cold).
  3. Bounces Ollama with the right OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE.
  4. Builds (or reuses) the Ollama alias for (model, context) with the correct parser-derived Modelfile.
  5. Sets ANTHROPIC_BASE_URL=http://localhost:11435, points ANTHROPIC_DEFAULT_*_MODEL at the alias, disables thinking + prompt caching, bumps API_TIMEOUT_MS to 30 min (local prefill is slow on big prompts).
  6. Launches claude --model <alias> --dangerously-skip-permissions [--tools <allowlist>] --append-system-prompt <local-tool-rules>.
  7. On exit, restores the original env and stops the proxy.
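The snapshot-and-restore behaviour in steps 1, 5, and 7 can be sketched as a small context manager. This is an illustration in Python, not the launcher's PowerShell code; the function name `anthropic_env` is hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def anthropic_env(overrides):
    """Temporarily replace every ANTHROPIC_* variable, restoring on exit."""
    # Step 1: snapshot and clear whatever the shell already had.
    saved = {k: v for k, v in os.environ.items() if k.startswith("ANTHROPIC_")}
    for name in saved:
        del os.environ[name]
    # Step 5: apply the launch-time values (base URL, model names, ...).
    os.environ.update(overrides)
    try:
        yield
    finally:
        # Step 7: drop the launch-time values, then put the originals back.
        for name in [k for k in os.environ if k.startswith("ANTHROPIC_")]:
            del os.environ[name]
        os.environ.update(saved)
```

Anything started inside the `with anthropic_env({...}):` block sees only the launch-time values, and the `finally` clause restores the shell's own values even if the launch fails.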

The model believes it's Claude. Claude Code believes it's talking to Anthropic. The proxy quietly strips Anthropic-only fields the local backend can't parse.
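As a rough picture of what "stripping" means, the sketch below removes thinking content blocks and the request-level thinking setting from an Anthropic-style Messages request body. It is an assumption about the general shape of the work, not the contents of no-think-proxy.py, which may handle more fields.

```python
# Content block types that carry model reasoning in the Messages format.
THINKING_TYPES = {"thinking", "redacted_thinking"}


def strip_thinking(body: dict) -> dict:
    """Return a copy of a Messages request without thinking blocks."""
    out = dict(body)
    out.pop("thinking", None)  # request-level thinking configuration
    messages = []
    for msg in body.get("messages", []):
        content = msg.get("content")
        if isinstance(content, list):
            # Keep every block except the reasoning ones.
            content = [
                block for block in content
                if not (isinstance(block, dict)
                        and block.get("type") in THINKING_TYPES)
            ]
        messages.append({**msg, "content": content})
    out["messages"] = messages
    return out
```

Plain string content passes through untouched, and the input dictionary is not modified.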

Unshackled harness

Same flow, except the launch shells out to bun src/entrypoints/cli.tsx in an Unshackled checkout instead of claude. If the configured UnshackledRoot doesn't exist, the first launch offers to git clone from UnshackledRepoUrl (default https://github.com/David-c0degeek/unshackled). Decline to abort.

qcoder -Ctx 32k -Unshackled

Use -Unshackled when you want the local model behind the Unshackled harness instead of Claude Code. The launcher treats it as a peer to Claude Code: same env stack, same proxy, same tool restrictions.

Remote Unshackled Gateway

Choose Remote in llm, or use llmremote, to serve a model from this machine to normal Unshackled clients on another machine. The server starts the selected local backend and exposes only the LocalBox no-think gateway; Ollama and llama.cpp stay bound to localhost.

$env:LOCAL_LLM_REMOTE_PASS = "chosenpass"
llmremote -Key qcoder30 -ContextKey 32k -Backend ollama

After startup, LocalBox opens a remote monitor with the gateway status and live request log. Press Q to return to the menu while leaving the server running, or S to stop the gateway and backend. Use llmremote -NoMonitor for scripted or detached starts.

On the client, no LocalBox helper is required. Set the Anthropic-compatible environment variables and start the regular Unshackled CLI:

export ANTHROPIC_BASE_URL="http://192.168.178.61:11435"
export ANTHROPIC_AUTH_TOKEN="chosenpass"
export ANTHROPIC_API_KEY="chosenpass"
unshackled

Password-only HTTP is convenient for LAN testing, but the traffic is not encrypted. Over a public IP the password and prompts can be observed in transit unless you put a VPN or HTTPS reverse proxy in front of it.

Strict overlay (engineering harness)

Some models in the catalog have Strict: true. For those, init builds a second alias — <root>-strict — that derives FROM <root>:latest and overlays two things:

  • Tighter sampling: temperature 0.2, top_p 0.8, top_k 20, min_p 0.05, repeat_penalty 1.15, repeat_last_n 4096.
  • A non-negotiable engineering system prompt with rules the model is held to before every turn:

    Do not create mocks, stubs, fake data, dummy implementations, placeholder services, TODO implementations, temporary bypasses, hardcoded sample responses, or NotImplementedException. Do not invent new architecture, schema fields, configuration properties, or abstractions unless they fit existing patterns. Do not make tests pass by weakening, bypassing, deleting, or faking real behavior. Reuse existing architecture and production code paths. If the real implementation is missing, blocked, or ambiguous: stop and explain what is missing instead of inventing a substitute.

The strict alias is parser-agnostic — inheritance via FROM carries the RENDERER/PARSER/template forward, and the overlay only overrides SYSTEM + sampling. Any model family works without per-parser branching.
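A minimal sketch of what such an overlay Modelfile could look like, rendered from Python. FROM, PARAMETER, and SYSTEM are standard Ollama Modelfile instructions and the sampling values are the ones listed above; the helper name `strict_modelfile` and the exact layout are assumptions, not the launcher's generated output.

```python
# Sampling values from the strict overlay description.
STRICT_SAMPLING = {
    "temperature": 0.2,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.05,
    "repeat_penalty": 1.15,
    "repeat_last_n": 4096,
}


def strict_modelfile(root: str, system_prompt: str) -> str:
    """Render a Modelfile that inherits from <root>:latest."""
    lines = [f"FROM {root}:latest"]
    # Only sampling and SYSTEM are overridden; the template, renderer,
    # and parser are inherited through FROM.
    lines += [f"PARAMETER {name} {value}"
              for name, value in STRICT_SAMPLING.items()]
    lines.append(f'SYSTEM """{system_prompt}"""')
    return "\n".join(lines) + "\n"
```

Writing the result to a file and running `ollama create <root>-strict -f <file>` would register the sibling alias.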

The same overlay is wired into the llama.cpp path: pass -Strict and Build-LlamaServerArgs injects the strict sampler flags. The strict system prompt still comes from the Claude/Unshackled launch harness, not from a llama-server-specific system-prompt flag.

When to use it. Strict overlay is for actual engineering work where the model's lazy paths (mock, stub, "// TODO", placeholder JSON) cost real time. Skip it for chat, brainstorming, RAG-style Q&A.

Chat mode (no harness)

q36p -Chat                    # plain `ollama run`, no Claude Code, no proxy

Useful for ad-hoc prompts, GGUF smoke tests, and bench comparisons. No tool calls, no agent loop, no env-var dance. llama.cpp doesn't have a built-in chat REPL; for that backend, use the launch-claude action and point Claude Code or the llama-server web UI at the running server.


Backends

The launcher dispatches to one of two backends per launch.

Ollama (default)

The original target. Ollama manages the model server, GGUF storage, alias namespace, and KV cache. The launcher writes Modelfiles, runs ollama create, sets OLLAMA_FLASH_ATTENTION=1 plus an optional OLLAMA_KV_CACHE_TYPE, and expects Ollama ≥ the catalog's MinOllamaVersion. Models with SourceType: remote pull from ollama.com; SourceType: gguf materializes a local GGUF (resolved via HuggingFace or reused from an existing Ollama copy) and creates the alias with FROM <gguf-path>.

llama.cpp

For models the catalog marks SourceType: gguf, and does not mark LlamaCppCompatible: false, the wizard offers a llama.cpp path in two modes:

  • native — upstream llama-server.exe. Mainline KV types only (q8_0, f16, q5_1, q5_0, q4_1, q4_0, iq4_nl, bf16, f32).
  • turboquant — TheTom's llama.cpp turboquant fork, which ships turbo3 and turbo4 KV cache types (more aggressive than q4_0 but with a quality cliff that's a function of context length). Only available through the fork binary.

Both modes start a llama-server process (the upstream binary or the fork build), pin to a free port from LlamaCppPort (default 8080), wait for /v1/models to come up, then point Claude Code at http://localhost:<port>. The Ollama daemon is shut down during the launch unless LlamaCppCoexistOllama is set (single-GPU systems don't have the VRAM headroom to run both).
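The port pick and readiness wait can be pictured like this. It is a sketch under assumptions: scanning upward from the configured port and the timeout values are illustrative choices, and both function names are hypothetical.

```python
import socket
import time
import urllib.request


def find_free_port(start: int = 8080, tries: int = 50) -> int:
    """Return the first port at or above `start` that can be bound locally."""
    for port in range(start, start + tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
                return port
            except OSError:
                continue  # already in use, try the next one
    raise RuntimeError(f"no free port in {start}..{start + tries - 1}")


def wait_for_models(port: int, timeout: float = 120.0,
                    interval: float = 0.5) -> bool:
    """Poll /v1/models until it answers 200 or the timeout expires."""
    url = f"http://127.0.0.1:{port}/v1/models"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False
```

Polling the same endpoint the harness will use is a simple way to avoid handing Claude Code a server that is still loading weights.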

# Wizard route — pick backend interactively
llm

# Direct (catalog must list a gguf model with no LlamaCppCompatible: false)
Invoke-Backend -Action launch-claude -Backend llamacpp `
  -Key qcoder30 -ContextKey 256 `
  -LlamaCppMode turboquant -KvCacheK turbo4 -KvCacheV turbo4 -Strict

lps                           # show running llama-server (port, pid, gguf path)
lstop                         # stop it

Use llm-stop (or llmstop) to stop both Ollama and llama.cpp without restarting either backend.

llama.cpp is the path you want when:

  • You need a KV cache type Ollama doesn't expose (turbo4, iq4_nl).
  • You want explicit control over --n-cpu-moe, --mlock, --no-mmap.
  • You're running a quant Ollama refuses to load.
  • You want the strict overlay applied at launch time rather than baked into a Modelfile.

Otherwise Ollama is the simpler default: the alias namespace, ollama ps, ollama show, and the bench history all assume Ollama.


Architecture

The repo ships in two folders that map to two deployed locations:

repo                              deployed
local-llm/      ─── install ──→   %USERPROFILE%\.local-llm\
ollama-proxy/   ─── install ──→   %USERPROFILE%\.ollama-proxy\

local-llm/
  LocalLLMProfile.ps1   minimal entry point — dot-sourced by $PROFILE
  llm-models.json       model catalog (committed, sharable)
  lib/
    00-settings.ps1     config loader, settings.json overlay, env names
    10-helpers.ps1      pwsh utility primitives (Section/Pause/Convert paths)
    20-models.ps1       model-def access, alias naming, strict-sibling helpers
    25-vram.ps1         nvidia-smi auto-detect, fit-class arithmetic
    30-ollama.ps1       ollama lifecycle (start/stop/wait/env/version probe)
    32-llamacpp.ps1     llama-server lifecycle (port pick, health, session)
    33-llamacpp-install.ps1   resolve native + turboquant llama-server binaries
    35-backend.ps1      Invoke-Backend dispatcher (ollama vs llamacpp)
    40-parsers.ps1      per-family chat template / sampler / strict overlay
    41-llamacpp-args.ps1   pure argv builder for llama-server
    42-llamacpp-templates.ps1  parser → llama-server flag mapping, strict file
    45-profile-version.ps1  Modelfile content hash for staleness detection
    50-modelfile.ps1    Ollama alias creation + lifecycle (incl. strict siblings)
    55-huggingface.ps1  HF repo discovery, GGUF download, quant code recognition
    60-catalog.ps1      catalog editor (addllm/updatellm/removellm/setllm)
    65-claude-launch.ps1   Claude/Unshackled launcher; env save/restore, proxy
    70-bench.ps1        ospeed → bench-history.jsonl, Show-LLMBenchHistory
    75-display.ps1      info dashboard (Spectre + plain-text fallbacks)
    80-init.ps1         init/initmodel/purge/ostop/qkill/ops
    85-shortcuts.ps1    per-model function generator, default-key resolution
    90-wizard.ps1       native selectable + Spectre interactive wizards
    99-entrypoints.ps1  llm/llmmenu/llmc/llms/reloadllm/lps/lstop

ollama-proxy/
  no-think-proxy.py     strips Anthropic thinking/reasoning blocks
  enforcer-claude.ps1   wrapper that re-enters the local backend on Claude → Claude calls

LocalLLMProfile.ps1 dot-sources every lib/*.ps1 in numeric prefix order, loads llm-models.json overlaid with ~/.local-llm/settings.json, and registers per-model shortcut functions. Everything else hangs off that.


Install

From the repo root:

. .\install.ps1                  # copy files to deployed locations + add to $PROFILE
. .\install.ps1 -Symlink         # symlink instead of copy (admin / dev mode)
. .\install.ps1 -SetupProfile    # only ensure $PROFILE dot-sources the deployed file
. .\install.ps1 -InstallBenchPilot   # also clone BenchPilot if missing
. .\install.ps1 -InstallUnshackled   # also clone Unshackled if missing
. .\install.ps1 -DryRun          # preview without changing anything

After install, open a fresh PowerShell. Two things to do:

init                             # build aliases for the recommended-tier models
info                             # verify: VRAM, default model, parser freshness

The install step offers to clone missing BenchPilot and Unshackled checkouts into ~/.local-llm/tools/. Use -SkipToolPrompts for unattended installs. Show-Diagnostics also reports on ollama, python, bun (only needed for Unshackled), PwshSpectreConsole, BenchPilot, and Unshackled. Installs also record LocalBoxRoot in settings.json, which lets llm-update pull this repo and redeploy the profile files later.


Day-to-day usage

One function per model. Flag-based:

qcoder -Ctx 32k -Unshackled   Code agent (Qwen3-Coder, 32k, Unshackled)
qcoder -Ctx 32k -Codex        Code agent (Qwen3-Coder, 32k, Codex)
q36p -Ctx 32k -Unshackled     General Qwen 3.6 agent (32k, Unshackled)
dev -Ctx 32k                  Smaller / faster (Devstral 24B, 32k)
q36p -Ctx 128k -Unshackled    Big context (Qwen 3.6 Plus, 128k)
qcoder -Ctx 256 -Quant iq4xs  256k coder context (4090 ceiling — no -Q8)
q36p -Chat                    Raw ollama chat, no Claude Code
q36p -Q8                      Use q8 KV cache for higher quality
q36p -Quant q6kp              Switch the GGUF quant (rebuilds aliases)
llmdefault                    Launch the configured default recipe/model
llmdefaultunshackled          Same, via Unshackled
llmdefaultcodex               Same, via Codex
llmdefaultchat                Same, plain chat
llm                           Guided wizard (Spectre when available)
llmc                          Native selectable wizard, explicit alias
llms                          Spectre wizard, explicit alias
info                          Dashboard
info -Commands                Full LocalBox + BenchPilot command list
llmdocs                       Quick reference
llm-update                    Update LocalBox + installed companion checkouts

Flag            Effect
-Ctx <name>     One of the model's context keys (32k, 64k, 128k, 256k). Omit for default.
-Unshackled     Use Unshackled instead of Claude Code.
-Codex          Use OpenAI Codex instead of Claude Code.
-Chat           Run plain ollama run, skip Claude Code entirely.
-Q8             Set OLLAMA_KV_CACHE_TYPE=q8_0 for this launch. Refused above the VRAM-derived Q8KvMaxContext ceiling, because q8 KV at long context will OOM.
-Quant <name>   Switch the model's selected GGUF quant. No launch; rebuilds the alias.

256 k context on a 24 GB card

The combination of Qwen3-Coder-30B-A3B Heretic (4 KV heads, 48 layers) at the IQ4_XS quant with q4_0 KV cache is the only setup that fits a full 256k context on a single 4090:

qcoder -Ctx 256 -Quant iq4xs        # Claude Code @ 256k
qcoder -Ctx 256 -Quant iq4xs -Unshackled  # Unshackled @ 256k

Weights ~16.5 GB; q4_0 KV @ 256k ~6 GB; total ~23.6 GB. The launcher will refuse -Q8 at this context because q8 KV would push KV cache to ~12 GB and OOM the card.

Run llmdocs for the full quick reference, info for the dashboard, or info -Commands for the generated model commands plus LocalBox and BenchPilot management commands.
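The ~6 GB and ~12 GB KV figures follow from simple arithmetic on the model shape. The sketch below assumes a head dimension of 128, which is not stated above, and counts an idealized 4 or 8 bits per cached value. Real q4_0 and q8_0 blocks also store a scale per 32 values (4.5 and 8.5 bits per value), and the runtime needs compute buffers on top, so actual usage runs somewhat higher.

```python
def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Idealized KV-cache size in GiB for a given context length."""
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return values * bits_per_value / 8 / 2**30


ctx = 256 * 1024  # 256k tokens
shape = dict(n_layers=48, n_kv_heads=4, head_dim=128)  # head_dim is assumed

q4 = kv_cache_gib(ctx, bits_per_value=4, **shape)
q8 = kv_cache_gib(ctx, bits_per_value=8, **shape)
print(f"4-bit KV ~ {q4:.1f} GiB, 8-bit KV ~ {q8:.1f} GiB")
# prints "4-bit KV ~ 6.0 GiB, 8-bit KV ~ 12.0 GiB"
```

Doubling the bits per value doubles the cache, which is why q8 KV cannot sit alongside ~16.5 GB of weights on a 24 GB card at this context.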


Adding a model

addllm <hf-url-or-repo> -Key <key> [-Quants Q4_K_P,IQ4_XS] [-DefaultQuant Q4_K_P] [-Tier recommended]
initmodel <key>

addllm registers every recognized GGUF quant the HF repo publishes by default (the imatrix.gguf calibration file is excluded). Pass -Quants only when you want to filter the catalog entry to a subset.

Backfilling missing quants on an existing entry (rerunning HF discovery without overwriting your manual QuantNotes / ContextNotes):

updatellm <key>            # adds any HF quants missing from the entry
updatellm <key> -DryRun    # preview without writing

Removing a model:

removellm <key>            # confirms first
removellm <key> -Force     # skip confirmation
removellm <key> -KeepFiles # keep the GGUF blobs on disk

VRAM-aware tradeoffs

The launcher reads your GPU's VRAM and uses it to:

  1. Tag every quant with [fits] / [tight] / [over] in info and the llm wizard, so you can see at a glance which builds will load fully on your card.
  2. Set the Q8KvMaxContext ceiling — the largest num_ctx that pairs safely with -Q8 (q8_0 KV cache). Roughly +16k tokens of headroom per GB above 16 GB; floors at 64 k. The guard refuses launches that would exceed this and tells you what to drop.
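One reading of the headroom rule, as a sketch: start from the 64k floor at 16 GB and add 16k tokens per extra GB. The exact formula in the launcher is not shown here, so treat the function below as an assumption; it does reproduce the 196608 value used in the examples for a 24 GB card.

```python
def q8_kv_max_context(vram_gb: float) -> int:
    """Largest num_ctx considered safe with q8_0 KV (assumed formula)."""
    floor = 64 * 1024                      # never go below 64k
    extra_gb = max(0.0, vram_gb - 16)      # headroom above 16 GB
    return floor + int(extra_gb * 16 * 1024)


def q8_allowed(num_ctx: int, vram_gb: float) -> bool:
    """True when -Q8 would be accepted for this context size."""
    return num_ctx <= q8_kv_max_context(vram_gb)
```

Under this reading, `q8_allowed(256 * 1024, 24)` is false, which matches the launcher refusing -Q8 at 256k on a 24 GB card.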

VRAM resolves in this order:

  1. VRAMGB set in settings.json or llm-models.json (top-level).
  2. nvidia-smi --query-gpu=memory.total auto-detect (largest GPU on a multi-GPU box).
  3. Fallback to 24.

The info dashboard shows the resolved value and source (auto / configured / fallback).
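The resolution order can be sketched as a single function. It takes the text that `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits` would print (one MiB value per GPU) rather than running the tool, so it can be tried on any machine. Rounding to whole GB and the function name are assumptions.

```python
def resolve_vram_gb(configured, smi_output):
    """Return (vram_gb, source) following the three-step order above."""
    # 1. An explicit VRAMGB setting always wins.
    if configured:
        return float(configured), "configured"
    # 2. Otherwise take the largest GPU reported by nvidia-smi (MiB per line).
    sizes = []
    for line in (smi_output or "").splitlines():
        line = line.strip()
        if line.isdigit():
            sizes.append(int(line))
    if sizes:
        return round(max(sizes) / 1024), "auto"
    # 3. Nothing configured and nothing detected: fall back to 24.
    return 24, "fallback"
```

For example, `resolve_vram_gb(None, "24564\n12288\n")` picks the larger card and reports the source as auto.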

Set-LocalLLMSetting VRAMGB 32          # 5090
Set-LocalLLMSetting VRAMGB 48          # RTX 6000 Ada / dual-card aggregate
Set-LocalLLMSetting VRAMGB $null       # remove override, fall back to auto-detect
Set-LocalLLMSetting Q8KvMaxContext 196608   # pin the q8 ceiling explicitly

Per-quant tradeoffs come from two optional catalog fields:

  • QuantSizesGB — file size per quant in GB (drives the fit badge).
  • QuantNotes — human-readable note per quant (quality/use-case context). Shown verbatim.

Per-context guidance comes from ContextNotes in the same shape. Backfill these on any model you add — they show up inline in info and the wizard.


Per-machine settings (settings.json)

llm-models.json is the model catalog: committed, sharable. Per-machine paths and preferences belong in a sibling settings.json at ~/.local-llm/settings.json (gitignored). At load time its top-level scalars override the matching top-level scalars in the catalog, so you don't have to hand-edit llm-models.json to fix paths on a fresh machine.

Use the helper instead of editing JSON:

Set-LocalLLMSetting UnshackledRoot '<path-to-unshackled>'   # usually auto-set by install.ps1
Set-LocalLLMSetting BenchPilotRoot '<path-to-benchpilot>'   # usually auto-set by install.ps1
Set-LocalLLMSetting LocalBoxRoot '<path-to-LocalBox>'        # auto-set by install.ps1
Set-LocalLLMSetting Default q36plus
Set-LocalLLMSetting KeepAlive '5m'
Set-LocalLLMSetting VRAMGB 32                        # override auto-detect
Set-LocalLLMSetting Q8KvMaxContext 196608            # pin the -Q8 ceiling
Set-LocalLLMSetting LlamaCppDefaultMode native       # or 'turboquant'
Set-LocalLLMSetting LlamaCppCoexistOllama $true      # rare: allow both backends concurrently
Set-LocalLLMSetting LlamaCppNCpuMoe 35               # MoE expert CPU offload (default 35; 0 to disable)
Set-LocalLLMSetting LlamaCppMlock $false             # disable RAM locking (default $true)
Set-LocalLLMSetting LlamaCppNoMmap $false            # disable no-mmap (default $true)
Set-LocalLLMSetting LlamaCppAgentParallel 1          # agent slots (default 1; 0 = llama.cpp auto)
Set-LocalLLMSetting LlamaCppAgentCacheReuse 256      # prompt-cache reuse chunk size (default 256; 0 = llama.cpp default)
Set-LocalLLMSetting LocalModelMaxOutputTokens 4096   # cap local Claude/Unshackled completions (0 = tool default)
Set-LocalLLMSetting UnshackledRoot $null             # remove an entry

The Models and CommandAliases keys are catalog-only and rejected by Set-LocalLLMSetting. Everything else is fair game.
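The overlay itself amounts to a shallow merge. The sketch below is a guess at the behaviour described here: scalar values from settings.json win, and the two catalog-only keys are never taken from settings. The function name is hypothetical.

```python
CATALOG_ONLY = {"Models", "CommandAliases"}
SCALARS = (str, int, float, bool)


def overlay_settings(catalog: dict, settings: dict) -> dict:
    """Return the catalog with per-machine top-level scalars applied."""
    merged = dict(catalog)
    for key, value in settings.items():
        if key in CATALOG_ONLY:
            continue  # model definitions live only in llm-models.json
        if isinstance(value, SCALARS):
            merged[key] = value
    return merged
```

Keeping the merge shallow means a stray nested object in settings.json cannot silently replace part of the model catalog.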

Per-workspace default model

Drop a .llm-default file in any directory containing a single line: a model key, ShortName, or Root. llmdefault (and the enforcer wrapper) walks up from $PWD and uses the nearest match. If no file is found, it falls back to the Default in settings, then to the catalog Default.

echo q36p > .llm-default          # this workspace prefers Qwen 3.6 Plus
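The walk-up lookup can be sketched in a few lines. This is an illustration, not the launcher's code: it assumes the file is UTF-8 (with or without a BOM), and the function name is hypothetical.

```python
from pathlib import Path


def find_workspace_default(start: Path):
    """Return the key from the nearest .llm-default at or above `start`."""
    start = start.resolve()
    for directory in (start, *start.parents):
        marker = directory / ".llm-default"
        if marker.is_file():
            lines = marker.read_text(encoding="utf-8-sig").splitlines()
            key = lines[0].strip() if lines else ""
            if key:
                return key  # nearest match wins
    return None  # caller falls back to settings, then the catalog
```

Because the search starts in the current directory and moves outward, a nested project can override the default set by a parent folder.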

MCP servers

Claude Code's MCP servers expose tools with names like mcp__<server>__<tool>. They reach the local model through the same launch path:

  • Models with "LimitTools": false (e.g. dev) get every MCP tool automatically — the --tools flag isn't passed.
  • Models with "LimitTools": true (default) only see tools in the allowlist. Add the MCP tool names you want to either the global LocalModelTools field in llm-models.json or a per-model Tools override.

Example per-model override:

"q36plus": {
  ...,
  "Tools": "Bash,Read,Write,Edit,Glob,Grep,mcp__filesystem__read_file,mcp__filesystem__write_file"
}

info shows a Tools : ... line for any model that overrides the global list.
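Put together, the allowlist rules reduce to a small decision. The sketch below uses the catalog field names from this section (LimitTools, Tools, LocalModelTools); the function name and the argument-list shape are assumptions.

```python
def tools_args(model: dict, global_tools: str) -> list:
    """Build the --tools portion of the harness command line."""
    # LimitTools defaults to true; false means no --tools flag at all,
    # so every tool (including MCP tools) stays visible.
    if model.get("LimitTools", True) is False:
        return []
    # A per-model Tools override replaces the global LocalModelTools list.
    allow = model.get("Tools") or global_tools
    return ["--tools", allow] if allow else []
```

A model with its own Tools string therefore needs the MCP tool names listed there explicitly, since the global list is no longer consulted for it.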


Bench history

ospeed <model> appends one JSONL line per run to ~/.local-llm/bench-history.jsonl. View with:

obench                            # last 20 entries, all models
obench -Model q36plus -Last 50    # filter by model
Trim-LLMBenchHistory -OlderThanDays 90 -DryRun   # preview pruning
Trim-LLMBenchHistory -OlderThanDays 90           # apply pruning

BenchPilot auto-tuner (findbest)

findbest is a LocalBox compatibility command that delegates tuning to BenchPilot. LocalBox no longer contains a legacy benchmark/search implementation; BenchPilot is the single owner of benchmark execution, scoring, reports, and AutoBest export.

BenchPilot writes a LocalBox-compatible result to ~/.local-llm/tuner/best-<key>.json, and Start-ClaudeWithLlamaCppModel -AutoBest replays that saved profile.

Standard catalog context aliases are 32k, 64k, 128k, and 256k unless a model explicitly lacks support. AutoBest profiles are context-aware: the saved entry records both contextKey and the resolved contextTokens, and launcher selection still requires the same context key.

# Tune q36plus at the 256k context preset, native llama.cpp, default budget.
# Default goal is coding-agent: long-prefill end-to-end latency.
findbest q36plus -ContextKey 256k

# Quick mode — only baseline + n-cpu-moe + batching (~10 trials)
findbest q36plus -ContextKey 256k -Quick

# Deep mode — normal phases, then finer local offload/batch/thread refinement
findbest q36plus -ContextKey 256k -Deep

# Default sampling is three runs per candidate; override when needed
findbest q36plus -ContextKey 256k -Runs 5

# Save both the fastest raw profile and a workstation-friendly balanced profile
findbest q36plus -ContextKey 256k -Profile both

# Force the expanded beam search and keep three survivors after each phase
findbest q36plus -ContextKey 256k -SearchStrategy beam -BeamWidth 3

# Optimize for prompt-eval (prefill) or generation explicitly
findbest q36plus -ContextKey 256k -Optimize prompt
findbest q36plus -ContextKey 256k -Optimize gen

# Allow KV cache variation. Native mode defaults to the model's current type;
# turboquant mode always also tests turbo3/turbo4 KV cache encodings.
findbest q36plus -ContextKey 256k -AllowedKvTypes q8_0,f16

# Try mismatched K/V pairs too, and allow an explicit quality trade if wanted
findbest q36plus -ContextKey 256k -AllowedKvTypes q8_0,q4_0 -AggressiveKv

# Power-user: tune separate short- and long-prefill profiles
findbest q36plus -ContextKey 256k -PromptLengths short,long

# Inspect every trial run for a model
Show-LlamaCppTunerHistory -Key q36plus -Last 50

BenchPilot may use fast llama-bench probes where supported, but turboquant mode uses llama-server probes so turbo3 / turbo4 are measured through the same binary LocalBox will actually launch. Upstream llama-bench has KV-cache flags (-ctk / -ctv), but TurboQuant cache types only work in a fork/build that registers them; LocalBox's turboquant path uses TheTom's fork.

-Quant selects the GGUF model file and stays fixed during a tuner run. KvK and KvV are only runtime KV-cache encodings. In native mode BenchPilot defaults KV search to the model's current cache type unless you pass -AllowedKvTypes. In turboquant mode BenchPilot always includes turbo3 and turbo4 as KV-cache candidates and probes them early, because otherwise a turboquant "best" run may never test the backend's defining cache encodings.

During a run, BenchPilot prints one row per candidate:

Waiting for llama-server...
  [ 4] OK   pp= 645.95 tg=  58.39  score= 559.36  phase=moe          NCpuMoe=20 KvK=q8_0 KvV=q8_0

Live output legend:

  • [ 4] — trial number within this tuning run.
  • OK / OOM / FAIL — candidate status. OOM means the trial exhausted memory; FAIL means startup or probing failed for another reason.
  • pp — prompt-processing / prefill throughput, in tokens/sec. This measures how fast llama.cpp consumes the input prompt before generation starts. By default this is the average across three samples for that candidate.
  • tg — text-generation throughput, in tokens/sec after generation starts. This is also averaged across the candidate's samples.
  • score — the value BenchPilot is optimizing. With the default -Optimize coding-agent, this is effective end-to-end tokens/sec for a large coding-agent prompt plus a moderate generated reply, so higher is better. With -Optimize gen, it follows tg; with -Optimize prompt, it follows pp; with -Optimize both, it uses a balanced prompt/generation score. With -Profile balanced, BenchPilot stores the selected balanced score and also preserves the raw pure score for comparison.
  • phase — which search step produced the candidate. Common values are baseline, moe, ngl, batching, flash, mmap, threads, kv, verify, and, with -Deep, deep_moe, deep_ngl, deep_batching, deep_flash, and deep_threads.

Extra Name=value pairs at the end are the settings changed for that trial:

  • NCpuMoe — llama.cpp --n-cpu-moe, the number of MoE expert layers kept on CPU. Lower usually moves more expert work to GPU, which is faster but uses more VRAM.
  • NGpuLayers — llama.cpp -ngl / --n-gpu-layers, the dense-model layer offload count. Higher generally uses more GPU and more VRAM.
  • UbatchSize — llama.cpp --ubatch-size, the physical microbatch size used during prompt processing.
  • BatchSize — llama.cpp --batch-size, the logical prompt batch size.
  • Threads — llama.cpp --threads, CPU threads for generation/eval work.
  • ThreadsBatch — llama.cpp --threads-batch, CPU threads for prompt batch processing.
  • FlashAttn — flash-attention on/off.
  • Mlock — llama.cpp --mlock, requests locking model/cache pages in memory.
  • NoMmap — llama.cpp --no-mmap, loads model weights without memory mapping.
  • KvK — KV-cache type for attention keys, passed as -ctk.
  • KvV — KV-cache type for attention values, passed as -ctv.

KV-cache values are llama.cpp cache encodings such as f16, q8_0, q4_0, or, in TheTom turboquant builds, turbo3 / turbo4. These are not GGUF model quants like Q4_K_M, Q8_0, or APEX variants; they only change how the runtime stores attention keys/values. KV changes can affect quality, so BenchPilot runs a perplexity sanity check before saving a faster changed KV pair unless -AllowKvQualityRegression is passed.

Skip/pruning messages use the same ideas with shorter llama.cpp-style labels: ngl means NGpuLayers, ub means UbatchSize, b means BatchSize, flash means FlashAttn, threads means Threads / ThreadsBatch, and kv K=... V=... means the KvK / KvV pair being tested.

What gets searched:

BenchPilot supports two search strategies. greedy keeps only the current best candidate after each phase. beam keeps the top -BeamWidth candidates, so a near-winner from the MoE/NGL phase can still combine with later batching, flash-attention, mmap/mlock, and thread settings. Normal pure runs default to greedy with width 1. -Deep or any balanced profile defaults to beam with width 3 unless you pass -SearchStrategy / -BeamWidth explicitly.

  1. baseline — catalog defaults, one probe.
  2. moe_or_ngl — for MoE models, sweep --n-cpu-moe to find the smallest value that still fits VRAM (more layers on GPU = faster). For dense models with -ngl already at 999, this phase is a no-op unless baseline OOMed.
  3. batching — joint sweep of (--ubatch-size, --batch-size) over a small 2-D grid.
  4. flash — compares flash-attention on/off.
  5. mmap — compares --mlock --no-mmap against the default mapping mode (-Aggressive tries the cross-combinations too).
  6. threads — sweeps CPU thread counts when the winning config keeps MoE experts or dense layers on CPU.
  7. kv — runs when more than one KV-cache encoding is available. Native mode only widens this phase when -AllowedKvTypes is supplied. Turboquant mode always includes turbo3 and turbo4, and probes KV early enough that the normal MoE/batch grid cannot consume the whole default budget first. The tuner runs a small perplexity sanity check and refuses a >1% regression unless -AllowKvQualityRegression is passed.
  8. deep — optional (-Deep). Re-tests a finer local neighborhood around the winning offload value, expands the batch grid up to -ub 2048 / -b 4096, re-checks flash-attention after batch changes, and tries a wider CPU-thread set when CPU offload remains. If -Budget is omitted, deep mode raises the budget from 30 to 60.
  9. verify — re-runs the final winner through llama-server when a coarse bench phase was used.

The default -Optimize coding-agent score is effective end-to-end throughput for a local coding-agent request: large prompt prefill plus a moderate generated reply. That prevents decode-only winners with high generation TPS and poor prompt-processing TPS from being saved as "best".
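As a worked example, end-to-end throughput is total tokens divided by total time, where prefill and generation each run at their own speed. The token counts below are an assumption: a 64:1 prompt-to-reply ratio happens to reproduce the sample row shown earlier (pp=645.95, tg=58.39, score=559.36), but BenchPilot's actual workload sizes are not stated here.

```python
def coding_agent_score(pp: float, tg: float,
                       prompt_tokens: int = 8192,
                       reply_tokens: int = 128) -> float:
    """Effective tokens/sec for one prompt-plus-reply request."""
    # Time spent consuming the prompt, then time spent generating the reply.
    seconds = prompt_tokens / pp + reply_tokens / tg
    return (prompt_tokens + reply_tokens) / seconds


score = coding_agent_score(pp=645.95, tg=58.39)
print(f"{score:.2f}")
```

With a prompt this much larger than the reply, the score tracks pp closely, which is the intended bias for coding-agent workloads.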

-Profile pure selects the highest measured LLM throughput. -Profile balanced starts from that pure score and applies visible CPU, RAM, VRAM, variance, and stability factors so the saved launch leaves more workstation headroom. -Profile both runs and saves both profile types for the same target. Balanced reports include the pure score and score-factor breakdown when telemetry is available.

OOM/failure is detected from process output; OOM-monotonicity prunes branches that are guaranteed to fail. Saved entries include the GPU name and llama.cpp build stamp, so -AutoBest can warn when a re-tune is advisable after a hardware or llama.cpp upgrade. Saved entries also include profile, searchStrategy, beamWidth, pureScore, optional telemetry, and optional scoreBreakdown; older entries without profile are treated as pure. Tuner version changes require a re-tune.
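
OOM-monotonicity can be illustrated with the --n-cpu-moe sweep: once a value runs out of memory, every smaller value keeps even more on the GPU and does not need testing. This is a minimal sketch under that assumption; the function name and the $TestFits probe are hypothetical stand-ins for a real benchmark run.

```powershell
# Hypothetical sketch: $TestFits stands in for a real benchmark launch.
function Find-SmallestFittingCpuMoe {
    param(
        [Parameter(Mandatory)][int[]]$Candidates,
        [Parameter(Mandatory)][scriptblock]$TestFits   # returns $true when the run did not OOM
    )
    $best = $null
    # Walk from the most CPU offload (least VRAM) toward the least CPU offload.
    foreach ($value in ($Candidates | Sort-Object -Descending)) {
        if (& $TestFits $value) {
            $best = $value
        } else {
            # Monotonicity: every smaller value needs more VRAM, so prune the rest.
            break
        }
    }
    return $best
}

# Simulated probe: pretend any value below 12 runs out of VRAM.
$tried  = [System.Collections.Generic.List[int]]::new()
$result = Find-SmallestFittingCpuMoe -Candidates 0, 4, 8, 12, 16, 20 -TestFits {
    param($n)
    $tried.Add($n)
    $n -ge 12
}
```

In this simulation the values 4 and 0 are never launched, because the failure at 8 already rules them out.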

-PromptLengths short,long stores separate profile entries. -AutoBest defaults to auto: it prefers balanced when present, falls back to pure, and within each profile prefers long/coding-agent prompt profiles before short profiles. Use -AutoBestProfile pure or -AutoBestProfile balanced to force the selection profile. -AutoBestProfile short and -AutoBestProfile long remain available as legacy prompt-length overrides and force a pure profile.
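
The auto preference order reads naturally as two nested loops. The sketch below is an assumption about how that ordering could be expressed; the function name is hypothetical, and the entry shape uses only the profile and prompt_length fields for illustration.

```powershell
# Hypothetical sketch of the documented "auto" preference order.
function Select-AutoBestEntry {
    param([Parameter(Mandatory)][object[]]$Entries)

    $profileOrder = 'balanced', 'pure'   # balanced first, then pure
    $lengthOrder  = 'long', 'short'      # long/coding-agent prompts before short

    foreach ($p in $profileOrder) {
        foreach ($len in $lengthOrder) {
            $hit = $Entries | Where-Object {
                # Older entries without a profile are treated as pure.
                $entryProfile = if ($_.profile) { $_.profile } else { 'pure' }
                $entryProfile -eq $p -and $_.prompt_length -eq $len
            } | Select-Object -First 1
            if ($hit) { return $hit }
        }
    }
    return $null
}

$saved = @(
    [pscustomobject]@{ profile = 'pure';     prompt_length = 'long'  },
    [pscustomobject]@{ profile = 'balanced'; prompt_length = 'short' },
    [pscustomobject]@{ prompt_length = 'short' }   # legacy entry with no profile
)
$choice   = Select-AutoBestEntry -Entries $saved
$pureOnly = Select-AutoBestEntry -Entries @($saved[0], $saved[2])
```

Here the balanced short entry is chosen over the pure long entry, since profile preference is applied before prompt-length preference.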

Replaying the saved best:

Start-ClaudeWithLlamaCppModel -Key q36plus -ContextKey 256k -Mode native -AutoBest
Start-ClaudeWithLlamaCppModel -Key q36plus -ContextKey 256k -Mode native -AutoBest -AutoBestProfile balanced

The launcher matches the saved entry on (key, contextKey, mode, profile, prompt_length, quant, vramGB ± 1) and a tuner-version stamp; contextTokens is recorded as provenance for the actual num_ctx used by the run. On a miss it warns and falls through to defaults. Caller-supplied -KvCacheK / -KvCacheV / -ExtraArgs always win over the saved values.
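
The match rule can be sketched as an exact comparison on most fields plus a tolerance on VRAM. This is an illustration only: the function name, the tunerVersion field name, and the sample values (including the quant string) are assumptions, not the launcher's stored schema.

```powershell
# Hypothetical sketch of the saved-entry match described above.
function Test-AutoBestMatch {
    param(
        [Parameter(Mandatory)]$Entry,
        [Parameter(Mandatory)]$Request
    )
    # Fields that must match exactly; 'tunerVersion' is an assumed name for the stamp.
    $exactFields = 'key', 'contextKey', 'mode', 'profile', 'prompt_length', 'quant', 'tunerVersion'
    foreach ($field in $exactFields) {
        if ($Entry.$field -ne $Request.$field) { return $false }
    }
    # VRAM matches within 1 GB in either direction.
    return ([math]::Abs($Entry.vramGB - $Request.vramGB) -le 1)
}

# Sample values are illustrative, not taken from a real saved profile.
$entry = @{
    key = 'q36plus'; contextKey = '256k'; mode = 'native'; profile = 'balanced'
    prompt_length = 'long'; quant = 'Q4_K_M'; tunerVersion = 3; vramGB = 24
}
$nearVram = $entry.Clone(); $nearVram.vramGB = 23
$farVram  = $entry.Clone(); $farVram.vramGB  = 26

$hit  = Test-AutoBestMatch -Entry $entry -Request $nearVram
$miss = Test-AutoBestMatch -Entry $entry -Request $farVram
```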

Before handing an AutoBest llama.cpp session to Claude or Unshackled, LocalBox sends a tiny /v1/messages smoke request, including the same system prompt used for the real launch, through the same Anthropic-compatible route. The smoke must produce the requested visible answer; text hidden inside <think>...</think> does not count. If the no-think proxy route fails, LocalBox tries a direct llama-server route for that session. If both routes fail, launch stops immediately instead of starting an unusable spinner-only session.
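
The "visible answer" rule amounts to stripping think blocks before checking the reply. The sketch below is a minimal illustration, assuming well-formed <think>...</think> pairs; the function name is hypothetical and unclosed tags are not handled.

```powershell
# Hypothetical check: does the expected answer appear outside <think> blocks?
function Test-SmokeAnswerVisible {
    param(
        [Parameter(Mandatory)][AllowEmptyString()][string]$ResponseText,
        [Parameter(Mandatory)][string]$ExpectedAnswer
    )
    # (?s) lets '.' span newlines; the lazy match removes each think block separately.
    $visible = [regex]::Replace($ResponseText, '(?s)<think>.*?</think>', '')
    return $visible.Contains($ExpectedAnswer)
}

# The answer hidden inside the think block does not count.
$hiddenOnly = Test-SmokeAnswerVisible -ResponseText '<think>I should say OK</think>' -ExpectedAnswer 'OK'
$visibleOk  = Test-SmokeAnswerVisible -ResponseText "<think>plan`nthe reply</think>OK" -ExpectedAnswer 'OK'
```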

In the wizard, choose the llama.cpp backend and then Find best settings to run the same tuner interactively, with prompts for normal vs deep tuning, pure vs balanced vs both selection profiles, KV variation, saving the winner, and launching immediately with -AutoBest. When both pure and balanced profiles are saved, the launch-settings step shows separate Use balanced and Use pure choices, plus Use AutoBest for the default balanced-then-pure preference. After a -Profile both tuning run, the immediate-launch flow also asks which saved profile to use. Choose Delete best settings from the same action menu to remove saved AutoBest entries for the selected (model, quant, context, backend mode, VRAM) before re-tuning.

After a matching best config has been saved, normal wizard launches for the same (model, quant, context, backend mode, VRAM) automatically replay it and skip the manual KV-cache picker. If no matching entry exists, the wizard keeps the usual manual KV-cache selection.


Wizard

llm launches the Spectre picker when PwshSpectreConsole is available. Use llmc for the native selectable picker, which uses arrow keys + Enter while keeping number/letter shortcuts for fast selection. The wizard walks: model → quant → backend → context → action → q8/kvcache → launch. Each step has a Back option (0/Escape in native, [[Back]] in Spectre).

The Spectre wizard wraps each prompt in Invoke-LLMWizardStep and logs the full exception trace to ~/.local-llm/wizard-errors.log if anything throws, so a Spectre live-display refresh can't scroll the trace off screen. Inspect it with llmlogerr [-Lines 80]; reset it with llmlogerrclear. The launch debug trace (vision, proxy, llama-server, Claude launches) is recorded in ~/.local-llm/launch.log and can be tailed with llmlog [-Lines 80].

After a model is selected, the Spectre wizard waits briefly before drawing the next prompt and retries one fast-empty transition. Tune that guard with LOCAL_LLM_SPECTRE_PROMPT_COOLDOWN_MS (default 500, max 5000).
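
The default and maximum can be read as a clamp on the environment value. This is a sketch only: the function name is hypothetical, and falling back to the default for missing, non-numeric, or negative values is an assumption about behavior the text does not specify.

```powershell
# Hypothetical helper: resolves the cooldown with the documented default and max.
function Get-SpectrePromptCooldownMs {
    param([string]$RawValue = $env:LOCAL_LLM_SPECTRE_PROMPT_COOLDOWN_MS)

    $default = 500    # documented default
    $max     = 5000   # documented maximum
    $parsed  = 0

    # Assumption: unparsable or negative input falls back to the default.
    if (-not ([int]::TryParse($RawValue, [ref]$parsed)) -or ($parsed -lt 0)) {
        return $default
    }
    return [math]::Min($parsed, $max)
}

$custom  = Get-SpectrePromptCooldownMs -RawValue '750'
$clamped = Get-SpectrePromptCooldownMs -RawValue '99999'
$invalid = Get-SpectrePromptCooldownMs -RawValue 'abc'
```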

llms launches the Spectre wizard explicitly. llmc remains an explicit native-picker alias.

$env:LOCAL_LLM_SPECTRE_PROMPT_COOLDOWN_MS = '750'
$env:LOCAL_LLM_NO_SPECTRE = '1'   # disable Spectre everywhere / make llm use native

Casing convention

The repo mixes casing styles intentionally, following three rules:

  • kebab-case for folders (local-llm/, ollama-proxy/) — matches their deployed path.
  • PascalCase for the entry-point script (LocalLLMProfile.ps1) — PowerShell convention.
  • kebab-case for data files (llm-models.json).

These names are user-visible (the deployed paths). Renaming them would break setups, so they stay.


Troubleshooting

  • init says Ollama version too old → winget upgrade Ollama.Ollama.
  • Refusing -Q8 with -Ctx 256 ... → drop -Q8, lower -Ctx, or raise the ceiling: Set-LocalLLMSetting Q8KvMaxContext <tokens>.
  • <model> does not advertise tool support → RequireAdvertisedTools is on by default; some uncensored Qwen variants don't advertise capabilities. Verify with ollama show <alias>. Bypass for one launch with Start-ClaudeWithOllamaModel -Model <alias> -SkipToolCheck, or globally via Set-LocalLLMSetting RequireAdvertisedTools $false.
  • Stale aliases after editing a parser → init -Stale rebuilds only the aliases whose Modelfile content hash drifted.
  • Spectre wizard crashed or stalls → llmlogerr for the full trace; use llmlog for launch/debug details (vision, proxy, llama-server, Claude); llmc for the native picker or set $env:LOCAL_LLM_NO_SPECTRE=1 to disable Spectre everywhere. If the next prompt appears too slowly after selecting a model, raise $env:LOCAL_LLM_SPECTRE_PROMPT_COOLDOWN_MS.
  • bun not on PATH → only required for Unshackled launches. Install via winget install Oven-sh.Bun.

More

  • CHANGELOG.md — what shipped, when.
