Problem
Our MTP_FLAG_BUNDLE (src/hal0/config/schema.py) ships --spec-draft-p-min 0.0 --spec-draft-n-max 4. Current research on MTP/speculative-decoding tuning for llama.cpp indicates:
- --spec-draft-p-min matters more than --spec-draft-n-max for effective throughput, and 0.0 is too permissive: it accepts every draft regardless of confidence, wasting verification on low-probability drafts. Recommended sweet spot ~0.75.
- For dense models (e.g. Qwen3.6-27B dense), --spec-draft-n-max 5 is the recommended draft length; MoE wants shorter/none.
Our bundle was bench-tuned earlier (hal0-container-bench-2026-06-08.md) but predates this guidance; the rocm-mtp profile currently benches slower than rocm (24.4 vs 52.8 tps) on the MoE workload — expected for MoE, but the dense path may be leaving throughput on the table with p-min 0.0.
Ask
Bench MTP_FLAG_BUNDLE variants on Strix Halo with an MTP-capable dense GGUF:
- p-min: 0.0 (current) vs 0.5 vs 0.75
- n-max: 4 (current) vs 5
Measure tok/s + acceptance rate. Update the bundle + PROFILE_BENCH if a variant wins; keep hal0-container-bench-*.md in sync.
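A minimal sketch of the variant matrix described above, in case it helps when scripting the runs. The helper names (variant_flags, bench_matrix, acceptance_rate) are hypothetical and do not exist in hal0; the real shape of MTP_FLAG_BUNDLE is not shown in this issue, so the flags are built as a plain argument list. Acceptance rate is assumed here to mean accepted draft tokens divided by drafted tokens.

```python
from itertools import product

# Values taken from the Ask above.
P_MIN_VALUES = (0.0, 0.5, 0.75)
N_MAX_VALUES = (4, 5)


def variant_flags(p_min: float, n_max: int) -> list[str]:
    """Build the argument list for one variant (hypothetical helper).

    Uses the same two flag names the current bundle ships.
    """
    return ["--spec-draft-p-min", str(p_min), "--spec-draft-n-max", str(n_max)]


def bench_matrix() -> list[tuple[float, int, list[str]]]:
    """Return every (p_min, n_max, flags) combination to bench."""
    return [
        (p_min, n_max, variant_flags(p_min, n_max))
        for p_min, n_max in product(P_MIN_VALUES, N_MAX_VALUES)
    ]


def acceptance_rate(accepted: int, drafted: int) -> float:
    """Assumed definition: accepted draft tokens / drafted tokens.

    Returns 0.0 when nothing was drafted, to avoid dividing by zero.
    """
    return accepted / drafted if drafted else 0.0


if __name__ == "__main__":
    for p_min, n_max, flags in bench_matrix():
        print(f"p-min={p_min} n-max={n_max}: {' '.join(flags)}")
```

This gives six variants including the current (0.0, 4) baseline, so each candidate can be compared against the shipped bundle in the same run.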
Context
Surfaced during the slot-config MTP work (per-slot MTP override + capability-gated pill, PR for Phase 2). MTP helps dense / hurts MoE / needs an MTP-capable GGUF — see docs/superpowers/specs/2026-06-14-slot-config-grouping-mtp-templates-design.md.
Sources: github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md; dredyson.com Qwen3.6-27B MTP guide.