Bench-retune MTP_FLAG_BUNDLE (spec-draft-p-min 0.0 → ~0.75, n-max → 5 for dense) #799

@thinmintdev

Problem

Our MTP_FLAG_BUNDLE (src/hal0/config/schema.py) ships --spec-draft-p-min 0.0 --spec-draft-n-max 4. Current research on MTP/speculative-decoding tuning for llama.cpp indicates:

  • --spec-draft-p-min matters more than --spec-draft-n-max for effective throughput, and 0.0 is too permissive: it lets every drafted token through regardless of the draft model's confidence, so verification work is wasted on low-probability drafts. The recommended sweet spot is ~0.75.
  • For dense models (e.g. Qwen3.6-27B dense), --spec-draft-n-max 5 is the recommended draft length; MoE models want shorter drafts or none at all.

Our bundle was bench-tuned earlier (hal0-container-bench-2026-06-08.md) but predates this guidance; the rocm-mtp profile currently benches slower than rocm (24.4 vs 52.8 tps) on the MoE workload — expected for MoE, but the dense path may be leaving throughput on the table with p-min 0.0.
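
To put the MoE numbers above on a common scale, a minimal sketch of the ratio is shown here. The helper name `relative_throughput` is hypothetical and not part of the hal0 codebase; the two figures are the ones quoted in this issue.

```python
def relative_throughput(candidate_tps: float, baseline_tps: float) -> float:
    """Return candidate throughput as a fraction of the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return candidate_tps / baseline_tps


# Figures quoted in this issue for the MoE workload.
ROCM_MTP_TPS = 24.4  # rocm-mtp profile
ROCM_TPS = 52.8      # rocm profile

ratio = relative_throughput(ROCM_MTP_TPS, ROCM_TPS)
print(f"rocm-mtp runs at {ratio:.0%} of rocm throughput on MoE")
```

The same ratio, computed per variant on the dense workload, would make it easy to see whether a retuned bundle closes the gap or overtakes the non-MTP profile.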

Ask

Bench MTP_FLAG_BUNDLE variants on Strix Halo with an MTP-capable dense GGUF:

  • p-min: 0.0 (current) vs 0.5 vs 0.75
  • n-max: 4 (current) vs 5

Measure tok/s + acceptance rate. Update the bundle + PROFILE_BENCH if a variant wins; keep hal0-container-bench-*.md in sync.
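
The sweep above could be enumerated roughly as follows. This is only a sketch: `bundle_variants`, `pick_winner`, and the `min_gain` threshold are hypothetical names and assumptions for illustration, not existing hal0 code, and the flag spellings are copied from the bundle as quoted in this issue.

```python
from itertools import product

# Values taken from this issue; BASELINE is the currently shipped bundle.
P_MIN_VALUES = (0.0, 0.5, 0.75)
N_MAX_VALUES = (4, 5)
BASELINE = (0.0, 4)


def bundle_variants():
    """Return one entry per (p_min, n_max) pair, with its CLI flags."""
    variants = []
    for p_min, n_max in product(P_MIN_VALUES, N_MAX_VALUES):
        flags = [
            "--spec-draft-p-min", str(p_min),
            "--spec-draft-n-max", str(n_max),
        ]
        variants.append({"p_min": p_min, "n_max": n_max, "flags": flags})
    return variants


def pick_winner(results, min_gain=0.03):
    """Return the fastest variant, or BASELINE if nothing beats it clearly.

    `results` maps (p_min, n_max) -> measured tok/s. `min_gain` is an
    assumed noise margin (3% here); tune it to the bench's run-to-run variance.
    """
    base_tps = results[BASELINE]
    best = max(results, key=results.get)
    if results[best] >= base_tps * (1 + min_gain):
        return best
    return BASELINE


for variant in bundle_variants():
    print(" ".join(variant["flags"]))
```

This yields six variants including the current bundle. Requiring a margin over the baseline before declaring a winner is a design choice to avoid churning the bundle and PROFILE_BENCH on differences that are within bench noise.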

Context

Surfaced during the slot-config MTP work (per-slot MTP override + capability-gated pill, PR for Phase 2). MTP helps dense / hurts MoE / needs an MTP-capable GGUF — see docs/superpowers/specs/2026-06-14-slot-config-grouping-mtp-templates-design.md.

Sources: github.com/ggml-org/llama.cpp/blob/master/docs/speculative.md; dredyson.com Qwen3.6-27B MTP guide.
