
[Experiment] QMV LUT Dequant#394

Draft
CC-Yeh wants to merge 2 commits into main from qmv_lut

Conversation

@CC-Yeh (Contributor) commented May 8, 2026

Tested replacing QmvFast's pure-ALU int4→float dequant (the uint_to_fp
mantissa trick: nibble extract via shift/mask, bit-OR into a float
mantissa, then fsub) with a threadgroup-memory LUT lookup. No win at
any tested shape, and a mild E2E regression on real models. The
mantissa trick stays optimal.

Headline (Apple M4, ZP_BF16_gs64, 4-bit, kernel µs medians)

| Shape | M | main | LUT | Δ |
|---|---|---|---|---|
| (4096, 4096) | 1 | 80.7 | 81.7 | +1% (tied) |
| (4096, 4096) | 2 | 83.0 | 123.2 | +48% |
| (4096, 4096) | 4 | 150.5 | 241.0 | +60% |
| (14336, 4096) | 1 | 314.4 | 305.8 | −2.7% (tied, within noise) |

E2E LFM2.5 4-bit decode: −1.8% (RHT), −3.5% (MLX).

What was tried

  • Constant-memory LUT: divergent loads serialize; +73–130%.
  • Threadgroup-memory LUT: partial recovery, but still +50–60% at M≥2.
  • bfloat2 entries: same wallclock as half2.
  • Manual bf << 16 widen: eliminates air.convert from the AIR
    (verified), but no wallclock change. The convert wasn't the cost.

Verdict

The actual cost is L1 cache port contention: LUT reads compete with
weight loads for the same L1 read port. The mantissa trick has zero
memory ops in its dequant chain (pure ALU: it extracts nibbles and
converts them to float in one fused bit-twiddle), so it doesn't fight
for the port. No LUT variant can beat that on Apple GPUs.

@CC-Yeh (Contributor, Author) commented May 8, 2026

Kernel benchmarks (M4, ZP_BF16_gs64, criterion median µs)

(4096, 4096) — 8 MB, fits SLC

| M | main | LUT | Δ |
|---|---|---|---|
| 1 | 80.7 | 81.7 | +1% (tied) |
| 2 | 83.0 | 123.2 | +48% |
| 4 | 150.5 | 241.0 | +60% |

(14336, 4096) M=1 — 28 MB, 5-rep verification

| | main | LUT |
|---|---|---|
| median | 314.4 | 305.8 |
| min/max | 309.3 / 323.8 | 304.9 / 319.1 |
| spread | 4.6% | 4.6% |

LUT 2.7% faster, within noise. No regression at this shape/M.

(An earlier sweep reported +275% for this case; it turned out to be a
broken measurement, and a clean rerun didn't reproduce it. The
"wide-shallow shapes are catastrophic" claim is retracted.)

E2E LFM2.5 4-bit (n=15, M4)

| Model | Build | Decode tok/s |
|---|---|---|
| RHT-4bitLmHead | main | 143.7 |
| RHT-4bitLmHead | LUT | 141.1 |
| MLX-4bit | main | 146.2 |
| MLX-4bit | LUT | 141.1 |

Decode regresses 1.8–3.5% across both formats — small but consistent.
