[Experiment] QMV LUT Dequant by CC-Yeh · Pull Request #394 · trymirai/uzu

CC-Yeh · 2026-05-08T13:27:36Z

Tested replacing QmvFast's pure-ALU int4→float dequant (the uint_to_fp
mantissa trick: nibble extract via shift/mask, bit-OR, fsub) with
a threadgroup-memory LUT lookup. No win at any tested shape, and a mild
E2E regression on real models. The mantissa trick stays optimal.
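For readers unfamiliar with the two approaches, here is a minimal Python sketch of the idea. It is illustrative only: it uses float32 with the magic constant 2^23, whereas the kernel's actual types, constants, and LUT layout are not shown in this PR. Scale and zero-point handling is omitted, and all function names are hypothetical.

```python
import struct

# Assumption for illustration: float32 with magic constant 2**23.
# OR-ing a nibble into the low mantissa bits of 2**23 yields 2**23 + nibble
# exactly, so a single subtraction recovers the nibble as a float.
MAGIC_BITS = 0x4B000000  # bit pattern of 8388608.0 (2**23)
MAGIC = 8388608.0


def bits_to_f32(bits):
    """Reinterpret a 32-bit unsigned integer as a float32."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dequant_mantissa(packed):
    """Unpack two 4-bit values from one byte using only ALU-style ops."""
    lo = packed & 0xF          # nibble extract: mask
    hi = (packed >> 4) & 0xF   # nibble extract: shift + mask
    return (
        bits_to_f32(MAGIC_BITS | lo) - MAGIC,  # bit-OR, then fsub
        bits_to_f32(MAGIC_BITS | hi) - MAGIC,
    )


# Hypothetical LUT layout: one (lo, hi) float pair per possible byte.
LUT = [(float(b & 0xF), float(b >> 4)) for b in range(256)]


def dequant_lut(packed):
    """Same result via a table read, which is a memory op on a GPU."""
    return LUT[packed]


# Both paths agree for every possible packed byte.
assert all(dequant_mantissa(b) == dequant_lut(b) for b in range(256))
```

The two functions return identical values; the difference that matters on the GPU is that the first needs no load at all, while the second adds one table read per packed byte.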

Headline (Apple M4, ZP_BF16_gs64, 4-bit, kernel µs medians)

Shape	M	main	LUT	Δ
(4096, 4096)	1	80.7	81.7	+1% (tied)
(4096, 4096)	2	83.0	123.2	+48%
(4096, 4096)	4	150.5	241.0	+60%
(14336, 4096)	1	314.4	305.8	−2.7% (tied, within noise)

E2E LFM2.5 4-bit decode: −1.8% (RHT), −3.5% (MLX).

What was tried

Constant-memory LUT: divergent loads serialize. +73–130%.
Threadgroup-memory LUT: partial recovery. Still +48–60% at M≥2.
bfloat2 entries: same wallclock as half2.
Manual bf << 16 widen: eliminates air.convert from AIR
(verified) but no wallclock change. Convert wasn't the cost.
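The manual widen in the last item relies on bfloat16 being the top 16 bits of a float32, so a left shift by 16 is an exact conversion. A small Python sketch of that bit-level relationship follows; the helper names are hypothetical and this is not the kernel code.

```python
import struct


def f32_to_bf16_bits(x):
    """Keep the top 16 bits of a float32 (truncation, no rounding)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def bf16_bits_to_f32(bits):
    """Widen bf16 bits to float32 by shifting them into the high half."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


# Values exactly representable in bf16 survive the round trip unchanged.
for value in (1.0, -2.5, 0.15625):
    assert bf16_bits_to_f32(f32_to_bf16_bits(value)) == value
```

Because the widen is a pure shift, it replaces the conversion instruction with an integer op, which matches the observation that the convert was not where the time went.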

Verdict

The actual cost is L1 cache port contention: LUT reads compete
with weight loads for the same L1 read port. The mantissa trick has
zero memory ops in the dequant chain (pure ALU, extracts nibbles
and converts to float in one fused bit-twiddle), so it doesn't fight
for the port. None of the LUT variants tested here beat it on the M4.

CC-Yeh · 2026-05-08T13:28:15Z

Kernel benchmarks (M4, ZP_BF16_gs64, criterion median µs)

(4096, 4096) — 8 MB, fits SLC

M	main	LUT	Δ
1	80.7	81.7	+1% (tied)
2	83.0	123.2	+48%
4	150.5	241.0	+60%

(14336, 4096) M=1 — 28 MB, 5-rep verification

	main	LUT
median	314.4	305.8
min/max	309.3 / 323.8	304.9 / 319.1
spread	4.6%	4.6%

LUT 2.7% faster, within noise. No regression at this shape/M.
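The "within noise" call can be checked from the numbers in the table. The sketch below assumes spread is defined as (max - min) / median, which reproduces the reported 4.6%; that definition is an assumption, not something stated in the PR.

```python
def rel_delta(main_us, lut_us):
    """Relative change of LUT versus main (negative means LUT is faster)."""
    return (lut_us - main_us) / main_us


def spread(lo_us, hi_us, median_us):
    """Assumed definition: run-to-run range as a fraction of the median."""
    return (hi_us - lo_us) / median_us


delta = rel_delta(314.4, 305.8)
main_spread = spread(309.3, 323.8, 314.4)
lut_spread = spread(304.9, 319.1, 305.8)

print(f"delta       = {delta * 100:+.1f}%")
print(f"main spread = {main_spread * 100:.1f}%")
print(f"LUT spread  = {lut_spread * 100:.1f}%")

# The median difference is smaller than either build's own run-to-run range.
assert abs(delta) < min(main_spread, lut_spread)
```

With only five repetitions per build, a 2.7% gap inside a 4.6% range does not separate the two builds.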

(An earlier sweep reported +275% for this case. That turned out to be a
broken measurement, and a clean rerun didn't reproduce it. The
"wide-shallow shapes are catastrophic" claim is retracted.)

E2E LFM2.5 4-bit (n=15, M4)

Model	Build	Decode tok/s
RHT-4bitLmHead	main	143.7
RHT-4bitLmHead	LUT	141.1
MLX-4bit	main	146.2
MLX-4bit	LUT	141.1

Decode regresses 1.8–3.5% across both formats — small but consistent.
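As a quick check on the percentages, and to put them in per-token terms, the table values can be converted as follows. This is a sketch that uses only the tok/s figures above; the variable names are hypothetical.

```python
decode_tok_s = {
    "RHT-4bitLmHead": {"main": 143.7, "LUT": 141.1},
    "MLX-4bit": {"main": 146.2, "LUT": 141.1},
}


def pct_change(main, lut):
    """Percent change in throughput going from main to LUT."""
    return (lut - main) / main * 100.0


def ms_per_token(tok_s):
    """Average decode latency per token in milliseconds."""
    return 1000.0 / tok_s


changes = {
    model: round(pct_change(builds["main"], builds["LUT"]), 1)
    for model, builds in decode_tok_s.items()
}

for model, builds in decode_tok_s.items():
    extra_ms = ms_per_token(builds["LUT"]) - ms_per_token(builds["main"])
    print(f"{model}: {changes[model]:+.1f}% tok/s, +{extra_ms:.3f} ms per token")
```

In both formats the LUT build adds a fraction of a millisecond per decoded token.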

CC-Yeh added 2 commits May 8, 2026 10:47

lut experiments

41f0690

use threadgroup

66b6629

CC-Yeh wants to merge 2 commits into main from qmv_lut
