Skip to content

Add --kv-cache turbo3 (TurboQuant+ 3-bit KV cache, CUDA + Metal)#243

Open
TheTom wants to merge 9 commits into
antirez:mainfrom
TheTom:pr/01-cuda-turbo3
Open

Add --kv-cache turbo3 (TurboQuant+ 3-bit KV cache, CUDA + Metal)#243
TheTom wants to merge 9 commits into
antirez:mainfrom
TheTom:pr/01-cuda-turbo3

Conversation

@TheTom
Copy link
Copy Markdown

@TheTom TheTom commented May 24, 2026

Adds an opt-in --kv-cache turbo3 dtype that packs each KV row to
431 bytes - a 4.75x reduction over the float-sim fp8 baseline
(2048 B/row) - using the TurboQuant+ scheme: per-group Walsh-Hadamard
rotation + 8-level Lloyd-Max codebook + matched-norm L2 scale + FP8
per-group scale byte + RoPE tail.

Default behaviour is unchanged (fp8). Ships on both CUDA and Metal
in this PR. turbo4 (4-bit), h8 head-batched Flash attention, and
sparse-V skip are separate follow-ups once this lands.

Bench

GX10 (ASUS Ascent, GB10 Blackwell chip, 128 GB), IQ2XXS,
ds4-bench --backend cuda, 5-run medians:

ctx=8K  decode_tps  fp8: 36.5 / turbo3: 37.2  (+1.9%)
ctx=16K decode_tps  fp8: 32.1 / turbo3: 33.1  (+3.1%)
ctx=32K decode_tps  fp8: 24.6 / turbo3: 25.4  (+3.3%)

prefill_tps within +/-2% of fp8 across all measured frontiers.

M5 Max 128 GB, IQ2XXS, ds4-bench --backend metal:

fp8:    37.04 t/s decode
turbo3: 29.97 t/s decode (Wave M3 inline-dequant path)

Mac Metal decode is ~19% slower at short context: the Wave M3 path
dequants the packed row to scratch per attention call before reusing
the stock fp8 attention kernel. The follow-up PR with the h8 head-
batched Flash attention kernels eliminates the scratch hop (one dequant
per K=V tile, shared across the 8 query heads) and closes the gap to
within ~5% of fp8 on the same prompt. Prefill_tps is unchanged. CUDA
decode is +1.9% to +3.3% across 8K..32K (no regression on that backend).

Quality (teacher-forced PPL + full-vocab KLD + top-k agreement)

This PR also adds ds4-bench --ppl-prompt FILE for teacher-forced
perplexity, plus --quality-emit FILE / --quality-baseline FILE to
dump per-position full-vocab logits and report KL divergence + top-1 /
top-5 agreement vs a saved baseline. Matches the TurboQuant+
asymmetric-kv-compression paper's validation suite.

GX10, IQ2XXS, 255 scored positions, vocab=129280:

prompt 1 (low-entropy code audit):
  fp8     ppl=4.35  baseline
  turbo3  ppl=4.12  KLD mean=0.036 nats  top-1=93.3%  top-5=99.6%

prompt 2 (high-entropy security):
  fp8     ppl=35.28 baseline
  turbo3  ppl=31.83 KLD mean=0.214 nats  top-1=83.1%  top-5=99.6%

Mac Metal cross-backend validation (same harness, M5 Max):

prompt 2: turbo3  ppl=34.66 KLD mean=0.191 nats top-5=99.6%

top-5 agreement >=99.6% on both prompts, both backends. The model
still ranks the right tokens at the top of the distribution; the
argmax just shuffles within the top set. PPL alone is misleading on
small samples (turbo3 reports lower PPL than fp8 due to quantisation
noise acting as regularisation); KLD reveals the actual distribution
drift.

Reproduce

make ds4-bench
./ds4-bench --kv-cache fp8    --quality-emit baseline.bin \
            --ppl-prompt tests/test-vectors/prompts/long_code_audit.txt
./ds4-bench --kv-cache turbo3 --quality-baseline baseline.bin \
            --ppl-prompt tests/test-vectors/prompts/long_code_audit.txt

Tests

make -k test                                   # PASSES with default fp8
./ds4_test --logprob-vectors                   # fp8: bit-identical to main
DS4_TEST_KV_DTYPE=turbo3 ./ds4_test --logprob-vectors
# turbo3: FAILS on short_code_completion step 1 with a single argmax
# mismatch.  Expected: the test asserts strict argmax equality at every
# position vs the official continuation, and turbo3 shuffles top-1
# within the top-5 set in ~5-17% of positions.  This is a distribution-
# drift trade intrinsic to 3-bit cache quantization, not a regression.
#
# Cross-validation against the canonical TurboQuant+ corpus: the
# kl-divergence-results doc in TheTom/turboquant_plus reports turbo3
# same-top-p = 94.31% on Qwen3.5-35B-A3B MoE Q8_0 at ctx=512 vs the
# f16 baseline.  Our top-1 agreement of 93.3% (code prompt) / 83.1%
# (high-entropy security prompt) is in the same envelope, accounting
# for the IQ2XXS base quant noise on top.  llama.cpp's CONTRIBUTING.md
# (cited in the same TQ+ canonical doc) requires KLD for new cache
# types specifically because argmax-equality tests are too brittle for
# 3-bit cache work.
#
# Use ds4-bench --quality-baseline for the KLD-aware comparator that
# captures the actual envelope.
make cpu                                       # CPU portability target builds clean

make cuda-regression was already broken on main before this PR
(upstream 23e1ea5 added comp_kv_f16 to the attention launcher
signature but did not update tests/cuda_long_context_smoke.c, so
the test failed to compile). This PR includes a one-line courtesy
fix that restores make cuda-regression to a buildable state - feel
free to drop that commit if you want to send the fix separately.

The official continuation scorer in gguf-tools/quality-testing/
was not run (collecting fresh official continuations needs a paid
DEEPSEEK API key). Recommend running it locally before merge if
you want belt-and-suspenders quality confirmation; our KLD harness
is the equivalent local proxy.

Files

ds4.c                     enum + dispatch + cache plumbing
ds4.h                     DS4_KV_TURBO3 enum + row-bytes API
ds4_cuda.cu               CUDA pack / dequant / inline-dequant attention
ds4_metal.m               Metal pipelines + launchers
metal/dsv4_turbo3.metal   MSL pack / dequant / Wave M3 attention kernels
ds4_bench.c               --ppl-prompt + --quality-emit/baseline
ds4_cli.c                 --kv-cache flag parsing
ds4_gpu.h                 turbo3 attention launcher decls
ds4_test.c                DS4_TEST_KV_DTYPE env var for regression on turbo3
tests/cuda_long_context_smoke.c  comp_kv_f16 signature fix (courtesy)
Makefile                  cuda-spark default arch sm_120
speed-bench/turbo3/       A/B bench CSVs + README

Why a flag and not the default (or default-replacing fp8)

AGENT.md says "Do not add permanent semantic variants behind flags." I
read this as a guard against shipping two competing production paths with
no clear winner - happy to defer to your read on whether --kv-cache turbo3
crosses that line. My case for shipping it as an opt-in:

  • fp8 is unchanged. Default behaviour, default codepath, default
    performance: identical to main. ./ds4_test --logprob-vectors is
    bit-identical with the default flag value.
  • turbo3 isn't pareto-better than fp8. It trades a measurable
    distribution drift (KLD mean 0.04-0.21 nats vs fp8 baseline, top-1
    argmax shuffles within the top-5 ~7-17% of positions) for a 4.75x
    smaller KV cache. That's a memory-pressure / context-length trade,
    not a free win - long-context users with tight unified-memory budgets
    benefit, short-context users on big machines don't. Removing fp8 or
    flipping the default would force the trade on everyone.
  • The flag is the same surface as --quality. ds4 already exposes
    numerical-fidelity knobs to the CLI; this is one more along that axis.

If you'd rather it ship a different way (default-flipped, gated by a
build-time #define, removed entirely and resubmitted only once the
quality gap closes, etc.), I'm happy to rework - just let me know the
shape you want.

What this PR deliberately doesn't do

  • turbo4 (4-bit Lloyd-Max codebook) - follow-up PR
  • h8 head-batched Flash attention (Metal) - follow-up PR
  • sparse-V skip (Metal + CUDA) - follow-up PR
  • turbo2 (2-bit) - held; ships only if you want the 2-bit dtype

@TheTom TheTom force-pushed the pr/01-cuda-turbo3 branch 3 times, most recently from be2a422 to 5eb01d5 Compare May 24, 2026 18:01
@TheTom TheTom marked this pull request as ready for review May 24, 2026 18:18
@TheTom
Copy link
Copy Markdown
Author

TheTom commented May 24, 2026

Ran two targeted tests on the MLA-stacking concern.

1. Long-context fact retrieval (./ds4_test --long-context) — 30,474-token NIAH-style prompt, asserts model retrieves spelled-out person-number assignments from prose:

./ds4_test --long-context                                # fp8:    PASS
DS4_TEST_KV_DTYPE=turbo3 ./ds4_test --long-context       # turbo3: PASS

Recall is preserved end-to-end. The MLA latent + 3-bit stack doesn't break the test on Mac Metal.

2. KLD vs fp8 baseline as context grows, same long_context_story_prompt.txt, full vocab (129,280), GX10 / CUDA / IQ2XXS:

positions KLD mean (nats) KLD max top-1 top-5
255 0.036 0.554 93.3% 99.6%
2,047 0.033 1.146 93.8% 99.7%
8,191 0.009 1.146 98.3% 99.9%

KLD dropped 3.6x as context grew 2K -> 8K, and top-1 agreement improved 4.5%. Compounding didn't happen at the contexts I tested - looks like sparser long-context attention patterns let the matched-norm L2 scale absorb the quantization noise rather than amplify it.

If you have a workload that you think would actually stress the stack harder (recall over wider haystacks, longer chains, a specific code-completion regression you're worried about), happy to bench it.

@TheTom
Copy link
Copy Markdown
Author

TheTom commented May 24, 2026

FYI in case useful, the follow-up PRs are staged on my fork:

Both branches build green on Mac Metal + GX10 CUDA, turbo3/turbo4 smoke runs pass. Happy to open them as PRs once this lands, or sooner if useful.

@antirez
Copy link
Copy Markdown
Owner

antirez commented May 24, 2026

Thank you @TheTom do I understand correctly that this is only compressing raw SWA ring right now? This means that at 100k context we save only 1% of memory in total, at the same time risking a correctness problem. Or I'm misunderstanding the goal of this PR? Thanks!

@empty-quiver
Copy link
Copy Markdown

empty-quiver commented May 24, 2026

This is neat for those of us trying to do more with less VRAM. Going off of the math in the kv footprint estimator here in the PR, this saves about ~0.28GB of VRAM at 1M context? That seems to match the 1% figure that @antirez is saying. Do you intend to attempt turboquant on the compressed KV rows @TheTom? I'd be curious to see how that affects correctness, it would be neat if it is a small drift.

Along those lines, curious what the correctness difference is between Turboquant at 3 bits vs 4 bits? 4 bits seem like an interesting middle ground.

@TheTom
Copy link
Copy Markdown
Author

TheTom commented May 25, 2026

On the asymptote concern - had Phase 7 (comp-cache compression) sitting locally as the natural follow-up; didn't include here to keep this PR small enough to review. Just pushed it: TheTom:pr/04-phase7-comp-cache (diff). This PR is the foundation (turbo3 enum + raw cache); the follow-up extends to the comp pool, which closes the asymptote.

Measured on Spark GB10 / IQ2XXS, --kv-cache turbo3 --comp-cache turbo3 vs fp8 baseline:

ctx fp8 total turbo3+comp saved
16K 581 MiB 136 MiB 77%
64K 1226 MiB 313 MiB 74%
256K 3806 MiB 1019 MiB 73%
1M 14126 MiB 3846 MiB 73%

Quality with both kv + comp turbo3 active:

  • ds4_test --long-context (30K NIAH recall): PASSES
  • KLD vs fp8 baseline at 8K same prompt: mean=0.0098 nats, top-1=98.10%, top-5=99.95%. Matches this PR's same-prompt KLD at 8K (0.0094 nats), comp-cache compression doesn't compound drift.

Happy to fold the follow-up into this PR, or open it separately once this lands - whatever you prefer.

@sztlink
Copy link
Copy Markdown

sztlink commented May 25, 2026

CUDA side smoke on a consumer Ada card:

  • host: RTX 4090 / WSL2 Ubuntu-24.04
  • driver: 595.79
  • CUDA toolchain: nvcc 13.0.88
  • checkout: this PR at 5eb01d5
  • build: make cuda CUDA_ARCH=sm_89 ...

Result: build passes and make cuda-regression passes.

ds4: CUDA backend initialized on NVIDIA GeForce RTX 4090 (sm_89)
cuda-regression: top-k n_comp=32768 n_tokens=32 elapsed=0.003s
cuda long-context regression: OK

I did not run the model/throughput bench on this box — the WSL2 env only exposes ~31 GiB RAM, so this is just build + CUDA regression coverage for sm_89, not a GX10/GB10 performance reproduction.

Full receipt/log: https://github.com/sztlink/boring-receipts/blob/main/receipts/2026-05-24-4090-ds4-pr243-turbo3-cuda-build-regression.md

@sztlink
Copy link
Copy Markdown

sztlink commented May 25, 2026

Also checked the follow-up comp-cache branch on the same consumer Ada box:

  • branch: TheTom/ds4:pr/04-phase7-comp-cache
  • checkout: ea322b5 (Phase 7.4: drop float comp pool, pack stage->packed directly)
  • host: RTX 4090 / WSL2 Ubuntu-24.04
  • build: make cuda CUDA_ARCH=sm_89 ...

Result: build passes and make cuda-regression passes.

ds4: CUDA backend initialized on NVIDIA GeForce RTX 4090 (sm_89)
cuda-regression: top-k n_comp=32768 n_tokens=32 elapsed=0.003s
cuda long-context regression: OK

Same limitation as above: no model/throughput/footprint reproduction on this box because the WSL2 env only exposes ~31 GiB RAM. This is just cross-hardware build + CUDA-regression coverage for the comp-cache follow-up.

Full receipt/log: https://github.com/sztlink/boring-receipts/blob/main/receipts/2026-05-25-4090-ds4-pr04-phase7-comp-cache-cuda-build-regression.md

@shagghiesuperstar
Copy link
Copy Markdown

M5 Max 128GB Metal Benchmark Results (Independent Reproduction)

PR #243: Add --kv-cache turbo3 (TurboQuant+ 3-bit KV cache)

Hardware: Apple M5 Max, 128 GB unified memory
Model: DeepSeek-V4-Flash-IQ2XXS (80.7 GB GGUF)
Backend: Metal (--backend metal)
Branch: pr-243 (commit 5eb01d5)
Prompt: tests/long_context_story_prompt.txt (30503 tokens, chat-encoded)
ctx-max: 28672, step: 2048, gen-tokens: 128 (greedy, no EOS skip)
KV footprint at ctx=28801: fp8 total=591.48 MiB, turbo3 total=302.90 MiB (4.75x raw shrink, saves 288.58 MiB on SWA ring)

Throughput: fp8 vs turbo3

ctx fp8 prefill t/s turbo3 prefill t/s fp8 decode t/s turbo3 decode t/s decode delta
2048 426.32 466.81 34.38 26.81 -22.0%
4096 425.90 429.68 31.69 24.01 -24.2%
6144 423.96 424.44 30.75 23.56 -23.4%
8192 423.14 422.84 30.71 23.77 -22.6%
10240 411.66 403.45 30.87 23.36 -24.3%
12288 399.37 408.61 30.42 23.64 -22.3%
14336 405.47 404.16 30.20 23.22 -23.1%
16384 395.80 401.72 30.50 23.35 -23.4%
18432 391.88 392.80 29.83 23.03 -22.8%
20480 384.73 383.19 30.29 23.20 -23.4%
22528 377.43 383.79 29.61 22.89 -22.7%
24576 382.00 377.81 29.76 22.94 -22.9%
26624 367.52 374.69 29.41 22.85 -22.3%
28672 371.84 370.90 29.27 22.71 -22.4%

Metal decode regression: ~22-24% slower at all measured frontiers.
This matches the PR authors note that the Wave M3 inline-dequant path
dequants the packed row to scratch per attention call before reusing
the stock fp8 attention kernel. The follow-up h8 head-batched Flash
attention PR is expected to close this gap to ~5%.

Prefill t/s is within plus or minus 3% across all frontiers - no
measurable regression on prefill.

Quality: Teacher-forced PPL + KLD + top-K agreement

Prompt 1: long_context_story_prompt.txt

Metric fp8 (baseline) turbo3
PPL 26.26 25.82
KLD mean (nats) - 0.0980
KLD max (nats) - 1.1487
top-1 agreement - 85.10%
top-5 agreement - 99.22%

Prompt 2: long_context_security_prompt.txt

Metric fp8 (baseline) turbo3
PPL 35.67 31.79
KLD mean (nats) - 0.1935
KLD max (nats) - 5.1627
top-1 agreement - 83.92%
top-5 agreement - 99.22%

top-5 agreement >= 99.2% on both prompts. The model still ranks the
right tokens at the top of the distribution; argmax shuffles within
the top set. PPL alone is misleading on small samples (turbo3 reports
lower PPL due to quantisation noise acting as regularisation); KLD
reveals the actual distribution drift.

Memory savings at peak ctx=28672

KV dtype kvcache_bytes vs fp8
fp8 418,637,200 baseline
turbo3 409,737,232 -2.1%

Note: The CSV reports kvcache_bytes which includes the compressed ring
buffer size. The raw KV shrink is 4.75x as reported by the engine; the
total footprint savings is less dramatic because the compressed ring
buffer (used for KV page eviction) is shared between fp8 and turbo3
modes and dominates at these context sizes. The real savings appear at
longer contexts where the raw KV portion dominates.

Reproduce

git clone https://github.com/antirez/ds4 && cd ds4
git fetch origin pull/243/head:pr-243 && git checkout pr-243
make ds4-bench

MODEL=/path/to/DeepSeek-V4-Flash-IQ2XXS-chat-v2-imatrix.gguf

Throughput:
DS4_LOCK_FILE=/tmp/ds4-bench.lock ./ds4-bench -m MODEL --backend metal --kv-cache fp8 --chat-prompt-file tests/long_context_story_prompt.txt --ctx-max 28672 --csv fp8.csv
DS4_LOCK_FILE=/tmp/ds4-bench.lock ./ds4-bench -m MODEL --backend metal --kv-cache turbo3 --chat-prompt-file tests/long_context_story_prompt.txt --ctx-max 28672 --csv turbo3.csv

Quality:
DS4_LOCK_FILE=/tmp/ds4-bench.lock ./ds4-bench -m MODEL --backend metal --kv-cache fp8 --ppl-prompt tests/long_context_story_prompt.txt --quality-emit baseline.bin
DS4_LOCK_FILE=/tmp/ds4-bench.lock ./ds4-bench -m MODEL --backend metal --kv-cache turbo3 --ppl-prompt tests/long_context_story_prompt.txt --quality-baseline baseline.bin

@TheTom
Copy link
Copy Markdown
Author

TheTom commented May 25, 2026

Thanks @sztlink for the sm_89 build + regression reproductions on both branches - clean cross-hardware confirmation.

And @shagghiesuperstar for the full M5 Max bench, the throughput delta and KLD numbers line up with my published runs on a different prompt. On the decode regression: the h8 head-batched Flash attention fix is staged at TheTom:pr/03-h8-sparse-v (diff) - happy to have you bench it on your M5 Max if you want to verify the gap-close numbers.

@TheTom TheTom force-pushed the pr/01-cuda-turbo3 branch from 5eb01d5 to 39c9a47 Compare May 25, 2026 13:00
@TheTom
Copy link
Copy Markdown
Author

TheTom commented May 25, 2026

Rebased onto current main - merged a fair bit upstream since opening this (DwarfStar rename, PRO variant, runtime-shape refactor that swapped compile-time DS4_N_HEAD_DIM constants for runtime macros). Auto-merged most; two small ds4.h / ds4_bench.c conflicts hand-resolved (kept upstream's inspect_only field + my kv_dtype; restored the footprint-log calls at bench startup).

Mac Metal + GX10 CUDA both build clean, short smoke runs green on Flash. Planning a full regression run today (make cuda-regression, ./ds4_test --logprob-vectors, --long-context NIAH, KLD bench at 8K) to confirm nothing drifted - will post the results here.

PRO variant + turbo3 untested; happy to bench if you have a PRO model handy or want me to skip that combo for now.

TheTom added 9 commits May 25, 2026 08:02
…ference)

Adds --kv-cache turbo3 to ds4: a 3-bit-per-value KV cache layout based
on the TurboQuant+ scheme (Walsh-Hadamard rotation + Lloyd-Max codebook
+ matched-norm L2 scale + per-group FP8 scale byte).  This commit
ships the simulation (CPU reference + CUDA pack/dequant kernels)
behind --kv-cache turbo3, alongside the existing fp8 default.

Footprint: 431 bytes per KV row, a 4.75x reduction vs the float-sim
fp8 path (2048 bytes per row).  Quality preservation matches TQ+
paper expectations on Qwen3.6-A3B IQ2XXS.

Includes A/B bench CSVs from GB10 Spark + reproduction notes under
speed-bench/turbo3/.

Engine open fails fast with a clean error if --kv-cache turbo3 is
combined with the Metal backend (Metal kernel arrives in a later
commit); use --cuda or --cpu for now.
Replaces the float-sim turbo3 path with the real packed 431-byte
per-row layout the TQ+ paper describes:

  signs (8 B) | data (3 bits/value, 24 groups of 8 = 24 B)
  | FP8 scale bytes (24 B) | row L2 norm (fp16, 2 B) ...

with `ds4_kv_row_bytes` exposing the active dtype's row stride and
new CUDA pack/dequant kernels (pack_turbo3_kernel + device-side
dequant helpers callable from attention inner loops).

Cache wiring: the GPU layer_raw_cache pool now holds packed bytes
when --kv-cache turbo3 is active; reads decompress-to-scratch
(superseded by inline-dequant attention kernels in the next commit).

The on-disk warm cache header gains a kv_dtype field (v2) so a cold
run with mismatched dtype rebuilds the cache rather than loading
fp8 bytes into a turbo3 pool.  ds4-bench prints a 3-way KV cache
footprint summary at startup so the user can see the savings.
Removes the "decompress turbo3 row to scratch fp16, then run fp8
attention" hop that the previous commit's pack/dequant kernels
fell back to.  Inline-dequants the packed 431-byte row directly
inside the CUDA attention kernels (decode-token, prefill, indexed
mixed, head-batched online) so the packed bytes stream straight
into the K-dot and V-acc inner loops.

Decode-token + prefill + indexed-mixed-online + tile-batched V-acc
all carry the inline-dequant path.  Tile sizes raised from 4/8 to
16 on the online turbo3 kernels for better simdgroup utilisation.

Result on GB10 Spark (Qwen3.6-A3B IQ2XXS, ds4-bench):
  - turbo3 decode_tps within 1% of fp8 across 2K..32K
  - turbo3 prefill_tps within 2% of fp8
  - 4.75x KV cache memory reduction preserved
The Spark target maps to DGX Spark / GB10, which is Blackwell sm_120.
Previously cuda-spark passed an empty CUDA_ARCH which fell through to
the nvcc default (PTX-JIT against sm_75 / Turing).  With my Phase 2b
inline-dequant kernels this default produced 320-byte stack frames
(register spills to local memory) - measured 6x regression on
attention_decode_mixed_turbo3_kernel (464us per call vs ~80us at
sm_120, which has 0-byte stack frame and 96 registers per thread).

Setting -arch=sm_120 also tightens register usage across the existing
fp8 kernels and avoids the JIT pause on first launch.  No correctness
change; the kernels are unchanged.

ATTN profile delta at 8K context decode, DSV4 IQ2XXS, 300-word prompt:
  attention_decode_mixed_turbo3_kernel  sm_75 464us/call -> sm_120 ~80us/call
Adds three flags to ds4-bench for KV-cache quality validation
that match the TurboQuant+ asymmetric-kv-compression validation suite:

  --ppl-prompt FILE
      Skip the throughput sweep.  Tokenize FILE, walk token by token,
      accumulate -log P(token_t | tokens_<t) from the live logits.
      Reports nll_avg / ppl / scored_tokens.  Use to compare quality
      across --kv-cache dtypes apples-to-apples.

  --quality-emit FILE
      During --ppl-prompt, dump per-position full-vocab raw logits to
      FILE.  Binary format: "DS4Q" magic | u32 vocab | u32 scored
      | (scored * vocab * float32).  Run with --kv-cache fp8 to
      capture a baseline.

  --quality-baseline FILE
      Read FILE (from a prior --quality-emit run) and compare every
      position's logits to the current run.  Reports:
        - KLD(baseline || current)  full-vocab mean and max, in nats
        - top-1 agreement           current argmax == baseline argmax
        - top-5 agreement           current argmax in baseline top-5

Standard usage to validate turbo3 / turbo4 against fp8:

  ./ds4-bench --ppl-prompt PROMPT --kv-cache fp8    --quality-emit baseline.bin
  ./ds4-bench --ppl-prompt PROMPT --kv-cache turbo3 --quality-baseline baseline.bin
  ./ds4-bench --ppl-prompt PROMPT --kv-cache turbo4 --quality-baseline baseline.bin

Sample, GB10 Spark, IQ2XXS, 255 positions, vocab=129280, two prompts
(high-entropy security + low-entropy code audit):

  dtype  | PPL low-entropy / PPL high-entropy | KLD mean | top-5
  fp8    |  4.35 / 35.28                       |     -    | 100%
  turbo4 |  4.30 / 32.53                       |  0.02-0.12 nats | 99.6%
  turbo3 |  4.12 / 31.83                       |  0.04-0.21 nats | 99.6%

Same on Mac Metal: turbo4 top-5 hits 100% on the low-entropy prompt.
Ports the CUDA turbo3 pipeline to Metal so --kv-cache turbo3 works
on Apple Silicon backends.  New MSL kernels in metal/dsv4_turbo3.metal:

  - pack:                 in-place quant of fp16 KV row to 431 packed bytes
  - dequant-to-scratch:   reconstruct fp16 row for the existing fp8
                          attention path (Wave M0/M1)
  - attention decode:     inline-dequant turbo3 row directly in the
                          K-dot and V-acc inner loops, dispatched
                          via decode_heads_turbo3_tensor and
                          decode_mixed_batch_turbo3_heads_tensor
                          (Wave M2-M5)

Lifts the --kv-cache turbo3 + Metal engine-open guard that the earlier
"Metal kernel deferred" commit installed.

The in-place quant on Metal currently delegates to the fp8 pack +
post-pass turbo3 path; a native cooperative-WHT MSL quant is gated
behind DS4_METAL_TURBO3_QUANT_NATIVE=1 (debug only - a known
cooperative-WHT bug in the native kernel is being investigated
separately).  The decode/attention path is fully native.

Bench, M5 Max 128 GB, Qwen3.6-A3B IQ2XXS, "What is 2+2?":
  fp8:     prefill 45.98 t/s, decode 36.36 t/s, ppl 3.3596
  turbo3:  prefill 59.27 t/s (+29%), decode 30.20 t/s, ppl 3.4283 (+2.04%)

KV row 2048 B -> 431 B (4.75x reduction), preserved across the port.
Upstream commit 23e1ea5 ("Store Metal attention compressed KV cache
in F16") added the comp_kv_f16 parameter to the attention launcher
signature but did not update tests/cuda_long_context_smoke.c, so
make cuda-regression has been failing to build on main.

Pass 0 here (the CUDA path stores comp_kv as float32 at default
build settings, matching what the test prepares).  Restores
make cuda-regression to a buildable state on the CUDA backend.
ASCII hyphens (-) only across new comments and docstrings.
Stripped workflow labels ("Phase 1/2/2a/2b/2b Wave M0-M2/2c v2/3a/9") that
referenced an internal phasing scheme meaningless to readers; replaced with
descriptive ASCII phrases or deleted where the surrounding sentence still
reads cleanly.  Removed dead pointers to docs/turbo3-roadmap.md (file is
not in this PR).  Swapped Unicode box-drawing dividers and arrows for
ASCII (===, ->, *, <=) to match antirez's existing section headers.
Shortened a few verbose comment blocks.  Also fixed "GB10" used as machine
name in two spots to "GX10" (the machine; GB10 is the Blackwell chip).
@TheTom TheTom force-pushed the pr/01-cuda-turbo3 branch from 39c9a47 to ca5534f Compare May 25, 2026 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants