
Commit 9481870

unamedkr and claude committed
Credit HIGGS for the RHT + scalar grid pattern (per Tim Dettmers feedback)
Tim Dettmers commented in llama.cpp #20969 that "Vector quantization + Hadamard transform is basically HIGGS" and asked the discussion not to credit our pattern to TurboQuant. He's right. Verified by reading HIGGS (Malinovskii, Panferov, Ilin, Guo, Richtárik, Alistarh, Nov 2024, arXiv:2411.17525):

| Aspect            | HIGGS        | TurboQuant | Variant F (us) |
|-------------------|--------------|------------|----------------|
| Application       | Weights      | KV cache   | KV cache       |
| RHT preprocessing | YES (origin) | yes        | yes            |
| Quantizer         | Vector grids | Scalar L-M | Scalar L-M     |
| Outlier handling  | —            | 32-channel | — (4bo: 8 ch.) |
| Residual stage    | —            | 1-bit QJL  | — (dropped)    |

The structural pattern (RHT + grid quantization) was introduced for LLM quantization by HIGGS in November 2024, 5 months before the published TurboQuant. TurboQuant adapted it to KV cache with QJL + outliers. Our Variant F dropped the QJL and outlier additions, leaving a structure closer to HIGGS than to the published TurboQuant.

Updated:

- README.md / README.ko.md References & Citations sections to credit HIGGS prominently, with the lineage explained
- bench/results/turboquant_reproduction.md header with the attribution update note linking to Tim Dettmers' comment
- All references explicitly state we don't claim our shipped variant is the TurboQuant algorithm — it's our own simplification

This is exactly the kind of external feedback that should reshape attribution. The honest credit story is more credible than the inflated one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b78ae1c commit 9481870

3 files changed

Lines changed: 17 additions & 11 deletions


README.ko.md

Lines changed: 6 additions & 5 deletions
@@ -480,14 +480,15 @@ Runs on Linux, macOS, Windows (MSVC/MinGW), iOS, Android, and WASM.
 
 ## References & Citations
 
-quant.cpp is an independent implementation of published research. For academic use, please cite the original papers:
+quant.cpp is an independent implementation of published research. The Variant F architecture (RHT preprocessing + scalar Lloyd-Max codebook, no QJL stage) sits in the lineage of two prior works:
 
-- **TurboQuant** — Zandieh, Daliri, Hadian, Mirrokni. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate*. ICLR 2026. [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
-- **PolarQuant** — *Quantizing KV Caches with Polar Transformation*. AISTATS 2026. [arXiv:2502.02617](https://arxiv.org/abs/2502.02617)
-- **QJL** — *Quantized Johnson-Lindenstrauss Transform for KV Cache Compression*. AAAI 2025. [arXiv:2406.03482](https://arxiv.org/abs/2406.03482)
+- **HIGGS** — Malinovskii, Panferov, Ilin, Guo, Richtárik, Alistarh. *Pushing the Limits of Large Language Model Quantization via the Linearity Theorem*. Nov 2024. [arXiv:2411.17525](https://arxiv.org/abs/2411.17525). HIGGS introduced the **Random Hadamard Transform + MSE-optimal grid quantization** pattern for weight quantization; our `tq_rht.c` (Walsh-Hadamard + Rademacher) follows this pattern. *Thanks to Tim Dettmers for pointing this out in the [llama.cpp #20969 discussion](https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16481725).*
+- **TurboQuant** — Zandieh, Daliri, Hadian, Mirrokni. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate*. ICLR 2026. [arXiv:2504.19874](https://arxiv.org/abs/2504.19874). TurboQuant applies the rotation pattern to the **KV cache**, adding a 1-bit QJL residual and per-channel outlier handling. Our work began as a direct port of TurboQuant and was simplified over 9 Karpathy-loop rounds (QJL removed, outlier channels removed) into the current Variant F. We do not claim the shipped variant is the TurboQuant algorithm — it is an empirically-derived simplification.
+- **PolarQuant** — *Quantizing KV Caches with Polar Transformation*. AISTATS 2026. [arXiv:2502.02617](https://arxiv.org/abs/2502.02617). The polar-coordinate KV quantization behind our `tq_polar.c` baseline.
+- **QJL** — *Quantized Johnson-Lindenstrauss Transform for KV Cache Compression*. AAAI 2025. [arXiv:2406.03482](https://arxiv.org/abs/2406.03482). The 1-bit sketch building block, used in our `tq_qjl.c` baseline; we found it contributed ~zero to attention scores in the Variant F regime and dropped it.
 - [TurboQuant — Google Research blog](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/)
 
-If you use quant.cpp in academic work, please cite the original papers together with this repository.
+**Honest attribution**: Variant F's structure (RHT + scalar grid quantization) is closest in spirit to HIGGS, applied to the KV cache like TurboQuant, with the QJL residual and the outlier channel split removed through ablation. If you use quant.cpp in academic work, please cite all three papers (HIGGS, TurboQuant, PolarQuant) together with this repository.
 
 ---

README.md

Lines changed: 6 additions & 5 deletions
@@ -495,14 +495,15 @@ Tested extensively (2-bit delta, NF2, online SVD, multi-hash). None reached acce
 
 ## References & Citations
 
-quant.cpp is an independent implementation of published research. Please cite the original papers:
+quant.cpp is an independent implementation of published research. The Variant F architecture (RHT preprocessing + scalar Lloyd-Max codebook on rotated values, no QJL stage) sits in a lineage that combines two prior works:
 
-- **TurboQuant** — Zandieh, Daliri, Hadian, Mirrokni. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate*. ICLR 2026. [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
-- **PolarQuant** — *Quantizing KV Caches with Polar Transformation*. AISTATS 2026. [arXiv:2502.02617](https://arxiv.org/abs/2502.02617)
-- **QJL** — *Quantized Johnson-Lindenstrauss Transform for KV Cache Compression*. AAAI 2025. [arXiv:2406.03482](https://arxiv.org/abs/2406.03482)
+- **HIGGS** — Malinovskii, Panferov, Ilin, Guo, Richtárik, Alistarh. *Pushing the Limits of Large Language Model Quantization via the Linearity Theorem*. Nov 2024. [arXiv:2411.17525](https://arxiv.org/abs/2411.17525). HIGGS introduced the **Random Hadamard Transform + MSE-optimal grid quantization** pattern (for weight quantization). Our `tq_rht.c` Walsh-Hadamard + Rademacher implementation follows this pattern. *Credit to Tim Dettmers ([discussion thread](https://github.com/ggml-org/llama.cpp/discussions/20969#discussioncomment-16481725)) for pointing this out.*
+- **TurboQuant** — Zandieh, Daliri, Hadian, Mirrokni. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate*. ICLR 2026. [arXiv:2504.19874](https://arxiv.org/abs/2504.19874). TurboQuant applies the rotation pattern to the **KV cache** with a 1-bit QJL residual stage and per-channel outlier handling. Our work started as a literal port of TurboQuant; through 9 rounds of Karpathy-loop iteration we simplified it (dropped QJL, dropped outlier channels) into the current Variant F. We do not claim our shipped variant is the TurboQuant algorithm — it is an empirically-derived simplification.
+- **PolarQuant** — *Quantizing KV Caches with Polar Transformation*. AISTATS 2026. [arXiv:2502.02617](https://arxiv.org/abs/2502.02617). The polar-coordinate KV quantization that our `tq_polar.c` baseline implements.
+- **QJL** — *Quantized Johnson-Lindenstrauss Transform for KV Cache Compression*. AAAI 2025. [arXiv:2406.03482](https://arxiv.org/abs/2406.03482). The 1-bit sketch building block. Used in our `tq_qjl.c` baseline; we found it contributed ~zero to attention scores in the Variant F regime and dropped it.
 - [Google Research blog post on TurboQuant](https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/)
 
-If you use quant.cpp in academic work, please cite both the underlying paper(s) and this repository.
+**Honest attribution**: Variant F's structure (RHT + scalar grid quantization) is closest to HIGGS in spirit, applied to KV cache like TurboQuant, with both the QJL residual and the outlier channel split removed through ablation. If you use quant.cpp in academic work, please cite all three (HIGGS, TurboQuant, PolarQuant) and this repository.
 
 ---

bench/results/turboquant_reproduction.md

Lines changed: 5 additions & 1 deletion
@@ -1,4 +1,8 @@
-# TurboQuant Paper Reproduction — From "Broken" to "Beats Production"
+# Variant F derivation — from TurboQuant literal port to HIGGS-style simplification
+
+> **Important attribution update (2026-04-08)**: Following [Tim Dettmers' comment in llama.cpp #20969](https://github.com/ggml-org/llama.cpp/discussions/20969), we now credit **HIGGS** (Malinovskii et al., Nov 2024, [arXiv:2411.17525](https://arxiv.org/abs/2411.17525)) for the Random Hadamard Transform + scalar grid quantization pattern. The shipped Variant F is structurally closest to HIGGS (RHT + MSE-optimal grids on rotated values), applied to KV cache like TurboQuant, with both the QJL residual stage and the per-channel outlier split removed through ablation. We do **not** claim our shipped variant is the published TurboQuant algorithm — it is an empirically-derived simplification arrived at through 9 Karpathy-loop rounds.
+
 
 > Run date: 2026-04-08
 > Paper: [Zandieh et al., *TurboQuant*, ICLR 2026](https://arxiv.org/abs/2504.19874)
