Commit bb62094

unamedkr and claude committed

docs/pr: Reddit r/LocalLLaMA post for v0.7.1 (EN + KO)

Two posts written for Reddit r/LocalLLaMA announcing the v0.7.1 release with the Round 10 NEON tbl breakthrough and the Round 11 extension to 3b/5b. Both versions:

- Lead with the headline result (7.1× compression at fp32 parity)
- Walk through the 11-round Karpathy journey honestly, including the 4 corrections we caught before publishing
- Frame what we are NOT (not TurboQuant, not the fastest GPU inference, not 100+ models, not yet at parity for 5b/3b)
- Point at reproduction commands and commit hashes
- Invite critical feedback (cross-impl comparisons, AVX2 ports, 5b/3b unpack bottleneck ideas)

Length: ~126 lines each, intentionally substantive — the audience on r/LocalLLaMA values measurement transparency over marketing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent e6ac01c commit bb62094

File tree: 2 files changed, +252 −0 lines changed
Lines changed: 126 additions & 0 deletions
# r/LocalLLaMA — quant.cpp v0.7.1 — KV cache compression at fp32 KV speed (single-header C, 11 Karpathy rounds)

## Title (≤ 300 chars)

quant.cpp v0.7.1: I spent 4 sessions optimizing a single-header C KV cache quantizer. Round 10 finally hit fp32 KV speed parity at 7.1× compression on Llama 3.2 3B. Honest write-up with the 4 corrections we caught before publishing.
## Body

**TL;DR**: Single-header (628 KB) C reference engine for KV cache quantization. After 11 Karpathy-loop rounds, `turbo_kv_4b` matches uncompressed FP32 KV speed (−1.4%, within noise) at **7.1× memory compression** with a **+3.8% PPL** trade-off on Llama 3.2 3B. CPU-only build; runs on iOS/Android/WASM/MSVC/microcontrollers. Apache 2.0. https://github.com/quantumaikr/quant.cpp

---
### What this is

quant.cpp is a small C inference engine I've been working on, focused on **KV cache quantization research**. It started as a literal port of the [TurboQuant paper (Zandieh et al., ICLR 2026)](https://arxiv.org/abs/2504.19874) and, through 11 rounds of measurement-driven iteration, converged into something simpler that I wanted to share.

The differentiator is **single-header portability**. The whole engine is one 628 KB `quant.h` file you can drop into any C/C++ project (no Cargo, no Python, no PyTorch, no framework). Build with `cc app.c -lm -lpthread` and you have a working LLM with a 7× compressed KV cache. It runs on iOS, Android, WASM (192 KB binary), MSVC, and microcontrollers.
### The headline result (Llama 3.2 3B Instruct, CPU-only build, 3-run average)

| KV type | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|---|---:|---:|---:|---:|---:|---:|
| FP32 KV | — | — | 13.56 | — | 18.43 | baseline |
| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | 14.08 | **+3.8%** | **18.17** | **−1.4%** |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 13.65 | **+0.7%** | 16.80 | −8.8% |
| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 16.57 | −10.1% |
| `uniform_4b` (legacy) | 68 | 7.5× | 14.60 | +7.7% | 13.27 | −26.8% |

`turbo_kv_4b` now Pareto-dominates `uniform_4b` on every axis (better PPL, faster, comparable compression), and it runs at **fp32 KV speed parity** while compressing 7.1×.
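A quick consistency check on the table: the byte counts and ratios all line up if one KV block covers 128 fp32 elements (512 bytes uncompressed), with the remainder being per-block scale/metadata. That block size is inferred from the numbers above, not read from `quant.h`:

```c
/* Sanity check of the compression column, assuming one KV block covers
 * 128 fp32 elements (128 * 4 = 512 uncompressed bytes). The block size
 * is an inference from the byte counts in the table, not a fact pulled
 * from quant.h. */
static double kv_ratio(int block_bytes) { return 512.0 / block_bytes; }

/* kv_ratio(72) ≈ 7.1, kv_ratio(88) ≈ 5.8,
 * kv_ratio(56) ≈ 9.1, kv_ratio(68) ≈ 7.5 -- matching the table. */
```

Under that assumption, 72 bytes would break down as 64 bytes of 4-bit indices plus an 8-byte header, 88 as 80 + 8 for 5-bit, 56 as 48 + 8 for 3-bit, and 68 as 64 + 4 for `uniform_4b`.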
### The journey (11 rounds, 4 sessions, 4 honest corrections)

This isn't a "tada, I built a thing" post. It's a record of **measurement discipline**.

**Round 0** — Literal TurboQuant port: PPL 16.03, far slower than `uniform_4b`. Embarrassing.

**Round 6 (Variant F)** — A Karpathy ablation revealed that the QJL residual stage contributed *byte-identical zero* to attention scores. Dropped it and reinvested the 16 bytes per block in a finer Lloyd-Max codebook (3-bit → 4-bit, 8 → 16 levels). PPL 16.03 → 14.28. Structural simplification, not tuning.

**Rounds 7–9** — Local fusion, NEON unrolling, LUT hoisting, prefetch. Each gave at most +5%. Stuck at −7% vs fp32.

**Round 10 — the breakthrough**. After three sessions of guessing, I finally ran the existing `--profile` flag. The data was unambiguous: matmul was identical between fp32 and quant (38.6 vs 38.9 ms; both share the same NEON tbl matmul kernel). The entire 8% speed gap was in the attention dot-product loop. The fp32 path was 4-way NEON SIMD; mine was scalar, roughly 2× more instructions per element. **Compute-bound, not memory-bound** — surprising for a 16-entry LUT.

The fix: NEON's `vqtbl1q_s8` (on Apple Silicon), a single instruction that performs 16 byte-table lookups across 16 lanes. Quantize the 16 Lloyd-Max-Gaussian centroids to int8 once at startup (~1% precision loss, well within the regression test's cosine ≥ 0.99 threshold), store them in a 16-byte register, and the inner loop becomes:
```c
uint8x16_t bytes = vld1q_u8(mi); // 16B = 32 nibbles
uint8x16_t low_nib = vandq_u8(bytes, vdupq_n_u8(0x0F));
uint8x16_t high_nib = vshrq_n_u8(bytes, 4);
int8x16_t low_vals = vqtbl1q_s8(cb_vec, low_nib); // 1 instr, 16 gathers
int8x16_t high_vals = vqtbl1q_s8(cb_vec, high_nib);
// ... interleave + int8→fp32 + per-block scale + vfmaq_f32
```
32 elements per inner-loop iteration (vs 8 in the previous scalar version). Result: **fp32 parity** — +4.5% on a single representative run, +0.8% on the 3-run average. PPL even improved slightly (the int8 codebook discretization happens to align favorably).

**Round 11 (v0.7.1)** applied the same pattern to 5b/3b. The lookup side scales well (1 instruction per 16 lanes for any small codebook), but the **bit-unpack side** is the new bottleneck: 5-bit and 3-bit indices straddle byte boundaries irregularly, so unpacking 16 indices needs scalar shifts. 5b improved from −14.5% to −8.8% (a +9% speed jump), 3b from −13% to −10%. Not full parity, but meaningful.
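To make the straddling concrete: index *i* occupies bits [5i, 5i+5), so both its byte offset and its shift change for every element. A minimal scalar sketch (illustrative names; not the kernel in `quant.h`):

```c
#include <stdint.h>
#include <string.h>

/* Scalar 5-bit unpack: unlike 4-bit nibbles, there is no fixed per-byte
 * pattern for vqtbl1q_s8 to exploit -- byte offset (5i/8) and shift
 * (5i%8) differ per element. Illustrative only; the actual kernel in
 * quant.h may differ. */
static void unpack16_5bit(const uint8_t packed[10], uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        int bit = 5 * i;
        uint16_t w = packed[bit >> 3];
        if ((bit & 7) > 3)                  /* index straddles a byte boundary */
            w |= (uint16_t)(packed[(bit >> 3) + 1] << 8);
        out[i] = (w >> (bit & 7)) & 0x1F;
    }
}

/* Matching packer, so the round-trip can be checked. */
static void pack16_5bit(const uint8_t in[16], uint8_t packed[10]) {
    memset(packed, 0, 10);
    for (int i = 0; i < 16; i++) {
        int bit = 5 * i;
        packed[bit >> 3] |= (uint8_t)(in[i] << (bit & 7));
        if ((bit & 7) > 3)
            packed[(bit >> 3) + 1] |= (uint8_t)(in[i] >> (8 - (bit & 7)));
    }
}
```

The per-element variable shift in `unpack16_5bit` is exactly the scalar work Round 11 could not vectorize.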
### The honest correction record (4 events)

I started with an inflated "lossless 7×" claim and walked it back four times before publishing widely. Each correction taught a lesson now recorded in persistent memory:

1. **v0.6.0** "lossless 7× compression" → after measurement, "+6.3% PPL on Llama 3.2 3B"
2. **v0.6.4** "turbo_kv beats fp32 KV speed" → discovered the fp32 attention path was unoptimized scalar; once both had NEON, the honest gap was −7%
3. **v0.6.5** "with Metal" → discovered the existing Metal backend is *net negative* (13–40% slower) on every model size from SmolLM 135M to Gemma 4 26B. The CMake default is OFF, but our internal benchmarks had been wrong by 14–22% for 5 releases. [Filed issue #16](https://github.com/quantumaikr/quant.cpp/issues/16).
4. **v0.6.5 post**: [@TimDettmers](https://github.com/TimDettmers) (HIGGS / QLoRA / bitsandbytes) commented in a [llama.cpp discussion thread](https://github.com/ggml-org/llama.cpp/discussions/20969) — not addressed to us directly, but the substance applied — that the RHT + scalar grid pattern we had been calling "TurboQuant" actually originates in HIGGS (Malinovskii et al., Nov 2024). We added HIGGS credit to all docs within 24 hours, and reframed "Tim gave us feedback" to "Tim's general comment, which we observed" after a user pointed out we had overstated the relationship.

If you're skeptical of any number above, **all measurements are reproducible**: `cmake -B build && cmake --build build && ./build/quant model.gguf --ppl bench/data/ppl_1k.txt -k turbo_kv_4b`.
### Honest framing (what this isn't)

- **Not a TurboQuant implementation.** Ablation dropped both the QJL residual and the per-channel outlier handling the published paper uses. What we ship is structurally closer to HIGGS (RHT + scalar grid quantization) than to TurboQuant. Both are credited in our docs.
- **Not the fastest GPU inference.** llama.cpp owns that with full Metal/CUDA tensor graphs. We're CPU-only and proud of it.
- **Not the most feature-complete.** 7 architectures verified, not 100+. The single-header constraint rules out many features.
- **Not yet validated on Llama 3.1 8B** (the paper baseline). We tried — Q8_0 hit swap on 16 GB RAM, and Q4_K_M was prohibitively slow. Tracked as a TODO.
- **Not at parity for 5b/3b yet.** Round 11 closed the gap significantly, but they sit at −9% / −10%. Future work.
### Cross-size validation (3 Llama-family models, all CPU-only)

| Model | turbo_kv_4b PPL Δ | turbo_kv_5b PPL Δ |
|---|---|---|
| SmolLM2 135M | +5.8% | +1.7% |
| Llama 3.2 1B | +7.3% | **+0.7%** |
| Llama 3.2 3B | +5.7% | **+0.7%** |

`turbo_kv_5b` is consistently near-lossless across model sizes (~1% PPL Δ).
### Try it

```bash
git clone https://github.com/quantumaikr/quant.cpp
cd quant.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release  # default: TQ_BUILD_METAL=OFF
cmake --build build -j

# Download a small model
hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf --local-dir models/

./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "안녕!" -j 8
```

`turbo_kv_4b` is the default. Use `-k turbo_kv_5b` for near-lossless quality, `-k turbo_kv_3b` for maximum compression.
### Where the value is

Honestly, 7.1× compression at fp32 parity is the headline number. But after 4 sessions, what I think is more valuable is the **measurement transparency**. Every claim links to a reproduction script. Every release's notes mention the corrections from the previous release. The 11-round Karpathy history with commit hashes is in [`bench/results/turboquant_reproduction.md`](https://github.com/quantumaikr/quant.cpp/blob/main/bench/results/turboquant_reproduction.md). If a future paper wants to cite a "single-header C reference implementation of HIGGS-style KV quantization", this is it.
### Roadmap (next sessions)

- v0.7.2: a 1-byte-per-index 5b variant for full parity (trading compression for speed)
- v0.8.0: AVX2 + WASM SIMD ports of the NEON tbl pattern
- v0.9.0: `vusdotq` exploration to potentially exceed fp32 (ARMv8.6+)
- v1.0.0: arXiv submission + spec compliance test suite + llama.cpp PR
### Links

- Repo: https://github.com/quantumaikr/quant.cpp
- v0.7.1 release notes: https://github.com/quantumaikr/quant.cpp/releases/tag/v0.7.1
- Round 10 commit: https://github.com/quantumaikr/quant.cpp/commit/2537a12
- The llama.cpp discussion thread we participate in: https://github.com/ggml-org/llama.cpp/discussions/20969
- Reproduction history: https://github.com/quantumaikr/quant.cpp/blob/main/bench/results/turboquant_reproduction.md

Critical feedback welcome. Especially:

- Cross-implementation comparisons (MLX, Rust forks, llama.cpp turboquant forks) on the same hardware
- Anyone running Llama 3.1 8B with quant.cpp on a 32+ GB box
- AVX2 / SIMD128 implementations of the same pattern
- Ideas for the 5b/3b unpack bottleneck (SIMD bit-extraction tricks?)
Lines changed: 126 additions & 0 deletions
# r/LocalLLaMA — quant.cpp v0.7.1 — KV cache compression at fp32 KV speed (single-header C, 11 Karpathy rounds)

## Title (≤ 300 chars)

quant.cpp v0.7.1: I spent 4 sessions optimizing a single-header C KV cache quantizer. Round 10 finally hit fp32 KV speed parity at 7.1× compression on Llama 3.2 3B. Honest write-up with 4 corrections we caught before publishing.
## Body

**TL;DR**: Single-header (628 KB) C reference engine for KV cache quantization. After 11 Karpathy-loop rounds, `turbo_kv_4b` matches uncompressed FP32 KV speed (−1.4%, within noise) at **7.1× memory compression** with a **+3.8% PPL** trade-off on Llama 3.2 3B. CPU-only build; runs on iOS/Android/WASM/MSVC/microcontrollers. Apache 2.0. https://github.com/quantumaikr/quant.cpp

---
### What this is

quant.cpp is a small C inference engine I've been working on, focused on **KV cache quantization research**. It started as a literal port of the [TurboQuant paper (Zandieh et al., ICLR 2026)](https://arxiv.org/abs/2504.19874) and converged through 11 rounds of measurement-driven iteration into something simpler that I wanted to share.

The differentiator is **single-header portability**. The whole engine is one 628 KB `quant.h` you can drop into any C/C++ project (no Cargo, no Python, no PyTorch, no framework). Build with `cc app.c -lm -lpthread` and you have a working LLM with a 7× compressed KV cache. It runs on iOS, Android, WASM (192 KB binary), MSVC, and microcontrollers.
### The headline result (Llama 3.2 3B Instruct, CPU-only build, 3-run average)

| KV type | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|---|---:|---:|---:|---:|---:|---:|
| FP32 KV | — | — | 13.56 | — | 18.43 | baseline |
| **`turbo_kv_4b`** ⭐ default | **72** | **7.1×** | 14.08 | **+3.8%** | **18.17** | **−1.4%** |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 13.65 | **+0.7%** | 16.80 | −8.8% |
| `turbo_kv_3b` | 56 | 9.1× | 15.36 | +13.3% | 16.57 | −10.1% |
| `uniform_4b` (legacy) | 68 | 7.5× | 14.60 | +7.7% | 13.27 | −26.8% |

`turbo_kv_4b` is now Pareto-dominant over `uniform_4b` on every axis (better PPL, faster, comparable compression). And it runs at **fp32 KV speed parity** while compressing 7.1×.
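The byte counts and ratios in the table are mutually consistent if each block covers 128 fp32 elements (512 bytes uncompressed), with the remainder being per-block scale/metadata. A one-liner to check, noting that the 128-element block size is inferred from the table, not stated by the repo:

```c
/* Sanity check of the compression column under the assumption that one
 * KV block covers 128 fp32 elements (128 * 4 = 512 uncompressed bytes).
 * The block size is an inference from the byte counts, not read from
 * quant.h. */
static double kv_ratio(int block_bytes) { return 512.0 / block_bytes; }

/* kv_ratio(72) ≈ 7.1, kv_ratio(88) ≈ 5.8,
 * kv_ratio(56) ≈ 9.1, kv_ratio(68) ≈ 7.5 -- matching the table. */
```

Under that assumption, the per-block overhead would be 8 bytes for the `turbo_kv_*` types (e.g. 72 = 64 bytes of nibbles + 8) and 4 bytes for `uniform_4b` (68 = 64 + 4).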
### The journey (11 rounds, 4 sessions, 4 honest corrections)

This isn't a "tada, I built a thing" post. It's a record of **measurement discipline**.

**Round 0** — Literal TurboQuant port: PPL 16.03, way slower than `uniform_4b`. Embarrassing.

**Round 6 (Variant F)** — A Karpathy ablation revealed that the QJL residual stage contributed *byte-identical zero* to attention scores. Dropped it, reinvested the 16 bytes per block in a finer Lloyd-Max codebook (3-bit → 4-bit, 8 → 16 levels). PPL 16.03 → 14.28. Structural simplification, not tuning.

**Rounds 7–9** — Local fusions, NEON unrolling, LUT hoisting, prefetch. Each gave at most +5%. Stuck at −7% vs fp32.

**Round 10 — the breakthrough**. After three sessions of guessing, I finally ran the existing `--profile` flag. The data was unambiguous: matmul was identical between fp32 and quant (38.6 vs 38.9 ms; both share the same NEON tbl matmul kernel). The entire 8% speed gap was in the attention dot-product loop. The fp32 path was 4-way NEON SIMD; mine was scalar, roughly 2× more instructions per element. **Compute-bound, not memory-bound** — surprising for a 16-entry LUT.

The fix: NEON's `vqtbl1q_s8` (on Apple Silicon), a single instruction that performs 16 byte-table lookups across 16 lanes. Quantize the 16 Lloyd-Max-Gaussian centroids to int8 once at startup (~1% precision loss, well within the regression test's cosine ≥ 0.99 threshold), store them in a 16-byte register, and the inner loop becomes:
```c
uint8x16_t bytes = vld1q_u8(mi); // 16B = 32 nibbles
uint8x16_t low_nib = vandq_u8(bytes, vdupq_n_u8(0x0F));
uint8x16_t high_nib = vshrq_n_u8(bytes, 4);
int8x16_t low_vals = vqtbl1q_s8(cb_vec, low_nib); // 1 instr, 16 gathers
int8x16_t high_vals = vqtbl1q_s8(cb_vec, high_nib);
// ... interleave + int8→fp32 + per-block scale + vfmaq_f32
```
32 elements per inner-loop iteration (vs 8 in the previous scalar version). Result: **fp32 parity** — +4.5% on a single representative run, +0.8% on the 3-run average. PPL also slightly improved (the int8 codebook discretization happens to align favorably).
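For contrast, the per-element work the tbl path replaces looks roughly like this in portable scalar C (a sketch with illustrative names, not the repo's actual loop; the low-nibble-first element order is an assumption):

```c
#include <stdint.h>

/* Scalar 4-bit dot-product contribution for one 8-element group: a byte
 * load, shift/mask, table load, and multiply-add per element -- roughly
 * 2x the instruction count of the 32-element tbl version above.
 * Illustrative sketch only; nibble order is an assumption. */
static float dot8_scalar(const uint8_t packed[4], const float codebook[16],
                         float scale, const float q[8]) {
    float acc = 0.0f;
    for (int j = 0; j < 8; j++) {
        uint8_t byte = packed[j >> 1];                       /* 2 elems/byte */
        uint8_t idx  = (j & 1) ? (uint8_t)(byte >> 4)
                               : (uint8_t)(byte & 0x0F);
        acc += codebook[idx] * scale * q[j];
    }
    return acc;
}
```

Every iteration touches one element; the `vqtbl1q_s8` version decodes 32 at a time with a handful of instructions.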
**Round 11 (v0.7.1)** applied the same pattern to 5b/3b. The lookup side scales (1 instruction per 16 lanes for any small codebook), but the **bit-unpack side** is the new bottleneck: 5-bit and 3-bit indices straddle byte boundaries irregularly, so unpacking 16 indices needs scalar shifts. 5b improved from −14.5% to −8.8% (a +9% speed jump), 3b from −13% to −10%. Not full parity, but significant.
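Concretely, index *i* of a 5-bit stream occupies bits [5i, 5i+5), so its byte offset and shift change for every element; there is no fixed per-byte pattern for `vqtbl1q_s8` to exploit the way nibbles allow. A minimal scalar sketch of the unpack (illustrative names, not the kernel shipped in `quant.h`):

```c
#include <stdint.h>
#include <string.h>

/* Scalar 5-bit unpack: byte offset (5i/8) and shift (5i%8) differ per
 * element, which is the irregularity described above. Illustrative
 * only; the actual kernel in quant.h may differ. */
static void unpack16_5bit(const uint8_t packed[10], uint8_t out[16]) {
    for (int i = 0; i < 16; i++) {
        int bit = 5 * i;
        uint16_t w = packed[bit >> 3];
        if ((bit & 7) > 3)                  /* index straddles a byte boundary */
            w |= (uint16_t)(packed[(bit >> 3) + 1] << 8);
        out[i] = (w >> (bit & 7)) & 0x1F;
    }
}

/* Matching packer, so the round-trip can be checked. */
static void pack16_5bit(const uint8_t in[16], uint8_t packed[10]) {
    memset(packed, 0, 10);
    for (int i = 0; i < 16; i++) {
        int bit = 5 * i;
        packed[bit >> 3] |= (uint8_t)(in[i] << (bit & 7));
        if ((bit & 7) > 3)
            packed[(bit >> 3) + 1] |= (uint8_t)(in[i] >> (8 - (bit & 7)));
    }
}
```

The data-dependent shift per element is exactly what keeps this loop scalar after Round 11.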
### The honest correction record (4 events)

I started with an inflated "lossless 7×" claim and walked it back four times before publishing widely. Each correction taught a lesson now recorded in persistent memory:

1. **v0.6.0** "lossless 7× compression" → measured at "+6.3% PPL on Llama 3.2 3B"
2. **v0.6.4** "turbo_kv beats fp32 KV speed" → discovered the fp32 attention path was unoptimized scalar; once both had NEON, the honest gap was −7%
3. **v0.6.5** "with Metal" → discovered the existing Metal backend is currently *net negative* (13–40% slower) on every model size from SmolLM 135M to Gemma 4 26B, due to per-matmul dispatch overhead. The CMake default is OFF, but our internal benchmarks had been wrong by 14–22% for 5 releases. [Filed issue #16](https://github.com/quantumaikr/quant.cpp/issues/16).
4. **v0.6.5 post**: [@TimDettmers](https://github.com/TimDettmers) (HIGGS / QLoRA / bitsandbytes) commented in a [llama.cpp discussion thread](https://github.com/ggml-org/llama.cpp/discussions/20969) — not addressed directly to us, but the substance applied — that the RHT + scalar grid pattern we had been calling "TurboQuant" originally comes from HIGGS (Malinovskii et al., Nov 2024). We updated all docs to credit HIGGS within 24 hours, and reframed "Tim gave us feedback" to "Tim's general comment, which we observed" once a user pointed out we'd overstated the relationship.

If you're skeptical of any number above, **all measurements are reproducible** with `cmake -B build && cmake --build build && ./build/quant model.gguf --ppl bench/data/ppl_1k.txt -k turbo_kv_4b`.
### Honest framing (what this isn't)

- **Not a TurboQuant implementation.** Through ablation we dropped both the QJL residual and the per-channel outlier handling that the published paper uses. What we ship is structurally closer to HIGGS (RHT + scalar grid quantization) than to TurboQuant. Both are credited in our docs.
- **Not the fastest GPU inference.** llama.cpp owns that with full Metal/CUDA tensor graphs. We're CPU-only and proud of it.
- **Not the most feature-complete.** 7 architectures verified, not 100+. The single-header constraint excludes many features.
- **Not yet validated on Llama 3.1 8B** (the paper baseline). We tried — Q8_0 hit swap on 16 GB RAM, and Q4_K_M was prohibitively slow. Tracked as a TODO.
- **Not at parity for 5b/3b yet.** Round 11 closed the gap significantly, but they sit at −9% / −10%. Future work.
### Cross-size validation (3 Llama-family models, all CPU-only)

| Model | turbo_kv_4b PPL Δ | turbo_kv_5b PPL Δ |
|---|---|---|
| SmolLM2 135M | +5.8% | +1.7% |
| Llama 3.2 1B | +7.3% | **+0.7%** |
| Llama 3.2 3B | +5.7% | **+0.7%** |

`turbo_kv_5b` is consistently near-lossless across model sizes (~1% PPL Δ).
### Try it

```bash
git clone https://github.com/quantumaikr/quant.cpp
cd quant.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release  # default: TQ_BUILD_METAL=OFF
cmake --build build -j

# Download a small model
hf download bartowski/SmolLM2-135M-Instruct-GGUF SmolLM2-135M-Instruct-Q8_0.gguf --local-dir models/

./build/quant models/SmolLM2-135M-Instruct-Q8_0.gguf --chat -p "Hello!" -j 8
```

`turbo_kv_4b` is the default. Use `-k turbo_kv_5b` for near-lossless quality, `-k turbo_kv_3b` for max compression.
### Where the value is

Honestly, the 7.1× compression at fp32 parity is the headline number. But after 4 sessions, what I think is more valuable is the **measurement transparency**. Every claim links to a reproduction script. Every release's notes mention the corrections from the previous release. The 11-round Karpathy history with commit hashes is in [`bench/results/turboquant_reproduction.md`](https://github.com/quantumaikr/quant.cpp/blob/main/bench/results/turboquant_reproduction.md). If a future paper wants to cite a "single-header C reference implementation of HIGGS-style KV quantization", this is it.
### Roadmap (next sessions)

- v0.7.2: a 1-byte-per-index 5b variant for full parity (trade compression for speed)
- v0.8.0: AVX2 + WASM SIMD ports of the NEON tbl pattern
- v0.9.0: `vusdotq` exploration to potentially exceed fp32 (ARMv8.6+)
- v1.0.0: arXiv submission + spec compliance test suite + llama.cpp PR
### Links

- Repo: https://github.com/quantumaikr/quant.cpp
- v0.7.1 release notes: https://github.com/quantumaikr/quant.cpp/releases/tag/v0.7.1
- Round 10 commit: https://github.com/quantumaikr/quant.cpp/commit/2537a12
- The llama.cpp discussion thread we participate in: https://github.com/ggml-org/llama.cpp/discussions/20969
- Reproduction history: https://github.com/quantumaikr/quant.cpp/blob/main/bench/results/turboquant_reproduction.md

Critical feedback welcome. Especially:

- Cross-implementation comparisons (MLX, Rust forks, llama.cpp turboquant forks) on the same hardware
- Anyone with Llama 3.1 8B running quant.cpp on a 32+ GB box
- AVX2 / SIMD128 implementations of the same pattern
- Suggestions for the 5b/3b unpack bottleneck (SIMD bit-extraction tricks?)
