LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.
TurboQuant KV cache compression for MLX with fused Metal kernels. 4.6x compression at 98% FP16 speed.
Based on Google's TurboQuant (ICLR 2026), Quansloth brings state-of-the-art KV cache compression to local LLM inference. Quansloth is a fully private, air-gapped AI server that runs massive-context models natively on consumer hardware.
TurboQuant KV cache compression plugin for vLLM — asymmetric K/V, 8 models validated, consumer GPUs
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
Vector compression with TurboQuant codecs for embeddings, retrieval, and KV-cache. 10x compression, pure NumPy core — optional GPU acceleration via PyTorch (CUDA/MPS) or MLX (Metal).
Near-optimal vector quantization from Google's ICLR 2026 paper — 95% recall, 5x compression, zero preprocessing, pure Python FAISS replacement
Minimal, zero-dependency LLM inference in pure C11. CPU-first with NEON/AVX2 SIMD. Flash MoE (pread + LRU expert cache). TurboQuant 3-bit KV compression (8.9x less memory per session). 20+ GGUF quant formats. Compiles to WASM.
TurboQuant: Native 3-Bit Quantization for Ollama - Achieve 25-28% better compression than Q4_0 while maintaining high-speed CPU inference. Experimentally integrated into Ollama with custom GGML kernels for LLM efficiency.
Fused Triton kernels for TurboQuant KV cache compression — 2-4 bit quantization with RHT rotation. Drop-in HuggingFace & vLLM integration. Up to 4.9x KV cache compression for Llama, Qwen, Mistral, and more.
HIP/ROCm fork of llama.cpp optimized for AMD gfx1030/RDNA2 architecture with support for PrismML's Bonsai Q1_0_G128 '1-bit' models, TurboQuant TQ3_0 KV cache, and EAGLE3 speculative decoding.
your ai, your rules. — local AI desktop app with hardware-aware model matching, threaded conversations, and TurboQuant integration. no cloud, no subscription, no data leaving your device.
TurboQuant (ICLR 2026) ported to Apple Silicon — KV cache compression with MLX Metal kernels + PyTorch CPU
Hardware-agnostic machine learning infrastructure for .NET. Implements high-performance neural network layers in C# that are transpiled to run on WebGPU, CUDA, OpenCL, WebGL, CPU, and Wasm via SpawnDev.ILGPU. Optimized for Blazor WebAssembly and native GPU execution.
Native Windows build of vLLM v0.17.1 with Triton support and TurboQuant KV cache compression — Qwen 3.5, Llama 4, and more. No WSL, no Docker. Pre-built wheel + patchset for MSVC 2022 + CUDA 12.6.
Near-optimal vector quantization for LLM KV cache compression. Python implementation of TurboQuant (ICLR 2026) — PolarQuant + QJL for 3-bit quantization with minimal accuracy loss and up to 8x memory reduction.
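Several of the projects above center on 2-4 bit KV cache quantization. As a rough, illustrative sketch only (not TurboQuant's actual PolarQuant/QJL or RHT-rotation scheme), a minimal symmetric 3-bit quantizer in pure NumPy looks like this:

```python
# Illustrative sketch: generic symmetric 3-bit quantization of one KV cache
# block in pure NumPy. This is NOT the TurboQuant algorithm itself; it only
# shows the quantize/dequantize round trip these libraries perform. At 3 bits
# per value vs. 16 for FP16, the raw reduction is ~5.3x before metadata and
# packing overhead.
import numpy as np

def quantize_3bit(x: np.ndarray):
    """Map floats to signed 3-bit integers in [-4, 3] with one scale per tensor."""
    max_abs = np.abs(x).max()
    scale = max_abs / 4.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

def dequantize_3bit(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float values from 3-bit codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((64, 128)).astype(np.float16)  # one hypothetical KV block

q, scale = quantize_3bit(kv.astype(np.float32))
recon = dequantize_3bit(q, scale)
err = np.abs(recon - kv.astype(np.float32)).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```

Real implementations add per-group scales, bit-packing into contiguous buffers, and (per the descriptions above) rotations such as RHT to flatten outliers before quantizing.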
ROCm/HIP fork of SGLang with TurboQuant tq2/tq3/tq4 KV cache, Triton and radix-cache serving, EAGLE3 speculative decoding, P-EAGLE checkpoint support, and PrismML Bonsai 1-bit GGUF compatibility on gfx1030/RDNA2.
A TurboQuant implementation for Llama.cpp on AMD GPUs using the Vulkan runtime
Interactive benchmarking tool for TurboQuant KV cache compression. Supports 2-4 bit quantization with real-time metrics.