BitNet-rs is a high-performance Rust inference engine for 1-bit BitNet LLMs.
- SIMD/CUDA/Metal/Vulkan kernels — AVX2/AVX-512/NEON on CPU; CUDA (`gpu`), Metal (`metal`, macOS), Vulkan (`vulkan`), and Intel Arc OpenCL (`opencl`) GPU backends
- Multiple quantization formats — I2_S BitNet32-F16, I2_S QK256 (GGML 256-element blocks), TL1, TL2, IQ2_S via FFI
- Cross-validation — per-token cosine-similarity comparison against Microsoft's C++ reference (>0.99)
- Honest-compute receipts — schema v1.0.0 with 8 validation gates; `compute_path` must be `"real"`
- Chat templates — 59+ template variants (LLaMA-3, Phi-4, Qwen, Gemma, Mistral, DeepSeek, and more); auto-detected from GGUF metadata or tokenizer path
- SLM model support — load and run Phi-4, Qwen, Gemma, Mistral, LLaMA, and SmolLM2 via SafeTensors (quickstart guide)
- SafeTensors → GGUF export — `bitnet-st2gguf` preserves F16 LayerNorm weights
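The ternary formats above all come down to packing weights from {-1, 0, +1} into 2-bit codes plus per-block scales. A toy packing routine for intuition only — this is not the actual I2_S layout, which uses 32- or 256-element blocks with format-specific headers and scale storage:

```rust
// Toy 2-bit ternary packing: maps -1 → 0b00, 0 → 0b01, +1 → 0b10 and
// packs four weights per byte. Real I2_S blocks also carry F16 scales.
fn pack_ternary(weights: &[i8]) -> Vec<u8> {
    weights
        .chunks(4)
        .map(|chunk| {
            let mut byte = 0u8;
            for (i, &w) in chunk.iter().enumerate() {
                let code = (w + 1) as u8; // -1, 0, +1 → 0, 1, 2
                byte |= code << (2 * i);
            }
            byte
        })
        .collect()
}

fn unpack_ternary(packed: &[u8], n: usize) -> Vec<i8> {
    (0..n)
        .map(|i| (((packed[i / 4] >> (2 * (i % 4))) & 0b11) as i8) - 1)
        .collect()
}

fn main() {
    let weights = [-1i8, 0, 1, 1, 0, -1, 1, 0];
    let packed = pack_ternary(&weights);
    assert_eq!(packed.len(), 2); // four weights per byte
    assert_eq!(unpack_ternary(&packed, weights.len()), weights);
}
```

The 4:1 byte ratio (before scales) is where the ~16× weight-size reduction versus F32 comes from.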
v0.2.1-dev (pre-alpha): QK256 uses scalar kernels (~0.1 tok/s on 2B models); use `--max-tokens` 4–16 for validation. AVX2 dequantization is merged; a ≥3× uplift is planned. Significant correctness, performance, and validation work remains.
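The cross-validation gate listed above compares Rust and C++ logits per token using cosine similarity. A minimal sketch of that metric — the function name and the sample logits are illustrative, not BitNet-rs API:

```rust
/// Cosine similarity between two logit vectors.
/// Returns a value in [-1, 1]; 1.0 means identical direction.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Hypothetical per-token logits from the Rust engine and the C++ reference.
    let rust_logits = [0.10_f32, 0.95, -0.20, 0.33];
    let cpp_logits = [0.11_f32, 0.94, -0.19, 0.35];
    let sim = cosine_similarity(&rust_logits, &cpp_logits);
    assert!(sim > 0.99); // the >0.99 gate the project enforces
}
```

A per-token check like this catches divergence at the step where it first appears, rather than only in the final output string.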
```bash
# 1. Download a model
cargo run -p xtask -- download-model --id microsoft/bitnet-b1.58-2B-4T-gguf

# 2. Run inference (always specify --no-default-features --features cpu|gpu)
RUST_LOG=warn cargo run -p bitnet-cli --no-default-features --features cpu,full-cli -- run \
  --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
  --tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.json \
  --prompt "What is 2+2?" --max-tokens 8

# 3. Interactive chat
RUST_LOG=warn cargo run -p bitnet-cli --no-default-features --features cpu,full-cli -- chat \
  --model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
  --tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.json
```

Workspace default features are empty — always pass `--no-default-features --features cpu` (or `gpu`). `bitnet-cli` defaults to `cpu,full-cli` when built standalone.
| Feature | State | Notes |
|---|---|---|
| CPU inference — I2_S BitNet32 | ✅ | Production path; 10–20× faster than QK256 scalar |
| CPU inference — I2_S QK256 | ✅ | Scalar kernels (~0.1 tok/s on 2B); AVX2 foundation merged |
| GPU inference — CUDA | 🔶 | Scaffolded; receipt validation pending |
| GPU inference — Metal | 🧪 | Feature gate + kernel stubs; not validated end-to-end |
| GPU inference — Vulkan | 🧪 | Runtime probing compiled; not validated end-to-end |
| GPU inference — Intel oneAPI | 🧪 | Intel CPU/GPU feature gate; not validated end-to-end |
| AMD ROCm detection | 🧪 | Device detection only; inference kernels not yet validated |
| GPU HAL — multi-backend | 🔧 | bitnet-gpu-hal: OpenCL, Vulkan, Metal, ROCm backends; ~780 tests (scaffold; CPU-only validation) |
| Interactive chat (REPL) | ✅ | /help, /clear, /metrics, auto-template detection |
| Cross-validation vs C++ | ✅ | Cosine similarity > 0.99, per-token comparison |
| Honest-compute receipts | ✅ | Schema v1.0.0, 8 validation gates |
| Strict mode | ✅ | Runtime guards prevent mock fallback |
| SafeTensors → GGUF export | ✅ | bitnet-st2gguf with F16 LayerNorm preservation |
| Server / HTTP API | 🚧 | Health endpoints wired; inference endpoints have TODOs |
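The receipt gating in the table works by refusing any run whose compute path was mocked. A toy sketch of one such check — the struct and function names are hypothetical, and this shows only the `compute_path` gate out of the real schema's 8:

```rust
// Toy model of an honest-compute receipt check (illustrative, not the
// v1.0.0 schema). Only the compute_path gate is sketched here.
struct Receipt {
    compute_path: String,
}

fn validate(receipt: &Receipt) -> Result<(), String> {
    if receipt.compute_path != "real" {
        return Err(format!(
            "compute_path must be \"real\", got {:?}",
            receipt.compute_path
        ));
    }
    Ok(())
}

fn main() {
    let good = Receipt { compute_path: "real".into() };
    let bad = Receipt { compute_path: "mock".into() };
    assert!(validate(&good).is_ok());
    assert!(validate(&bad).is_err());
}
```

The point of gating at validation time is that a mocked fallback produces a receipt that fails loudly instead of silently passing as real compute.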
BitNet-rs supports inference on multiple GPU platforms:
| Backend | Feature Flag | Status | Hardware |
|---|---|---|---|
| NVIDIA CUDA | `--features gpu` | 🔶 Alpha | GeForce/Tesla/A100+ |
| Intel Arc (OpenCL) | `--features opencl` | 🧪 Experimental | Arc A770/A750 |
| AMD ROCm | `--features rocm` | 🧪 Scaffold | Unvalidated target: RDNA3-class AMD GPUs |
| Vulkan | `--features vulkan` | 🧪 Scaffold | Any Vulkan 1.3 GPU |
| Apple Metal | `--features metal` | 🧪 Scaffold | M1/M2/M3+ |
| WebGPU | N/A (sub-crate only) | 🧪 Experimental | Browser/wgpu (`bitnet-wgpu`) |
| CPU (SIMD) | `--features cpu` | ✅ Production | x86-64/ARM64 |
```bash
# Install Intel compute runtime (Ubuntu)
sudo apt install intel-opencl-icd clinfo

# Build with Intel GPU support
cargo build --release --no-default-features --features opencl,full-cli

# Run inference
cargo run --release --no-default-features --features opencl,full-cli -- run \
  --model models/model.gguf --device opencl --prompt "Hello" --max-tokens 32
```

See docs/INTEL_GPU_SETUP.md for detailed setup instructions.
```bash
--device auto     # Auto-detect best available (default)
--device cpu      # Force CPU
--device cuda     # Force NVIDIA CUDA
--device opencl   # Force Intel OpenCL
--device vulkan   # Force Vulkan
```

Data flows top-to-bottom through the workspace:
```text
bitnet-tokenizers ──────────────────────────────────────┐
                                                        │
bitnet-models (GGUF loader, dual I2_S flavor detection) │
  └── bitnet-quantization (I2_S / TL1 / TL2 / IQ2_S)    │
        └── bitnet-kernels (AVX2 / AVX-512 / NEON / CUDA)
                          ▼
            bitnet-inference (autoregressive engine)
              ├── bitnet-logits (temperature / top-k / top-p)
              ├── bitnet-sampling (greedy, nucleus, repetition penalty)
              ├── bitnet-generation (decode loop, stop criteria)
              ├── bitnet-prompt-templates (59+ template variants; auto-detection)
              └── bitnet-receipts (honest-compute receipt schema)
                          │
           ┌──────────────┴──────────────┐
       bitnet-cli                  bitnet-server
```
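The logits → sampling stages in the pipeline above can be sketched as temperature scaling, top-k filtering, then selection. The function names are illustrative, not the `bitnet-logits` / `bitnet-sampling` API:

```rust
/// Keep the k largest logits; push everything else to -inf so it can
/// never be sampled. Illustrative top-k filtering, not project code.
fn top_k_mask(logits: &[f32], k: usize) -> Vec<f32> {
    let mut sorted: Vec<f32> = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a));
    let cutoff = sorted[k - 1];
    logits
        .iter()
        .map(|&l| if l >= cutoff { l } else { f32::NEG_INFINITY })
        .collect()
}

/// Temperature-scale the logits, then pick the argmax (greedy decoding).
fn greedy(logits: &[f32], temperature: f32) -> usize {
    logits
        .iter()
        .map(|&l| l / temperature)
        .enumerate()
        .max_by(|(_, a), (_, b)| a.total_cmp(b))
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let logits = [1.2_f32, 3.4, 0.7, 2.9];
    let filtered = top_k_mask(&logits, 2); // keeps indices 1 and 3
    assert_eq!(greedy(&filtered, 0.8), 1); // index of the largest logit
}
```

Nucleus (top-p) sampling and repetition penalty slot in between the mask and the final draw in the same way: each stage transforms the logit vector before the next.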
SRP microcrates (bitnet-logits, bitnet-sampling, bitnet-generation, bitnet-engine-core, bitnet-device-probe, bitnet-gguf, bitnet-prompt-templates, bitnet-receipts) keep coupling low and are re-exported from their original locations for zero breaking changes.
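The zero-breaking-change re-export pattern can be shown with plain modules standing in for the real crates (module and function names here are hypothetical):

```rust
// Facade/re-export sketch: functionality lives in a focused module
// (standing in for e.g. bitnet-logits), and the original location
// re-exports it so existing callers see no breaking change.
mod logits {
    pub fn apply_temperature(logits: &mut [f32], t: f32) {
        for l in logits.iter_mut() {
            *l /= t;
        }
    }
}

mod inference {
    // The old path keeps working via a re-export.
    pub use crate::logits::apply_temperature;
}

fn main() {
    let mut logits = [2.0_f32, 4.0];
    // Callers still use the original path:
    inference::apply_temperature(&mut logits, 2.0);
    assert_eq!(logits, [1.0, 2.0]);
}
```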
- `bitnet-opencl` — Intel GPU compute via OpenCL 3.0
- `bitnet-vulkan` — Cross-vendor Vulkan compute
- `bitnet-wgpu` / `bitnet-wgpu-runner` — WebGPU/WGSL compute shaders
- `bitnet-rocm` — AMD ROCm/HIP backend
- `bitnet-metal` — Apple Metal compute
- `bitnet-gpu-hal` — Unified Hardware Abstraction Layer (includes Level Zero backend module)
Organised by Diátaxis:
| Section | Contents |
|---|---|
| Tutorials | Getting started, first inference, tokenizer discovery |
| How-to | Install, run inference, export GGUF, cross-validate, validate models |
| Explanation | Architecture, quantization formats, dual-backend cross-val, feature flags |
| Reference | CLI flags, environment variables, API, quantization support |
Key guides: Quickstart · SLM models · Environment variables · GPU setup · Intel GPU setup · C++ cross-validation · Quantization support · Validation gates · Honest-compute receipts · QK256 usage · macOS 26 Apple Silicon roadmap
```bash
cargo build --no-default-features --features cpu   # CPU (development)
cargo build --no-default-features --features gpu   # GPU (requires CUDA 12.x)

RUSTFLAGS="-C target-cpu=native -C opt-level=3 -C lto=thin" \
  cargo build --release --no-default-features --features cpu,full-cli   # optimised release

# Nix (reproducible, identical to CI)
nix develop && nix build .#bitnet-cli && nix flake check
```

| Flag | Purpose |
|---|---|
| `cpu` | SIMD-optimised CPU inference (AVX2 / AVX-512 / NEON) |
| `gpu` | Umbrella GPU feature — enables all compiled GPU backends |
| `cuda` | CUDA acceleration (preferred; requires CUDA 12.x); backward-compat alias for `gpu` |
| `metal` | Metal GPU backend (macOS/iOS Apple Silicon) |
| `vulkan` | Vulkan compute backend (cross-platform) |
| `ffi` | C++ FFI bridge for cross-validation |
| `fixtures` | GGUF fixture-based integration tests (test-only) |
| `full-cli` | Enable all CLI subcommands |
| `rocm` | AMD ROCm/HIP inference backend (experimental; kernels not yet validated end-to-end) |
| `npu` | NPU detection via `bitnet-device-probe` |
| `opencl` | Intel Arc OpenCL backend (experimental; `bitnet-opencl` crate) |
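These flags are consumed through `cfg` gates in Rust. A minimal, self-contained sketch of feature-gated backend selection (the function name is illustrative; only the predicate itself is the project's convention):

```rust
// Illustrative feature-gated backend selection. With neither the
// `gpu` nor the `cuda` feature enabled, only the CPU path compiles.
#[cfg(any(feature = "gpu", feature = "cuda"))]
fn backend_name() -> &'static str {
    "gpu"
}

#[cfg(not(any(feature = "gpu", feature = "cuda")))]
fn backend_name() -> &'static str {
    "cpu"
}

fn main() {
    // Built without GPU features, this prints "cpu".
    println!("{}", backend_name());
}
```

Pairing `gpu` and `cuda` in one predicate keeps the `cuda` backward-compat alias from silently diverging from the umbrella `gpu` feature.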
Always use the unified GPU predicate in Rust code:

```rust
#[cfg(any(feature = "gpu", feature = "cuda"))]
```

```bash
# Run all enabled tests (recommended — 5-minute timeout)
cargo nextest run --workspace --no-default-features --features cpu

# CI profile (4 threads, no retries)
cargo nextest run --profile ci

# Skip slow QK256 scalar-kernel tests
BITNET_SKIP_SLOW_TESTS=1 cargo nextest run --workspace --no-default-features --features cpu

# BDD compile-coverage check
cargo run -p xtask -- grid-check

# Fixture-based integration tests
cargo test -p bitnet-models --test qk256_dual_flavor_tests --no-default-features --features fixtures

# Lint before pushing
cargo fmt --all && cargo clippy --all-targets --no-default-features --features cpu -- -D warnings

# Quick local CI smoke test (replicates the 4 required CI gates)
./ci/local.sh
```

The suite has tens of thousands of tests spanning unit, property-based (proptest), snapshot (insta), fixture, fuzz (109 targets; 37 in the nightly CI matrix), and BDD grid categories. ~2,800+ tests are intentionally `#[ignore]`-d — TDD scaffolds, resource-gated tests, slow tests, and crossval tests. See the `#[ignore = "..."]` justification strings.
See docs/development/test-suite.md for full details.
See CONTRIBUTING.md. Issues and pull requests welcome.
Before opening a PR, run:
```bash
# Option 1: Quick local CI smoke test (recommended)
./ci/local.sh

# Option 2: Manual checks
cargo fmt --all && cargo clippy --all-targets --no-default-features --features cpu -- -D warnings
cargo nextest run --workspace --no-default-features --features cpu
```

Note: ~2,800+ tests are intentionally `#[ignore]`-d. This is expected — they are TDD scaffolds, resource-gated tests (model files, GPU hardware), slow tests, and crossval tests. See the `#[ignore = "..."]` justification strings.
Dual-licensed under MIT and Apache 2.0.