bitnet-rs


BitNet-rs is a high-performance Rust inference engine for 1-bit BitNet LLMs.

Features

  • SIMD/CUDA/Metal/Vulkan kernels — AVX2/AVX-512/NEON on CPU; CUDA (gpu), Metal (metal, macOS), Vulkan (vulkan), Intel Arc OpenCL (opencl) GPU backends
  • Multiple quantization formats — I2_S BitNet32-F16, I2_S QK256 (GGML 256-element blocks), TL1, TL2, IQ2_S via FFI
  • Cross-validation — per-token cosine-similarity comparison against Microsoft's C++ reference (>0.99)
  • Honest-compute receipts — schema v1.0.0 with 8 validation gates; compute_path must be "real"
  • Chat templates — 59+ template variants (LLaMA-3, Phi-4, Qwen, Gemma, Mistral, DeepSeek, and more); auto-detected from GGUF metadata or tokenizer path
  • SLM model support — load and run Phi-4, Qwen, Gemma, Mistral, LLaMA, and SmolLM2 via SafeTensors (quickstart guide)
  • SafeTensors → GGUF export — bitnet-st2gguf preserves F16 LayerNorm weights

v0.2.1-dev (pre-alpha): QK256 uses scalar kernels (~0.1 tok/s on 2B models); use --max-tokens 4–16 for validation. AVX2 dequantization is merged; ≥3× uplift planned. Significant correctness, performance, and validation work remains.
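The cross-validation feature above compares per-token logits against the C++ reference using cosine similarity with a > 0.99 acceptance threshold. A minimal sketch of that metric (illustrative only, not the crate's API):

```rust
// Cosine similarity between two logit vectors, as used conceptually by the
// cross-validation gate. Returns a value in [-1, 1]; identical directions → 1.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0 // degenerate vectors carry no direction to compare
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    let a = [1.0f32, 2.0, 3.0];
    // A vector compared with itself scores 1.0 (within float tolerance).
    assert!((cosine_similarity(&a, &a) - 1.0).abs() < 1e-6);
}
```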

Quick Start

# 1. Download a model
cargo run -p xtask -- download-model --id microsoft/bitnet-b1.58-2B-4T-gguf

# 2. Run inference  (always specify --no-default-features --features cpu|gpu)
RUST_LOG=warn cargo run -p bitnet-cli --no-default-features --features cpu,full-cli -- run \
  --model  models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
  --tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.json \
  --prompt "What is 2+2?" --max-tokens 8

# 3. Interactive chat
RUST_LOG=warn cargo run -p bitnet-cli --no-default-features --features cpu,full-cli -- chat \
  --model  models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
  --tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.json

Workspace default features are empty — always pass --no-default-features --features cpu (or gpu). bitnet-cli defaults to cpu,full-cli when built standalone.

Status

| Feature | Status |
| --- | --- |
| CPU inference — I2_S BitNet32 | Production path; 10–20× faster than QK256 scalar |
| CPU inference — I2_S QK256 | Scalar kernels (~0.1 tok/s on 2B); AVX2 foundation merged |
| GPU inference — CUDA | 🔶 Scaffolded; receipt validation pending |
| GPU inference — Metal | 🧪 Feature gate + kernel stubs; not validated end-to-end |
| GPU inference — Vulkan | 🧪 Runtime probing compiled; not validated end-to-end |
| GPU inference — Intel oneAPI | 🧪 Intel CPU/GPU feature gate; not validated end-to-end |
| AMD ROCm detection | 🧪 Device detection only; inference kernels not yet validated |
| GPU HAL — multi-backend | 🔧 bitnet-gpu-hal: OpenCL, Vulkan, Metal, ROCm backends; ~780 tests (scaffold; CPU-only validation) |
| Interactive chat (REPL) | /help, /clear, /metrics, auto-template detection |
| Cross-validation vs C++ | Cosine similarity > 0.99, per-token comparison |
| Honest-compute receipts | Schema v1.0.0, 8 validation gates |
| Strict mode | Runtime guards prevent mock fallback |
| SafeTensors → GGUF export | bitnet-st2gguf with F16 LayerNorm preservation |
| Server / HTTP API | 🚧 Health endpoints wired; inference endpoints have TODOs |

GPU Multi-Backend Support

BitNet-rs supports inference on multiple GPU platforms:

| Backend | Feature flag | Status | Hardware |
| --- | --- | --- | --- |
| NVIDIA CUDA | `--features gpu` | 🔶 Alpha | GeForce/Tesla/A100+ |
| Intel Arc (OpenCL) | `--features opencl` | 🧪 Experimental | Arc A770/A750 |
| AMD ROCm | `--features rocm` | 🧪 Scaffold | Unvalidated target: RDNA3-class AMD GPUs |
| Vulkan | `--features vulkan` | 🧪 Scaffold | Any Vulkan 1.3 GPU |
| Apple Metal | `--features metal` | 🧪 Scaffold | M1/M2/M3+ |
| WebGPU | N/A (sub-crate only) | 🧪 Experimental | Browser/wgpu (bitnet-wgpu) |
| CPU (SIMD) | `--features cpu` | ✅ Production | x86-64/ARM64 |

Quick Start (Intel Arc)

# Install Intel compute runtime (Ubuntu)
sudo apt install intel-opencl-icd clinfo

# Build with Intel GPU support
cargo build --release --no-default-features --features opencl,full-cli

# Run inference
cargo run --release --no-default-features --features opencl,full-cli -- run \
  --model models/model.gguf --device opencl --prompt "Hello" --max-tokens 32

See docs/INTEL_GPU_SETUP.md for detailed setup instructions.

Device Selection

--device auto     # Auto-detect best available (default)
--device cpu      # Force CPU
--device cuda     # Force NVIDIA CUDA
--device opencl   # Force Intel OpenCL
--device vulkan   # Force Vulkan

Architecture

Data flows top-to-bottom through the workspace:

bitnet-tokenizers ──────────────────────────────────────┐
                                                         │
bitnet-models  (GGUF loader, dual I2_S flavor detection) │
  └── bitnet-quantization  (I2_S / TL1 / TL2 / IQ2_S)  │
        └── bitnet-kernels (AVX2 / AVX-512 / NEON / CUDA)│
                                                         ▼
                        bitnet-inference  (autoregressive engine)
                          ├── bitnet-logits       (temperature / top-k / top-p)
                          ├── bitnet-sampling     (greedy, nucleus, repetition penalty)
                          ├── bitnet-generation   (decode loop, stop criteria)
                          ├── bitnet-prompt-templates  (59+ template variants; auto-detection)
                          └── bitnet-receipts     (honest-compute receipt schema)
                                                         │
                                          ┌──────────────┴──────────────┐
                                     bitnet-cli                  bitnet-server

SRP microcrates (bitnet-logits, bitnet-sampling, bitnet-generation, bitnet-engine-core, bitnet-device-probe, bitnet-gguf, bitnet-prompt-templates, bitnet-receipts) keep coupling low and are re-exported from their original locations for zero breaking changes.
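The re-export pattern above can be sketched as follows (module and function names are hypothetical, for illustration only): code extracted into a microcrate stays reachable under its old path via `pub use`, so downstream `use` statements keep compiling unchanged.

```rust
// Stand-in for code that, in the real workspace, would live in a microcrate
// such as bitnet-sampling. Names here are illustrative, not the crate's API.
mod extracted {
    /// Greedy sampling: index of the largest logit.
    pub fn greedy_pick(logits: &[f32]) -> usize {
        logits
            .iter()
            .enumerate()
            .max_by(|(_, a), (_, b)| a.total_cmp(b))
            .map(|(i, _)| i)
            .unwrap_or(0)
    }
}

// Re-export from the original location: callers still write
// `sampling::greedy_pick` exactly as before the extraction.
pub mod sampling {
    pub use super::extracted::greedy_pick;
}

fn main() {
    // The old path resolves to the extracted implementation.
    assert_eq!(sampling::greedy_pick(&[0.1, 2.0, 0.5]), 1);
}
```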

GPU Backend Crates

  • bitnet-opencl — Intel GPU compute via OpenCL 3.0
  • bitnet-vulkan — Cross-vendor Vulkan compute
  • bitnet-wgpu / bitnet-wgpu-runner — WebGPU/WGSL compute shaders
  • bitnet-rocm — AMD ROCm/HIP backend
  • bitnet-metal — Apple Metal compute
  • bitnet-gpu-hal — Unified Hardware Abstraction Layer (includes Level Zero backend module)

Documentation

Organised by Diátaxis:

| Section | Contents |
| --- | --- |
| Tutorials | Getting started, first inference, tokenizer discovery |
| How-to | Install, run inference, export GGUF, cross-validate, validate models |
| Explanation | Architecture, quantization formats, dual-backend cross-validation, feature flags |
| Reference | CLI flags, environment variables, API, quantization support |

Key guides: Quickstart · SLM models · Environment variables · GPU setup · Intel GPU setup · C++ cross-validation · Quantization support · Validation gates · Honest-compute receipts · QK256 usage · macOS 26 Apple Silicon roadmap

Building

cargo build --no-default-features --features cpu           # CPU (development)
cargo build --no-default-features --features gpu           # GPU (requires CUDA 12.x)
RUSTFLAGS="-C target-cpu=native -C opt-level=3 -C lto=thin" \
  cargo build --release --no-default-features --features cpu,full-cli  # optimised release

# Nix (reproducible, identical to CI)
nix develop && nix build .#bitnet-cli && nix flake check

Feature flags

| Flag | Purpose |
| --- | --- |
| `cpu` | SIMD-optimised CPU inference (AVX2 / AVX-512 / NEON) |
| `gpu` | Umbrella GPU feature — enables all compiled GPU backends |
| `cuda` | CUDA acceleration (preferred; requires CUDA 12.x); backward-compat alias for `gpu` |
| `metal` | Metal GPU backend (macOS/iOS Apple Silicon) |
| `vulkan` | Vulkan compute backend (cross-platform) |
| `ffi` | C++ FFI bridge for cross-validation |
| `fixtures` | GGUF fixture-based integration tests (test-only) |
| `full-cli` | Enable all CLI subcommands |
| `rocm` | AMD ROCm/HIP inference backend (experimental; kernels not yet validated end-to-end) |
| `npu` | NPU detection via `bitnet-device-probe` |
| `opencl` | Intel Arc OpenCL backend (experimental; `bitnet-opencl` crate) |

Always use the unified GPU predicate in Rust code:

#[cfg(any(feature = "gpu", feature = "cuda"))]
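As an illustrative sketch (function names are hypothetical), the predicate typically gates the GPU path while a mirrored `cfg(not(...))` supplies the CPU fallback, so exactly one definition is compiled in any feature combination:

```rust
// Hypothetical function names, for illustration only.
// GPU path: compiled in when either `gpu` or `cuda` is enabled.
#[cfg(any(feature = "gpu", feature = "cuda"))]
fn matmul_backend() -> &'static str {
    "gpu"
}

// CPU fallback: the mirrored predicate keeps exactly one definition active.
#[cfg(not(any(feature = "gpu", feature = "cuda")))]
fn matmul_backend() -> &'static str {
    "cpu"
}

fn main() {
    // Without either feature enabled, the CPU fallback is selected.
    println!("backend: {}", matmul_backend());
}
```

Gating on `gpu` alone would silently drop the GPU path for builds that enable only the `cuda` alias, which is why the unified predicate is required.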

Testing

# Run all enabled tests (recommended — 5-minute timeout)
cargo nextest run --workspace --no-default-features --features cpu

# CI profile (4 threads, no retries)
cargo nextest run --profile ci

# Skip slow QK256 scalar-kernel tests
BITNET_SKIP_SLOW_TESTS=1 cargo nextest run --workspace --no-default-features --features cpu

# BDD compile-coverage check
cargo run -p xtask -- grid-check

# Fixture-based integration tests
cargo test -p bitnet-models --test qk256_dual_flavor_tests --no-default-features --features fixtures

# Lint before pushing
cargo fmt --all && cargo clippy --all-targets --no-default-features --features cpu -- -D warnings

# Quick local CI smoke test (replicates the 4 required CI gates)
./ci/local.sh

The suite has tens of thousands of tests spanning unit, property-based (proptest), snapshot (insta), fixture, fuzz (109 targets; 37 in the nightly CI matrix), and BDD grid categories. Around 2,800 tests are intentionally #[ignore]-d — TDD scaffolds, resource-gated tests, slow tests, and crossval tests. See the #[ignore = "..."] justification strings.

See docs/development/test-suite.md for full details.

Contributing

See CONTRIBUTING.md. Issues and pull requests welcome.

Before opening a PR, run:

# Option 1: Quick local CI smoke test (recommended)
./ci/local.sh

# Option 2: Manual checks
cargo fmt --all && cargo clippy --all-targets --no-default-features --features cpu -- -D warnings
cargo nextest run --workspace --no-default-features --features cpu

Note: around 2,800 tests are intentionally #[ignore]-d. This is expected — they are TDD scaffolds, resource-gated tests (model files, GPU hardware), slow tests, and crossval tests. See the #[ignore = "..."] justification strings.

License

Dual-licensed under MIT and Apache 2.0.

About

Rust inference engine for 1-bit BitNet LLMs (GGUF + llama.cpp compatible).
