A modular Rust inference runtime for Llama-family models. Built on oxidizedMLX for Metal/CPU acceleration, with integrations for oxidizedRAG and oxidizedgraph.
llama.rs uses a "narrow waist" design: the `llama-engine` crate defines the core `LlamaEngine` trait that all other crates depend on. Implementations can swap CPU/Metal/FFI backends without changing application code.
See docs/ARCHITECTURE.md for the full design.
| Crate | Description |
|---|---|
| `llama-engine` | Narrow-waist engine trait and core types |
| `llama-tokenizer` | Deterministic text-to-token conversion |
| `llama-models` | Model architectures (Llama/Qwen/Mistral) |
| `llama-runtime` | Backend selection and execution (oxidizedMLX) |
| `llama-sampling` | Sampling strategies (temperature, top-k/p) |
| `llama-kv` | KV cache management and paging |
- `docs/ROADMAP.md` — Epic, user stories, and milestones (tracks #1, #2)
- `docs/PROGRESS_REPORT.md` — Progress: milestones A–E, Milestone A checklist with PRs, Epic 1–5 status
- `docs/MILESTONE_A.md` — Milestone A "Hello Inference" checklist and next step (tiny model forward pass)
- `docs/ARCHITECTURE.md` — Modular architecture, crate boundaries, invariants
- `docs/TEST_STRATEGY.md` — TDD plan: unit/property/golden/parity/perf testing
- `.github/LABELS.md` — Recommended `gh` labels (cursor, llama.rs, docs, roadmap, priority, rag); create with `./scripts/create-labels.sh`
```sh
# Install just (task runner)
cargo install just

# Run all checks
just ci

# Individual commands
just fmt    # format code
just clippy # lint
just test   # run tests
just check  # type-check
```

- Start here: CONTRIBUTING.md
- Agent/operator instructions: CODEX.md
MIT