stevedores-org/llama.rs

llama.rs

A modular Rust inference runtime for Llama-family models. Built on oxidizedMLX for Metal/CPU acceleration, with integrations for oxidizedRAG and oxidizedgraph.

Architecture

llama.rs uses a "narrow waist" design: the llama-engine crate defines the core LlamaEngine trait that all other crates depend on. Implementations can swap CPU/Metal/FFI backends without changing application code.

See docs/ARCHITECTURE.md for the full design.
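To illustrate the narrow-waist idea, here is a minimal sketch of a trait-based engine boundary. The trait body, the `StubCpuEngine` type, and the `next_token_logits` helper are hypothetical names invented for this example; the real `LlamaEngine` trait in the `llama-engine` crate may differ.

```rust
// Hypothetical sketch of a narrow-waist engine trait; the actual
// `LlamaEngine` trait in `llama-engine` may look different.
pub trait LlamaEngine {
    /// Run one forward pass over `tokens`, returning logits for the next token.
    fn forward(&mut self, tokens: &[u32]) -> Vec<f32>;
}

/// Stub CPU backend used only to illustrate backend swapping.
pub struct StubCpuEngine;

impl LlamaEngine for StubCpuEngine {
    fn forward(&mut self, tokens: &[u32]) -> Vec<f32> {
        // Placeholder: echo the token count as a single "logit".
        vec![tokens.len() as f32]
    }
}

/// Application code depends only on the trait object, so a Metal or
/// FFI backend can be substituted without changing this function.
pub fn next_token_logits(engine: &mut dyn LlamaEngine, tokens: &[u32]) -> Vec<f32> {
    engine.forward(tokens)
}

fn main() {
    let mut engine = StubCpuEngine;
    let logits = next_token_logits(&mut engine, &[1, 2, 3]);
    println!("{:?}", logits); // [3.0]
}
```

Because callers hold only a `dyn LlamaEngine` (or a generic bound on the trait), swapping the CPU stub for a Metal-backed implementation is a one-line change at construction time.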

Workspace Crates

| Crate | Description |
| --- | --- |
| llama-engine | Narrow-waist engine trait and core types |
| llama-tokenizer | Deterministic text-to-token conversion |
| llama-models | Model architectures (Llama/Qwen/Mistral) |
| llama-runtime | Backend selection and execution (oxidizedMLX) |
| llama-sampling | Sampling strategies (temperature, top-k/p) |
| llama-kv | KV cache management and paging |

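As a rough illustration of the kind of strategy `llama-sampling` covers, the sketch below applies temperature scaling and a top-k cutoff to a logit vector. The function name and signature are invented for this example and are not the crate's actual API; a real sampler would also draw randomly from the top-k candidates rather than picking greedily.

```rust
// Illustrative temperature + top-k sampling step; not the actual
// `llama-sampling` API.
fn sample_top_k(logits: &[f32], temperature: f32, k: usize) -> usize {
    // Scale logits by temperature, then keep the k largest.
    let mut scaled: Vec<(usize, f32)> = logits
        .iter()
        .map(|&l| l / temperature)
        .enumerate()
        .collect();
    scaled.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scaled.truncate(k);
    // Greedy pick for determinism in this sketch; a real sampler
    // would sample from the renormalized top-k distribution.
    scaled[0].0
}

fn main() {
    let logits = [0.1_f32, 2.0, 0.5, 1.5];
    println!("{}", sample_top_k(&logits, 0.8, 2)); // prints 1
}
```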
Planning Docs

Development

# Install just (task runner)
cargo install just

# Run all checks
just ci

# Individual commands
just fmt        # format code
just clippy     # lint
just test       # run tests
just check      # type-check

Contributing

License

MIT
