an LLM inference engine that runs on consumer hardware.
- each request has its own state, and the KV cache can be enabled or disabled per request (see the sketch after the quickstart).
- supports continuous batching: the decode step is batched to max out the GPU, and a request is recycled out of the active batch as soon as it completes so new requests can enter (a sketch of the loop follows this list).
- the prefill step is batched with a smaller batch size to avoid out-of-memory errors (OOMs).
- vectorized operations to improve batch performance.
- prefix caching to skip expensive prefill computation when requests share system prompts, few-shot examples, etc. (sketched after the benchmark table).
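
To make the batching story concrete, here is a minimal sketch of a continuous batching loop with chunked prefill. It is illustrative only; `model.prefill`, `model.decode_step`, and `request.is_finished` are hypothetical stand-ins, not tachyon's actual API:

```python
from collections import deque

def continuous_batching_loop(model, waiting: deque, max_batch_size=50, prefill_batch_size=8):
    """Illustrative scheduler: decode the whole active batch each step,
    recycle finished requests immediately, admit new ones as slots free up."""
    active = []
    while waiting or active:
        # admit new requests, prefilling in smaller chunks to avoid OOMs
        while waiting and len(active) < max_batch_size:
            n = min(prefill_batch_size, max_batch_size - len(active), len(waiting))
            chunk = [waiting.popleft() for _ in range(n)]
            model.prefill(chunk)      # hypothetical: fills each request's KV cache
            active.extend(chunk)
        # one batched decode step across every active request
        model.decode_step(active)     # hypothetical: appends one token per request
        # recycle completed requests so new ones can enter on the next iteration
        active = [r for r in active if not r.is_finished]
```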
- 3 lines to invoke the engine and run inference!
- first, download the weights for Llama 3.2 1B Instruct:
```python
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",
    filename="model.safetensors",
    local_dir="Llama-3.2-1B-Instruct",
)
```
- invoke the engine:
```python
from tachyon.engine.llm import Engine

engine = Engine("meta-llama/Llama-3.2-1B-Instruct")

# single request
print(engine.generate_text("Explain AGI"))

# multiple requests
outputs = engine.generate_text([
    "Explain AGI",
    "What is vLLM?",
    "Tell me about SGLang",
])
for o in outputs:
    print(o)
```
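
The per-request state mentioned in the feature list means the KV cache can be toggled per call. A hedged sketch of what that could look like; the `use_kv_cache` keyword here is an assumed name for illustration, not a confirmed parameter of `generate_text`:

```python
# hypothetical keyword; check the Engine API for the exact flag name
out = engine.generate_text("Explain AGI", use_kv_cache=False)
```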
You can also spin up the server and invoke the model with the `openai` client library:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="anything",  # placeholder; the client requires the field
)

resp = client.chat.completions.create(
    model="llama-3.2-1b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=False,
)
print(resp.choices[0].message.content)
```
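
Streaming works with the same client via standard `openai` usage (assuming the server implements the streaming side of the chat completions API):

```python
stream = client.chat.completions.create(
    model="llama-3.2-1b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    # each chunk carries a token-by-token text delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```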
- run the benchmark script:
```bash
python3 benchmark.py
```
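
Throughput in the last column below is simply tokens generated divided by wall-clock time taken.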
| implementation | tokens generated | time taken (s) | throughput (tok/s) |
|---|---|---|---|
| naive torch | 3031 | 233.171 | 13.00 |
| naive torch with kv cache | 3200 | 37.771 | 84.72 |
| static batching | 31309 | 369.081 | 84.83 |
| continuous batching (bs=10) | 30600 | 111.657 | 274.05 |
| continuous batching (bs=30) | 29000 | 89.755 | 323.10 |
| continuous batching (bs=50) | 29800 | 87.442 | 340.80 |
| continuous batching (bs=50) with vectorized ops and batched prefill | 30500 | 71.731 | 425.20 |
| prefix caching with similar requests | 36100 | 54.469 | 662.76 |
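
The jump in the last row comes from prefix caching: KV states for a shared prompt prefix are computed once and reused, so only the unshared suffix pays the prefill cost. A minimal sketch of the idea (illustrative, not the engine's implementation):

```python
class PrefixCache:
    """Illustrative prefix cache: maps token-ID prefixes to precomputed
    KV states so shared system prompts / few-shot examples prefill once."""

    def __init__(self):
        self._cache = {}  # tuple of token IDs -> KV state

    def lookup(self, token_ids):
        # return the KV state for the longest cached prefix and its length,
        # so the caller only needs to prefill the unshared suffix
        for end in range(len(token_ids), 0, -1):
            kv = self._cache.get(tuple(token_ids[:end]))
            if kv is not None:
                return kv, end
        return None, 0

    def store(self, token_ids, kv_state):
        self._cache[tuple(token_ids)] = kv_state
```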
roadmap -
- implement the Llama 3 family of models
- turn it into a serving engine
- write a benchmark script and measure latency and throughput
- add KV cache
- continuous batching
- prefix caching
- add an OpenAI-compatible API server
- test the effect of torch.compile (open PR, come back to this)
- paged attention
- more techniques
