an LLM inference engine that runs on consumer hardware.
- each request has its own state, and the KV cache can be enabled or disabled per request (see the sketch after the quickstart).
- supports continuous batching: the decode step is batched to max out the GPU, and a request is recycled out of the active batch as soon as it completes so new requests can enter (a sketch of the loop follows this list).
- the prefill step is batched with a smaller batch size to avoid out-of-memory errors (OOMs).
- vectorized operations to improve batch performance.
- prefix caching to skip expensive prefill computation when requests share system prompts, few-shot examples, etc. (sketched after the benchmark table).
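
To make the batching story concrete, here is a minimal sketch of a continuous batching loop with chunked prefill. It is illustrative only; `model.prefill`, `model.decode_step`, and `request.is_finished` are hypothetical stand-ins, not tachyon's actual API:

```python
from collections import deque

def continuous_batching_loop(model, waiting: deque, max_batch_size=50, prefill_batch_size=8):
    """Illustrative scheduler: decode the whole active batch each step,
    recycle finished requests immediately, admit new ones as slots free up."""
    active = []
    while waiting or active:
        # admit new requests, prefilling in smaller chunks to avoid OOMs
        while waiting and len(active) < max_batch_size:
            n = min(prefill_batch_size, max_batch_size - len(active), len(waiting))
            chunk = [waiting.popleft() for _ in range(n)]
            model.prefill(chunk)      # hypothetical: fills each request's KV cache
            active.extend(chunk)
        # one batched decode step across every active request
        model.decode_step(active)     # hypothetical: appends one token per request
        # recycle completed requests so new ones can enter on the next iteration
        active = [r for r in active if not r.is_finished]
```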
- 3 lines to invoke the engine and run inference!
- first, download the weights for Llama 3.2 1B Instruct:
```python
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="meta-llama/Llama-3.2-1B-Instruct",
    filename="model.safetensors",
    local_dir="Llama-3.2-1B-Instruct",
)
```
- invoke the engine:
```python
from tachyon.engine.llm import Engine

engine = Engine("meta-llama/Llama-3.2-1B-Instruct")

# single request
print(engine.generate_text("Explain AGI"))

# multiple requests
outputs = engine.generate_text([
    "Explain AGI",
    "What is vLLM?",
    "Tell me about SGLang",
])
for o in outputs:
    print(o)
```
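
The per-request state mentioned in the feature list means the KV cache can be toggled per call. A hedged sketch of what that could look like; the `use_kv_cache` keyword here is an assumed name for illustration, not a confirmed parameter of `generate_text`:

```python
# hypothetical keyword; check the Engine API for the exact flag name
out = engine.generate_text("Explain AGI", use_kv_cache=False)
```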
You can also spin up the server and invoke the model with the `openai` client library:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="anything",  # placeholder; the client requires the field
)

resp = client.chat.completions.create(
    model="llama-3.2-1b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=False,
)
print(resp.choices[0].message.content)
```
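
Streaming works with the same client via standard `openai` usage (assuming the server implements the streaming side of the chat completions API):

```python
stream = client.chat.completions.create(
    model="llama-3.2-1b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    stream=True,
)
for chunk in stream:
    # each chunk carries a token-by-token text delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```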
- run the benchmark script:
```bash
python3 benchmark.py
```
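
Throughput in the last column below is simply tokens generated divided by wall-clock time taken.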
| implementation | tokens generated | time taken (s) | throughput (tok/s) |
|---|---|---|---|
| naive torch | 3031 | 233.171 | 13.00 |
| naive torch with kv cache | 3200 | 37.771 | 84.72 |
| static batching | 31309 | 369.081 | 84.83 |
| continuous batching (bs=10) | 30600 | 111.657 | 274.05 |
| continuous batching (bs=30) | 29000 | 89.755 | 323.10 |
| continuous batching (bs=50) | 29800 | 87.442 | 340.80 |
| continuous batching (bs=50) with vectorized ops and batched prefill | 30500 | 71.731 | 425.20 |
| prefix caching with similar requests | 36100 | 54.469 | 662.76 |
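
The jump in the last row comes from prefix caching: KV states for a shared prompt prefix are computed once and reused, so only the unshared suffix pays the prefill cost. A minimal sketch of the idea (illustrative, not the engine's implementation):

```python
class PrefixCache:
    """Illustrative prefix cache: maps token-ID prefixes to precomputed
    KV states so shared system prompts / few-shot examples prefill once."""

    def __init__(self):
        self._cache = {}  # tuple of token IDs -> KV state

    def lookup(self, token_ids):
        # return the KV state for the longest cached prefix and its length,
        # so the caller only needs to prefill the unshared suffix
        for end in range(len(token_ids), 0, -1):
            kv = self._cache.get(tuple(token_ids[:end]))
            if kv is not None:
                return kv, end
        return None, 0

    def store(self, token_ids, kv_state):
        self._cache[tuple(token_ids)] = kv_state
```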
roadmap -
- implement the Llama 3 family of models
- turn it into a serving engine
- write a benchmark script and measure latency and throughput
- add KV cache
- continuous batching
- prefix caching
- add an OpenAI-compatible API server
- test the effect of torch.compile (open PR, come back to this)
- paged attention
- more techniques
