lightning-mlx

🔥 The fastest local AI engine for Apple Silicon. Optimized for agentic use.

It builds on Rapid-MLX with MTPLX-inspired work.

Raw Decode Benchmarks

The Lightning MLX (MTPLX) raw-decode rows were run with explicit max-performance benchmark settings:

| Model | mlx-lm | oMLX | Rapid MLX | Lightning MLX (MTPLX) |
|---|---|---|---|---|
| Qwen3.6-27B | 29.80 tok/s | 31.80 tok/s | 32.37 tok/s | 70.35 tok/s |
| Qwen3.6-35B | 110.37 tok/s | 114.59 tok/s | 106.00 tok/s | 226.01 tok/s |

Commands used:

lightning-mlx bench qwen3.6-27b \
--num-prompts 3 --max-tokens 512 --disable-prefix-cache \
--max-num-seqs 1 --prefill-batch-size 1 --completion-batch-size 1 \
--prefill-step-size 8192 --mtp-num-draft-tokens 3 --mtp-optimistic

lightning-mlx bench qwen3.6-35b \
--num-prompts 3 --max-tokens 512 --disable-prefix-cache \
--max-num-seqs 1 --prefill-batch-size 1 --completion-batch-size 1 \
--prefill-step-size 8192 --mtp-num-draft-tokens 3 --mtp-optimistic

Agentic Benchmarks

Prompt: "create the snake game using react and typescript"

| Model | Metric | oMLX | Rapid MLX | Lightning MLX (MTPLX) |
|---|---|---|---|---|
| Qwen3.6-27B | All Turns | 13.94 tok/s | 13.49 tok/s | 26.47 tok/s |
| Qwen3.6-27B | Long | 20.35 tok/s | 28.02 tok/s | 38.60 tok/s |
| Qwen3.6-27B | Short | 9.67 tok/s | 7.42 tok/s | 20.40 tok/s |
| Qwen3.6-35B | All Turns | 49.60 tok/s | 27.73 tok/s | 64.85 tok/s |
| Qwen3.6-35B | Long | 76.77 tok/s | 26.52 tok/s | 117.40 tok/s |
| Qwen3.6-35B | Short | 35.11 tok/s | 29.18 tok/s | 52.50 tok/s |

More Model Benchmarks

You can find more benchmarks (for non-optimized models) in Rapid-MLX.

N-gram Speculation (Qwen3.6-35B-A3B)

On the 35B-A3B preset, n-gram (prompt-lookup) drafting is layered on top of MTP for +18% throughput on mixed reasoning + tool-use workloads (vs. MTP-only). Auto-enabled for qwen3.6-35b and qwen3.6-35b-8bit.

Highlights:

  • <think>-aware and <tool_call>-aware state machines: drafts everywhere by default but skips inside <tool_call>...</tool_call> regions where structure repeats but content varies.
  • Adaptive K based on n-gram match confidence (prior occurrence count): wide drafts for strong matches, narrow drafts for weak ones.
  • Hybrid verify: append one MTP-head draft after the n-gram tail to gain extra ground when all n-gram drafts are accepted.
  • Self-tuning: a per-request running acceptance rate suppresses drafting on poor fits, and n-gram drafting is auto-disabled globally when MTP acceptance is already strong (≥0.85) and n-gram acceptance is weak (≤0.50). This guarantees no regression vs. the MTP-only baseline.

Tunable via --enable-ngram / --disable-ngram and --ngram-* flags on lightning-mlx serve and bench.
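
For intuition, here is a minimal Python sketch of the prompt-lookup idea behind n-gram drafting (illustrative only, not the lightning-mlx implementation; function and parameter names are made up): match the last n generated tokens against earlier context and propose the tokens that followed the most recent match, widening the draft when the match has been seen often.

# Illustrative prompt-lookup (n-gram) drafting sketch; not the lightning-mlx source.
def propose_ngram_draft(tokens, n=3, max_k=8, min_k=2):
    """Draft tokens by matching the last n tokens against earlier context.

    Adaptive K: the draft is widened when the matched n-gram has occurred
    often (strong match) and narrowed when it has only been seen once.
    """
    if len(tokens) <= n:
        return []
    key = tuple(tokens[-n:])
    occurrences = 0
    last_match_end = None
    for i in range(len(tokens) - n):          # excludes the trailing key itself
        if tuple(tokens[i:i + n]) == key:
            occurrences += 1
            last_match_end = i + n            # position right after the match
    if last_match_end is None:
        return []                             # no prior occurrence: nothing to draft
    k = max_k if occurrences >= 2 else min_k  # adaptive draft width
    return tokens[last_match_end:last_match_end + k]

The drafted tokens are then checked in a single verification pass: the accepted prefix is kept and decoding resumes normally at the first mismatch, which is why this kind of drafting speeds things up without changing the generated text.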

Install

python3 -m pip install git+https://github.com/samuelfaj/lightning-mlx.git

Local checkout:

git clone https://github.com/samuelfaj/lightning-mlx.git
cd lightning-mlx
python3 -m pip install -e .

Self-contained install:

curl -fsSL https://raw.githubusercontent.com/samuelfaj/lightning-mlx/main/install.sh | bash

Then verify:

lightning-mlx --help

Start Fast

Best optimized models:

lightning-mlx serve qwen3.6-27b
lightning-mlx serve qwen3.6-35b
lightning-mlx serve qwen3.6-35b-8bit

qwen3.6-35b-8bit mirrors the qwen3.6-35b preset (MTP, n-gram, port 8010, tool/reasoning parsers) but routes to the 8-bit MTPLX-optimized weights for higher quality on memory-rich Macs.

Local model path works too:

lightning-mlx serve /path/to/Qwen3.6-27B-MTPLX-Optimized-Speed

Run it without keeping a terminal open:

lightning-mlx serve mlx-community/Qwen3.5-4B-MLX-4bit --daemon

Daemon mode starts a detached supervisor, writes logs under ~/.lightning-mlx/logs/, and restarts the server if the model process exits unexpectedly.

lightning-mlx status
lightning-mlx tui <PID-or-model-name>
lightning-mlx kill <PID-or-model-name>

Use status to see running daemons, tui to attach the live monitor, and kill to stop by supervisor PID, server PID, alias, or model name.

Convert Local MTPLX Models

Use convert-mtplx to package a local model with an MTPLX MTP sidecar. This is useful when the serving model is quantized, but the MTP tensors come from the original full model:

lightning-mlx convert-mtplx \
  /path/to/Qwen3.6-35B-A3B-4bit \
  --mtp-source /path/to/Qwen3.6-35B-A3B

By default, the output is written next to the source model as:

/path/to/Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed

Then serve it normally:

lightning-mlx serve /path/to/Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed

For Qwen3.6-35B-A3B, the default temperature is 0.6. Thinking stays enabled by default for agentic tool use. Pass --no-thinking only when you explicitly want to disable it.
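
For example, to serve the converted model with thinking disabled (assuming --no-thinking is accepted by serve, as implied above):

lightning-mlx serve /path/to/Qwen3.6-35B-A3B-4bit-MTPLX-Optimized-Speed --no-thinking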

Use It Like OpenAI

curl http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [
      {"role": "user", "content": "Write a tiny Python HTTP server."}
    ],
    "stream": true
  }'
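
Because the server speaks the OpenAI API, the official openai Python SDK works as well. A minimal streaming sketch (the port and the "local" model name mirror the curl example above; the api_key value is a dummy placeholder, adjust if your setup requires a real key):

# Minimal sketch using the OpenAI Python SDK against the local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8010/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Write a tiny Python HTTP server."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()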

What You Get

  • 2.75x faster short agentic turns in the benchmark fixture.
  • 1.96x higher all-turn throughput versus the MLX baseline.
  • +18% throughput on Qwen3.6-35B-A3B with n-gram + MTP stacked speculation.
  • Successful artifact generation where baseline timed out.
  • OpenAI-compatible API for local tools, agents, editors, and CLIs.
  • Apple Silicon first: built around MLX and local Mac inference.
  • MTPLX optimized preset behind one simple command.

Built On

MLX and Rapid-MLX, extended with MTPLX-inspired multi-token prediction (MTP).
