SLM Server

Unified LLM server with model-ID-based routing. Designed for Apple Silicon (M1/M2/M3/M4) using MLX and llama.cpp backends.

Architecture

Client Request (port 8000)
    ↓
Routing Service (FastAPI, port 8000)
    ↓ (reads model ID from request body)
Backend Model Servers (MLX/llama.cpp on ports 8501, 8502, ...)
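
The routing step in the diagram can be pictured as a lookup from the request's model field to a backend port. The snippet below is a minimal sketch of that idea, not the project's actual router code: the `BACKEND_PORTS` table, the `resolve_backend_url` helper, and the pairing of model IDs to ports are all assumptions made for illustration.

```python
# Hypothetical routing table: model ID -> backend port.
# In the real service this mapping would come from config/models.yaml.
BACKEND_PORTS = {
    "qwen/qwen3-4b-2507": 8501,
    "Qwen/Qwen3-Embedding-0.6B": 8502,
}


def resolve_backend_url(body: dict, path: str, host: str = "127.0.0.1") -> str:
    """Return the backend URL for a request, keyed on its "model" field."""
    model_id = body.get("model")
    if model_id not in BACKEND_PORTS:
        # An unknown or missing model ID cannot be routed anywhere.
        raise KeyError(f"unknown model id: {model_id!r}")
    return f"http://{host}:{BACKEND_PORTS[model_id]}{path}"
```

For example, a chat request for `qwen/qwen3-4b-2507` would be forwarded to port 8501 under this table, while an unknown ID fails before any backend is contacted.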

Requirements

macOS with Apple Silicon — required for MLX backend; llama.cpp works cross-platform
Python 3.12+
uv package manager
For llama.cpp: llama-server on PATH (e.g. brew install llama.cpp) — required for rerank and Qwen3.5/newer architectures

Quick Start

Install dependencies:

```
git clone <repository-url>
cd slm_server
uv sync --extra mlx        # For MLX backend
uv sync --extra llamacpp   # For llama.cpp backend (Python server fallback)
```

Configure models:

```
cp config/models.yaml.example config/models.yaml
# Edit config/models.yaml with your model paths
```

Start all services:

```
./start.sh
```

Or start individually:

```
# Terminal 1: Start backend servers
uv run python -m slm_server backends

# Terminal 2: Start routing service
uv run python -m slm_server router
```

For detailed setup, see SETUP.md.

Configuration

Copy config/models.yaml.example to config/models.yaml and set your model paths. Each model entry maps a role name to a server instance.

All Configuration Fields

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| `id` | yes | - | Model identifier used for routing (must match the `model` field in requests) |
| `backend` | yes | - | `mlx` or `llamacpp` |
| `port` | yes | - | Port for this model's backend server (must be unique) |
| `model_path` | yes | - | Local path to a model file or directory, or a Hugging Face model ID (HF IDs are MLX only) |
| `default_timeout` | yes | - | Request timeout in seconds |
| `quantization` | yes | - | Quantization level (e.g. `8bit`, `Q8_0`, `f16`); informational for MLX, affects KV cache defaults for llamacpp |
| `model_type` | no | `lm` | `lm`, `multimodal`, `image-generation`, `image-edit`, `embeddings`, `rerank`, or `whisper` |
| `context_length` | no | model default | Maximum context length; omit to use the model's built-in default |
| `max_concurrency` | no | `1` | Maximum concurrent requests |
| `host` | no | `0.0.0.0` | Host the backend server binds to |
| `enabled` | no | `true` | Set to `false` to skip this model on startup |
| `supports_function_calling` | no | `false` | Reported in the `/v1/models` response |

MLX-only fields (passed to mlx-openai-server launch):

| Field | Default | Description |
| --- | --- | --- |
| `enable_auto_tool_choice` | `false` | Pass `--enable-auto-tool-choice` to mlx-openai-server |
| `tool_call_parser` | `null` | Parser for tool calls. Options: `qwen3`, `glm4_moe`, `qwen3_coder`, `qwen3_moe`, `qwen3_next`, `qwen3_vl`, `harmony`, `minimax_m2` |
| `reasoning_parser` | `null` | Parser for reasoning/thinking tokens. Options: `qwen3`, `glm4_moe`, `qwen3_moe`, `qwen3_next`, `qwen3_vl`, `harmony`, `minimax_m2` |
| `config_name` | `flux-schnell` / `flux-kontext-dev` | Config name for `image-generation` or `image-edit` model types |

llama.cpp-only fields (passed to llama-server or llama_cpp.server):

| Field | Default | Description |
| --- | --- | --- |
| `chat_template_kwargs` | `null` | Dict passed as `--chat-template-kwargs` (e.g. `{enable_thinking: true}` for Qwen3.5) |
| `temp` | - | Sampling temperature |
| `top_p` | - | Top-p sampling |
| `top_k` | - | Top-k sampling |
| `min_p` | - | Min-p sampling |
| `cache_type_k` | - | KV cache type for K (e.g. `q8_0`, `f16`) |
| `cache_type_v` | - | KV cache type for V (e.g. `q8_0`, `f16`) |
| `flash_attn` | - | Flash attention (`true` / `false`) |
| `kv_unified` | - | Unified KV cache (native `llama-server` only) |
| `fit` | - | `--fit` flag (native `llama-server` only) |

Model Path

Two formats are accepted:

Hugging Face model ID (MLX backend only): downloaded automatically on first use
```
model_path: "mlx-community/Qwen3-8B-MLX-8bit"
```
Local path: directory containing a .gguf (llamacpp) or model files (MLX), or a direct path to a .gguf file
```
model_path: "/path/to/models/Qwen3.5-9B-GGUF"
```

For llamacpp with a directory, the server picks the first .gguf file found (alphabetically). Hugging Face model IDs are not supported for llamacpp — use a local path.
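
The directory rule above (first .gguf file, alphabetically) could be implemented along the following lines. This is a sketch under assumptions: `resolve_gguf` is a hypothetical helper name, and the server's real sorting details (for example case sensitivity) may differ.

```python
from pathlib import Path


def resolve_gguf(model_path: str) -> Path:
    """Resolve a llamacpp model_path to a single .gguf file.

    A direct file path is returned unchanged; for a directory, the
    alphabetically first *.gguf file inside it is chosen.
    """
    path = Path(model_path)
    if path.is_file():
        return path
    # Sort by file name so the choice is deterministic.
    candidates = sorted(path.glob("*.gguf"), key=lambda p: p.name)
    if not candidates:
        raise FileNotFoundError(f"no .gguf file found under {model_path}")
    return candidates[0]
```

If a directory holds several quantizations of the same model, pointing model_path at the exact file avoids depending on the alphabetical order.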

API

The routing service exposes OpenAI-compatible endpoints on port 8000.

`POST /v1/chat/completions`

Standard chat completions. The model field in the request body selects the backend:

```
{
  "model": "qwen/qwen3-4b-2507",
  "messages": [{"role": "user", "content": "Hello"}]
}
```

The router also injects chat_template_kwargs from config into the request body if set and not already present.

`POST /v1/responses`

Responses API with automatic fallback. The router first tries /v1/responses on the backend. If the backend returns 404 or 422, it converts the request to /v1/chat/completions format and retries.
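
To make the fallback concrete, here is a rough sketch of how a Responses-style request might be mapped onto the chat completions format. It only handles the `input`, `instructions`, and `max_output_tokens` fields of the Responses API, and both helper names are hypothetical; the router's real conversion may cover more fields or differ in detail.

```python
# Backend status codes that trigger the fallback, per the description above.
FALLBACK_STATUSES = {404, 422}


def should_fall_back(status_code: int) -> bool:
    """True when the backend's /v1/responses reply calls for a retry."""
    return status_code in FALLBACK_STATUSES


def responses_to_chat(body: dict) -> dict:
    """Convert a minimal Responses API body to a chat completions body."""
    messages = []
    if body.get("instructions"):
        # Responses-style instructions play the role of a system prompt.
        messages.append({"role": "system", "content": body["instructions"]})
    user_input = body.get("input", "")
    if isinstance(user_input, str):
        messages.append({"role": "user", "content": user_input})
    else:
        # Assumes a list of message-like items with role and content keys.
        messages.extend(
            {"role": item["role"], "content": item["content"]}
            for item in user_input
        )
    chat = {"model": body["model"], "messages": messages}
    if "max_output_tokens" in body:
        chat["max_tokens"] = body["max_output_tokens"]
    return chat
```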

`POST /v1/embeddings`

OpenAI-compatible embeddings. Requires a model with model_type: embeddings. With backend: llamacpp, the backend is started with --embedding (native llama-server) or --embedding true (Python server).

```
{
  "model": "Qwen/Qwen3-Embedding-0.6B",
  "input": "Hello, world"
}
```

```
curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen3-Embedding-0.6B","input":"test"}' | jq
```

MLX embedding models are also supported: set backend: mlx and model_type: embeddings.

`POST /v1/rerank`

Reranking. Requires model_type: rerank, backend: llamacpp, and native llama-server on PATH. The backend is started with --embedding --pooling rank --reranking. The Python llama_cpp.server does not support rerank.

Request body follows the llama.cpp server rerank format (query + documents).
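
As an illustration of the query + documents format, the sketch below builds a rerank request body and reorders documents from a response. The response shape (a `results` list whose items carry `index` and `relevance_score`) and the optional `top_n` field reflect my understanding of the llama.cpp server and should be checked against the version you run; the helper names and the model ID in the usage note are hypothetical.

```python
def build_rerank_request(model, query, documents, top_n=None):
    """Build a rerank request body in the query + documents format."""
    body = {"model": model, "query": query, "documents": list(documents)}
    if top_n is not None:
        # Assumed optional field limiting how many results come back.
        body["top_n"] = top_n
    return body


def order_by_relevance(documents, results):
    """Reorder documents by descending relevance_score.

    `results` is assumed to be the response's "results" list, where each
    item has an "index" into `documents` and a "relevance_score".
    """
    ranked = sorted(results, key=lambda r: r["relevance_score"], reverse=True)
    return [documents[r["index"]] for r in ranked]
```

A caller would POST `build_rerank_request("my-reranker", "what is MLX?", docs)` to `/v1/rerank` on port 8000 and pass the returned results to `order_by_relevance`.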

`GET /v1/models`

Lists all configured models and their settings (id, backend, port, model_type, context_length, quantization, supports_function_calling).

`GET /v1/backends/health`

Health status of all configured backends:

```
{
  "standard": {
    "status": "healthy",
    "model_id": "qwen/qwen3-4b-2507",
    "backend": "mlx",
    "port": 8501
  },
  "reasoning": {
    "status": "unreachable",
    "error": "Connection refused - backend not running"
  }
}
```

Possible statuses: healthy, unreachable, timeout, unhealthy, error, disabled.
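
A monitoring script might reduce this response to just the backends that need attention. The sketch below assumes the response shape shown above and treats healthy and disabled as the only non-problem statuses; that grouping and the helper name `backends_needing_attention` are assumptions, not documented behavior.

```python
# Statuses that are assumed not to require operator attention.
OK_STATUSES = {"healthy", "disabled"}


def backends_needing_attention(health: dict) -> dict:
    """Map each problematic backend role to its error text, or its status."""
    return {
        role: info.get("error", info["status"])
        for role, info in health.items()
        if info["status"] not in OK_STATUSES
    }
```

Fed the example response above, this returns a single entry for the reasoning role with its connection error.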

`GET /health`

Router health check.

Backend Details

MLX

- Install: uv sync --extra mlx
- Requires mlx-openai-server command (installed via the extra)
- Accepts Hugging Face model IDs (auto-downloads) or local model directories
- Apple Silicon only

llama.cpp

- Install: uv sync --extra llamacpp (installs llama-cpp-python[server] as fallback)
- Native llama-server (e.g. brew install llama.cpp) is used automatically when found on PATH and is required for:
  - model_type: rerank
  - Models with newer architectures (Qwen3.5, etc.) not yet supported by the PyPI build
  - kv_unified and fit flags
- When native llama-server is not found, the server falls back to python -m llama_cpp.server
- Requires local .gguf files; Hugging Face model IDs are not supported
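
The native-versus-fallback choice described above can be sketched with `shutil.which`. This is an illustrative outline only: `llamacpp_command`, the `needs` parameter, and the feature labels are hypothetical, and the real launcher also has to append model and port arguments.

```python
import shutil
import sys

# Features the README lists as requiring the native llama-server binary.
NATIVE_ONLY_FEATURES = frozenset({"rerank", "kv_unified", "fit"})


def llamacpp_command(needs=frozenset(), which=shutil.which) -> list[str]:
    """Pick the base command for a llamacpp backend.

    Prefers native llama-server when it is on PATH; otherwise falls back to
    the Python server, unless a native-only feature was requested.
    """
    native = which("llama-server")
    if native:
        return [native]
    missing = set(needs) & NATIVE_ONLY_FEATURES
    if missing:
        raise RuntimeError(
            f"native llama-server required for: {', '.join(sorted(missing))}"
        )
    return [sys.executable, "-m", "llama_cpp.server"]
```

Passing `which` as a parameter keeps the lookup easy to replace in tests without touching PATH.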

Troubleshooting

Check backend health

```
curl http://localhost:8000/v1/backends/health | jq
```

Model not found

Verify the id in config/models.yaml matches the model field in your request exactly
Check that enabled is not set to false

Port already in use

```
lsof -i :8501
```

Each model must have a unique port. Config validation warns about port conflicts on startup.
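
A port-uniqueness check of the kind mentioned above might look like the sketch below. It assumes model entries have already been loaded from config/models.yaml into a list of dicts with `id` and `port` keys; `find_port_conflicts` is a hypothetical name, and the project's own validation may behave differently (for example, by ignoring disabled models).

```python
from collections import defaultdict


def find_port_conflicts(models: list[dict]) -> dict[int, list[str]]:
    """Return {port: [model ids]} for every port claimed by more than one model."""
    by_port = defaultdict(list)
    for model in models:
        by_port[model["port"]].append(model["id"])
    return {port: ids for port, ids in by_port.items() if len(ids) > 1}
```

An empty result means every configured model has its own port.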

Backend not starting

Check /v1/backends/health to see which backends are down
Ensure model paths are correct and files exist
For llamacpp: verify llama-server is on PATH (which llama-server)
Check logs for error messages

"unknown model architecture" error (llamacpp)

The PyPI build of llama-cpp-python may not support newer model architectures. Install native llama-server:

```
brew install llama.cpp
```

The server detects it on PATH and uses it automatically.
