OpenAI-compatible text-to-speech that runs entirely on your machine. Powered by Supertonic-3. Streams sentence-by-sentence. Talks to the OpenAI Python SDK, Pipecat, LiveKit Agents, OpenWebUI, or anything else that speaks the OpenAI TTS protocol — point it at
http://localhost:8000/v1and go. Or openhttp://localhost:8000/in your browser and use the built-in console.
| ⌂ 100% local | ⌫ Drop-in for OpenAI | ≋ ~6–10× real-time |
| No cloud, no API keys, no telemetry. Runs on CPU, Apple Silicon (CoreML), or NVIDIA (CUDA). | One base_url change and your existing OpenAI SDK / Pipecat / LiveKit code keeps working. |
~450 ms first-byte on an M4 Pro. Streams audio before synthesis finishes — perfect for voice agents. |
- Performance · How it compares
- Install · Quick start · Web console · Docker · CLI
- Endpoints · WebSocket TTS · Observability · Voices · Languages
- Use from: Python SDK · Pipecat · LiveKit
- Tuning · Troubleshooting · Limitations · License
Each row below is measured: same 8 mixed-length English utterances sent back-to-back to POST /v1/audio/speech (voice=alloy, response_format=pcm, default total_steps=8), aggregates read straight from GET /metrics/summary.
| Hardware | TTFB p50 | TTFB p95 | RTF p50 | RTF p95 | ≈ real-time |
|---|---|---|---|---|---|
| Apple M4 Pro · CoreML | 515 ms | 807 ms | 0.166 | 0.255 | ~6× |
| NVIDIA RTX 5090 · CUDA | 115 ms | 119 ms | 0.034 | 0.070 | ~30× |
Apple M4 Pro · CPU (Docker, --total-steps 4) |
~1.6 s | — | ~0.40 | — | ~2.5× |
- TTFB = wall-clock from request to first audio byte (lower is better; sub-second feels live for voice agents).
- RTF = synth wall-time ÷ audio duration (lower is better;
0.1means 10× faster than real-time). - The 5090 row's tight p50→p95 spread (only 4 ms) is from CUDA's predictable kernel times; the M4 Pro's CoreML EP shows a wider spread because CoreML partitions the graph and falls back some ops to CPU.
Other open-source TTS servers and the two cloud APIs voice-agent builders most often default to:
| Local | HTTP stream | WebSocket | Metrics | Languages | Quality | Cost | |
|---|---|---|---|---|---|---|---|
| supertonic-server (this) | ✅ | sentence | ✅ | /metrics + UI |
31 | high | free |
| Kokoro-FastAPI | ✅ | sentence | ❌ | ❌ | ~9 | high | free |
| openedai-speech (Piper) | ✅ | sentence | ❌ | ❌ | ~30 | mid | free |
| openedai-speech (XTTS) | ✅ | sentence | ❌ | ❌ | 17 | high | free |
| ElevenLabs API | ❌ | yes | ✅ | n/a (cloud) | 29+ | top | paid |
| OpenAI TTS API | ❌ | yes | Realtime API | n/a (cloud) | 57+ | high | paid |
"HTTP stream = sentence" means audio is emitted to the client as each sentence finishes synthesizing — what Pipecat and LiveKit consume natively. The WebSocket transport adds a persistent connection with text-delta input and cancel-based interruption — useful for LLM → TTS pipelines in voice agents. Speed numbers are above in the Performance section.
# pip
pip install supertonic-server
# uv (recommended — fast, isolated)
uv venv --python 3.12
uv pip install supertonic-serverOptional extras:
pip install "supertonic-server[pipecat]" # adds pipecat-ai for the Pipecat example
pip install "supertonic-server[dev]" # adds pytest, httpx, openai for developmentgit clone https://github.com/ARahim3/supertonic-server.git
cd supertonic-server
uv venv --python 3.12
uv pip install -e ".[dev]"# 1. Run the server (first run downloads the model — ~250 MB, one-time)
supertonic-server --port 8000
# 2a. Open the web console
open http://localhost:8000/ # macOS — or just visit it in your browser
# 2b. …or call it directly
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input":"Hello, world.","voice":"alloy","response_format":"mp3"}' \
--output hello.mp3--device auto is the default and picks the best available execution provider:
CUDA (if onnxruntime-gpu is installed and a GPU is present) → CoreML (macOS) → CPU.
Open http://localhost:8000/ in your browser. Built-in single-page console with two views — console (testing voices, tuning parameters, copying ready-to-paste integration snippets) and observatory (live server-side metrics; see Observability below).
In the console view:
- 13 OpenAI voice aliases (
alloy,nova,echo, …) + 10 native Supertonic voices (F1-F5, M1-M5), grouped in the picker - All 31 languages with human names (
English (en),Korean (ko), …) - Speed slider (0.5×–2.0×) and diffusion-step slider (4–16)
mp3 · wav · pcmoutput format selector- Streams via HTTP/1.1 chunked transfer; live TTFB and bytes-received counter while waiting
- HTML5 audio player with seek + canvas waveform + download button
- Live code-snippet panel — 6 tabs: curl · Python (OpenAI) · Python (httpx) · Pipecat · LiveKit · JavaScript
In the observatory view:
- Active synthesis count, total requests, requests-per-second (1m window), errors, audio + bytes served
- p50 / p95 / p99 TTFB and RTF over the recent buffer
- Waterfall feed of recent requests (newest first) with per-row TTFB-vs-stream bar; click any row for full detail
- 1 Hz polling against
/metrics/summaryand/metrics/recent; pause toggle to freeze the view
Dark theme is primary; a light-mode toggle persists in localStorage along with the active view. Run with --no-ui to disable the console route entirely (the API endpoints are unaffected).
Two images are shipped — pick the one that matches your hardware. The mounted volume in each example caches the model weights so subsequent starts skip the ~250 MB download.
Works on every platform — Linux / macOS / Windows containers, x86_64 / arm64. Includes the CPU build of onnxruntime only; no CUDA libraries.
docker build -t supertonic-server .
docker run --rm -p 8000:8000 -v supertonic-cache:/root/.cache supertonic-serverThe default device is auto, which on this image resolves to CPU (CoreML / CUDA execution providers aren't installed). If you pass -e SUPERTONIC_DEVICE=cuda here, the server will log a warning and fall back to CPU — use the CUDA image below for real GPU acceleration.
Linux-only. Bundles a CUDA 12.4 runtime, cuDNN 9, and onnxruntime-gpu. The build step self-checks that CUDAExecutionProvider is actually available inside the image, so a broken build fails fast instead of silently running on CPU at runtime.
Host prerequisites:
- NVIDIA driver ≥ 545 (any driver compatible with the CUDA 12.4 runtime)
nvidia-container-toolkitinstalled and registered with the Docker daemon — see NVIDIA's install guide- Docker Desktop on macOS / Windows cannot pass through NVIDIA GPUs; this image is Linux-only
Sanity-check the host first (must print your GPU details, not an error):
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smiThen build and run:
docker build -f Dockerfile.cuda -t supertonic-server:cuda .
docker run --rm --gpus all -p 8000:8000 \
-v supertonic-cache:/root/.cache \
supertonic-server:cudaThe image sets SUPERTONIC_DEVICE=cuda in its environment, so no extra flags are needed.
Verify that the GPU is actually being used — the startup log should include:
INFO supertonic_server.engine ONNX providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
If you see ['CPUExecutionProvider'] only, the GPU isn't reachable from inside the container — re-check the nvidia-smi step above.
supertonic-server --help
--host TEXT Bind address.
--port INTEGER Bind port.
--device [auto|cpu|coreml|cuda] ONNX execution provider.
--model [supertonic|supertonic-2|supertonic-3]
--model-dir PATH Local model cache dir.
--voice TEXT Default voice (F1-F5, M1-M5).
--lang TEXT Default language code.
--speed FLOAT Default speed (0.5..2.0).
--total-steps INTEGER Diffusion steps (4..16). Lower = faster.
--intra-threads INTEGER ONNX intra-op threads.
--inter-threads INTEGER ONNX inter-op threads.
--max-concurrent INTEGER Concurrent synthesis ops.
--no-warmup Skip startup warmup.
--warmup-text TEXT Custom warmup utterance.
--no-ui Disable the built-in web console at /.
--log-level TEXT debug | info | warning | error.
--reload Auto-reload (dev only).
Every CLI flag also reads from SUPERTONIC_* environment variables (e.g. SUPERTONIC_PORT=9000).
Body:
The response is HTTP/1.1 chunked transfer — audio bytes stream out as each sentence finishes synthesizing. Useful headers:
X-Sample-Rate: 44100X-Voice: F1(the actual Supertonic voice selected, after alias resolution)X-Language: enX-Audio-Encoding: pcm_s16le_44100_1ch(PCM only)
Returns every accepted voice name (OpenAI aliases + Supertonic IDs) with the underlying Supertonic voice each one maps to.
OpenAI-style model list (returns supertonic-3 plus tts-1, tts-1-hd,
gpt-4o-mini-tts as aliases so clients that hard-code those names work).
{"status":"ok","model":"supertonic-3","sample_rate":44100,"voices":[…],"languages":[…],"ws_enabled":true}
WS /v1/audio/speech/stream— see WebSocket TTS.GET /metrics,GET /metrics/summary,GET /metrics/recent— see Observability.
For voice agents pipelining LLM → TTS, the WebSocket endpoint skips per-utterance TCP/TLS setup, takes text deltas as the LLM emits them, and supports explicit cancellation for user interruptions. It's an extension — not part of the OpenAI surface — and the protocol shape mirrors ElevenLabs'.
Endpoint: WS /v1/audio/speech/stream
client → server server → client
{ "type": "config", { "type": "ready", ... }
"voice": "alloy", "lang": "en", { "type": "config_ack", "config": {...} }
"speed": 1.05, "total_steps": 8 } { "type": "audio_start", "ttfb_ms": ... }
{ "type": "text", "text": "Hello, " } { "type": "audio", "chunk": "<base64 PCM>" }
{ "type": "text", "text": "world." } { "type": "audio_end", "stats": {...} }
{ "type": "flush" } { "type": "cancelled" }
{ "type": "cancel" } { "type": "error", "message": "..." }
{ "type": "close" }
Audio chunks are base64-encoded little-endian int16 PCM at the server's sample rate (44.1 kHz mono). flush triggers synthesis of the buffered text; cancel interrupts the in-flight stream and drops the buffer. A complete Python client lives in examples/ws_smoke.py.
Disable the endpoint with SUPERTONIC_WS_ENABLED=0.
Cloud TTS APIs can't show you what's happening server-side. We run locally — we can. Every synthesis (HTTP or WebSocket) is recorded in an in-process ring buffer (configurable, default 100), with live percentile aggregates and three endpoints to read them.
Scrape it from Prometheus / Grafana / Datadog Agent / anything that speaks the text format. No client lib required.
# HELP supertonic_requests_total Number of synthesis requests since process start, by terminal status
# TYPE supertonic_requests_total counter
supertonic_requests_total{status="ok"} 142
supertonic_requests_total{status="error"} 0
supertonic_requests_total{status="cancelled"} 1
# HELP supertonic_ttfb_ms Time-to-first-byte (ms) over the recent window
# TYPE supertonic_ttfb_ms summary
supertonic_ttfb_ms{quantile="0.5"} 481.4
supertonic_ttfb_ms{quantile="0.95"} 661.7
supertonic_ttfb_ms{quantile="0.99"} 684.4
# HELP supertonic_rtf Real-time factor over the recent window
# TYPE supertonic_rtf summary
supertonic_rtf{quantile="0.5"} 0.20
supertonic_rtf{quantile="0.95"} 0.24
supertonic_rtf{quantile="0.99"} 0.31
Plus supertonic_active_synth, supertonic_bytes_total, supertonic_audio_seconds_total, supertonic_uptime_seconds, supertonic_rps_1m, supertonic_error_rate.
Same data as /metrics, JSON-shaped. Used by the Observatory tab in the web console; also handy for custom dashboards.
Recent RequestRecords, newest-first. Each record includes: text snippet (first ~80 chars), text length, voice, lang, format, status (ok / cancelled / error), ttfb_ms, total_ms, bytes, audio_s, rtf, error, and transport (http / ws).
SUPERTONIC_OBSERVABILITY_BUFFER_SIZE (default 100, range 10..10000) sets the ring-buffer depth. Cumulative totals (requests, bytes, audio seconds) survive ring-buffer eviction; percentiles slide with the buffer. No persistence — restart clears the in-memory state.
10 Supertonic presets + OpenAI's 13 standard voice names mapped onto them:
| OpenAI alias | Supertonic | OpenAI alias | Supertonic |
|---|---|---|---|
| alloy | F1 | marin | F3 |
| coral | F2 | nova | F4 |
| sage | F5 | shimmer | F2 |
| verse | F1 | onyx | M1 |
| ash | M1 | ballad | M2 |
| cedar | M3 | echo | M4 |
| fable | M5 |
F1–F5 and M1–M5 also pass through unchanged.
31 supported language codes plus na (fallback): en, ko, ja, ar, bg, cs, da, de, el, es, et, fi, fr, hi, hr, hu, id, it, lt, lv, nl, pl, pt, ro, ru, sk, sl, sv, tr, uk, vi, na.
Pass via the lang field, e.g. {"input": "안녕하세요.", "lang": "ko"}.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
audio = client.audio.speech.create(
model="supertonic-3",
voice="alloy",
input="Drop-in replacement for OpenAI TTS.",
response_format="mp3",
)
audio.stream_to_file("hello.mp3")from pipecat.services.openai.tts import OpenAITTSService, OpenAITTSSettings
tts = OpenAITTSService(
api_key="not-needed",
base_url="http://localhost:8000/v1",
settings=OpenAITTSSettings(model="supertonic-3", voice="nova"),
sample_rate=44100, # supertonic-3 native rate
)
# Plug into any Pipecat pipeline as the TTS service.A standalone smoke test (no full pipeline) lives at examples/pipecat_smoke.py.
Any LiveKit openai.TTS plugin works the same way:
from livekit.plugins import openai
tts = openai.TTS(
base_url="http://localhost:8000/v1",
api_key="not-needed",
model="supertonic-3",
voice="nova",
)--total-steps 4— lower diffusion steps, faster but slightly less expressive.--total-steps 12— higher quality, ~50% slower.--max-concurrent 2— allow two simultaneous syntheses (default 1 to avoid CPU thrashing).--device cpu— skip CoreML/CUDA even when available (more predictable cold start).
First request is very slow (multi-second). Either you started with --no-warmup, or you're using --reload and a save triggered a reload. Restart without --no-warmup; the warmup synthesis pre-compiles the CoreML/CUDA graphs so the first real request lands warm. Typical warmup is 1–3 s on CoreML and 2–3 s on Ampere/Ada/Hopper NVIDIA cards.
Server warmup itself is taking ~60 s on first-ever boot. This is the Blackwell-specific (sm_120, e.g. RTX 5090) one-time cost: onnxruntime-gpu JIT-builds CUDA kernels for the new compute capability and caches them at ~/.nv/ComputeCache/. Every boot after that warms in ~2.5 s. Older NVIDIA cards (Ampere/Ada/Hopper) ship pre-built kernels and don't pay this.
Model download is slow or hangs on first run. The Supertonic-3 weights (~250 MB across 26 files) are pulled from Hugging Face into ~/.cache/supertonic3/. Check your network, or pre-download:
huggingface-cli download Supertone/supertonic-3 --local-dir ~/.cache/supertonic3Audio sounds chipmunked, slowed down, or pitched wrong. A downstream consumer assumed a different sample rate. supertonic-server emits 44100 Hz mono int16. In Pipecat, pass sample_rate=44100 to OpenAITTSService. In LiveKit, configure the audio source for 44.1 kHz. The X-Sample-Rate response header announces this explicitly.
Pipecat logs OpenAI TTS only supports 24000Hz sample rate. Current rate of 44100Hz may cause issues. Cosmetic — Pipecat hard-codes the OpenAI cloud rate. Audio still flows correctly at 44.1 kHz; the rate is set by our sample_rate=44100 constructor argument.
CoreML does not support shapes with dimension values of 0 warnings on macOS. Cosmetic. ONNX Runtime falls back the unsupported subgraphs to CPU; everything still works.
Context leak detected, msgtracer returned -1 on macOS. Cosmetic noise from Apple's tracer. Ignore.
400 from /v1/audio/speech. Body validation failure. Check that voice is in the Voices table or a direct F#/M# ID, lang is in the Languages list (or na), and response_format is one of mp3, wav, pcm.
Empty or truncated audio. The client closed the connection mid-stream. The server cancels pending synthesis but lets any in-flight chunk finish (no way to interrupt a running ONNX call). Subsequent requests are unaffected.
Container is using CPU even with --gpus all / -e SUPERTONIC_DEVICE=cuda. The default Dockerfile ships the CPU build of onnxruntime, so the GPU is unreachable regardless of what you set SUPERTONIC_DEVICE to. The startup log will print:
WARNING supertonic_server.config device=cuda requested but CUDAExecutionProvider is not available …
Build Dockerfile.cuda instead — it bundles onnxruntime-gpu and a CUDA 12 runtime. Also check on the host that docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi works; if that fails, the nvidia-container-toolkit setup is the actual problem.
- Only
mp3,wav,pcmresponse formats over HTTP. (Opus/AAC/FLAC are TODO.) - The WebSocket endpoint emits PCM only (base64-encoded int16). Add MP3/Opus over WS if a downstream user needs it — open an issue.
- No voice cloning at runtime — use Supertone's separate Voice Builder for that.
- Diffusion pipeline is per-chunk, so we stream at sentence granularity, not sub-sentence. This is the standard granularity Pipecat / LiveKit expect.
- Observability is in-process: restart clears the metrics buffer. For long-term retention, scrape
/metricsfrom Prometheus or similar.
- Server code: MIT
- Supertonic-3 model weights: OpenRAIL-M (downloaded automatically from Hugging Face on first run)


{ "model": "supertonic-3", // any string; informational "input": "Text to speak (up to 20k chars).", "voice": "alloy", // see Voices below "response_format": "mp3", // "mp3" | "wav" | "pcm" "speed": 1.05, // 0.5..2.0 "lang": "en", // extension: 31 codes, see below "total_steps": 8 // extension: 4..16 }