OpenAI-compatible F5-TTS server. Production-ready, checkpoint-agnostic, GDPR-clean. 10 archetype voices included.
Runs in production at agoracosmica.org, a German non-profit philosophical-dialogue platform. The server code is MIT. Public F5-TTS checkpoints are CC-BY-NC-4.0, see Checkpoint licensing.
```bash
git clone https://github.com/chipmates/f5-server
cd f5-server
docker compose up -d
# First start downloads ~1.35 GB checkpoint from HuggingFace, then runs forever.
curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Die Philosophie lehrt uns, dass wahre Weisheit nicht im Wissen liegt.","voice":"hyperion"}' \
  --output speech.mp3
```

That's the full setup. The OpenAI Python SDK works too, see Using with the OpenAI SDK.
The server code in this repo is MIT-licensed. The model checkpoints are not.
Every publicly released F5-TTS weight file we know of (the upstream SWivid/F5-TTS base and every German fine-tune including hvoss-techfak, aihpi, marduk-ra, cduvenhorst) is released under CC-BY-NC-4.0, which prohibits commercial use. This is an upstream constraint inherited from the Emilia training dataset, not something specific to this repo.
| Artifact | License | Commercial use? |
|---|---|---|
| Code in this repo (`serve.py`, `Dockerfile`, etc.) | MIT | ✅ Yes |
| `SWivid/F5-TTS` (upstream base, English) | CC-BY-NC-4.0 | ❌ No |
| `hvoss-techfak/F5-TTS-German` (this server's default) | CC-BY-NC-4.0 | ❌ No |
| `aihpi/F5-TTS-German` | CC-BY-NC-4.0 | ❌ No |
| Your own checkpoint trained on MIT/Apache/CC0 data | (your choice) | ✅ Yes |
The server is checkpoint-agnostic. Point F5_DEFAULT_CKPT at any F5-TTS-compatible .safetensors file trained with the F5TTS_Base config (see The F5TTS_Base trap). The default hvoss config exists so docker-compose works out of the box for non-commercial use.
On checkpoint load, the server logs a license notice for any checkpoint in the registry marked NC. There is no UI banner, by design. Bring-your-own checkpoints do not trigger the notice.
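A minimal sketch of how such a load-time notice could work, assuming a registry dict keyed by checkpoint name. The `CHECKPOINT_REGISTRY` name, its `license` and `non_commercial` fields, and the `license_notice` helper are illustrative and not taken from `serve.py`.

```python
import logging
from typing import Optional

logger = logging.getLogger("f5-server.license")

# Hypothetical registry: names and fields are illustrative, not serve.py's.
CHECKPOINT_REGISTRY = {
    "hvoss": {"license": "CC-BY-NC-4.0", "non_commercial": True},
    "aihpi": {"license": "CC-BY-NC-4.0", "non_commercial": True},
}


def license_notice(ckpt: str) -> Optional[str]:
    """Return (and log) a notice for registry checkpoints marked NC, else None."""
    entry = CHECKPOINT_REGISTRY.get(ckpt)
    if entry is None:
        # Bring-your-own checkpoint (for example an absolute path): no notice.
        return None
    if not entry["non_commercial"]:
        return None
    notice = f"Checkpoint '{ckpt}' is {entry['license']}: non-commercial use only."
    logger.warning(notice)
    return notice
```

Calling `license_notice("hvoss")` logs a warning, while a path such as `/models/my_own.safetensors` is not in the registry and stays silent, matching the behavior described above.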
For full detail and commercial-use guidance, see docs/checkpoint-licensing.md.
- OpenAI `/v1/audio/speech` endpoint: drop in as a base URL on the OpenAI Python SDK and it works. JSON in, audio out.
- 10 German archetype voices bundled in `refs/` (~4 MB total). Male and female × elder, mentor, baritone, commanding, narrator, intellectual variants.
- GDPR-clean voice cloning: `POST /tts_clone` uploads audio, generates output, deletes the upload from disk and module caches in a `finally` block. No persistence.
- Docker Compose with first-run HuggingFace checkpoint auto-download. `up -d` and curl.
- Native F5 endpoints (`/tts`, `/tts_clone`) retained for curl-friendly testing.
- CORS support, env-configurable (default `*` for hobby use, scope down for production).
- Multi-instance scale-out via `docker-compose.scale.yml`, the same pattern that runs the agoracosmica.org fleet.
- Healthchecks, structured logging, graceful checkpoint loading.
Not a code fork. The server installs stock pip install f5-tts==1.1.5 and wraps it with the following additions:
| Addition | What it does | Upstream? |
|---|---|---|
| SoMaJo sentence chunker (DE + EN) | Replaces the upstream regex chunker, which breaks on `Dr.`, `z. B.`, `3. Januar`, `Prof.` and similar tokens, producing 1-2 char chunks that F5 renders as garbage. | Not yet (PR planned) |
| Equal-power crossfade | cos / sin curves replace the upstream linear fades, eliminating the ~3 dB seam dip audible between chunks. | Not yet (PR planned) |
| Voice tensor cache | Voice references preprocessed at startup and held in memory. Skips disk and pydub preprocessing on every request. | No |
| GDPR ephemeral `/tts_clone` | Uploaded audio deleted from disk and F5 module caches in a `finally` block. | No |
| OpenAI-shape adapter | `POST /v1/audio/speech` works with the OpenAI Python SDK. | No |
| Auto HF checkpoint download | First-run pulls the checkpoint from HuggingFace. No manual setup. | No |
| `F5TTS_Base` config-trap warning | The `hvoss-techfak/F5-TTS-German` checkpoint must be loaded with `F5TTS_Base`, not the pip default `F5TTS_v1_Base`. Wrong config produces gibberish. | No (silent footgun upstream) |
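The equal-power crossfade row can be checked numerically. The following is a minimal sketch in plain Python, assuming chunks are lists of float samples. `fade_gains`, `crossfade`, and `midpoint_power_db` are illustrative names, not functions from `serve.py` or `f5-tts`.

```python
import math


def fade_gains(n: int, equal_power: bool = True):
    """Return (fade_out, fade_in) gain lists of length n for a crossfade."""
    fade_out, fade_in = [], []
    for i in range(n):
        t = i / (n - 1) if n > 1 else 1.0  # 0.0 at seam start, 1.0 at seam end
        if equal_power:
            fade_out.append(math.cos(t * math.pi / 2))
            fade_in.append(math.sin(t * math.pi / 2))
        else:
            fade_out.append(1.0 - t)  # linear fade, for comparison
            fade_in.append(t)
    return fade_out, fade_in


def crossfade(a, b, n, equal_power=True):
    """Join sample lists a and b, overlapping the last n of a with the first n of b."""
    n = min(n, len(a), len(b))
    if n == 0:
        return list(a) + list(b)
    g_out, g_in = fade_gains(n, equal_power)
    overlap = [a[len(a) - n + i] * g_out[i] + b[i] * g_in[i] for i in range(n)]
    return list(a[:-n]) + overlap + list(b[n:])


def midpoint_power_db(equal_power: bool) -> float:
    """Summed power of both gains at the seam midpoint, in dB relative to unity."""
    g_out, g_in = fade_gains(3, equal_power)  # index 1 is t = 0.5
    return 10 * math.log10(g_out[1] ** 2 + g_in[1] ** 2)
```

With the linear fade both gains are 0.5 at the midpoint, so the summed power is 0.5, or about -3 dB. With the cos / sin pair the squared gains always sum to 1, so the power stays flat across the seam. The power sum assumes the two chunks are uncorrelated at the seam.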
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",  # No auth by default. Front with nginx + bearer for production.
)

with client.audio.speech.with_streaming_response.create(
    model="f5-tts",  # Ignored. The server uses F5_DEFAULT_CKPT.
    voice="hyperion",  # See /v1/audio/voices for the full list.
    input="Die Philosophie lehrt uns, dass wahre Weisheit nicht im Wissen liegt.",
    response_format="mp3",
    speed=1.0,
) as resp:
    resp.stream_to_file("out.mp3")
```

Works identically with the JS / TS SDK, LangChain, LiteLLM, and any OpenAI-compatible wrapper.
Request body for `POST /v1/audio/speech`:

```jsonc
{
  "model": "f5-tts",          // Ignored. For SDK compatibility.
  "input": "Text to synthesize",
  "voice": "hyperion",        // Cosmic name or legacy slug. See /v1/audio/voices.
  "response_format": "mp3",   // wav | mp3 | opus | flac | aac | pcm
  "speed": 1.0,               // 0.5 to 2.0
  "language": "de"            // de | en. Auto from voice if omitted.
}
```

Returns audio bytes with the right `Content-Type`. The `X-Inference-Ms` header reports inference time.
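For clients that do not use the OpenAI SDK, the request can be built with the standard library alone. This is a sketch under the assumption that the server accepts exactly the fields and ranges listed above. The helper names `build_speech_request` and `synthesize` are made up for illustration, and the client-side validation is a convenience, not a description of what the server enforces.

```python
import json
import urllib.request

ALLOWED_FORMATS = {"wav", "mp3", "opus", "flac", "aac", "pcm"}


def build_speech_request(text, voice="hyperion", response_format="mp3",
                         speed=1.0, language=None,
                         base_url="http://localhost:8000"):
    """Build (but do not send) a POST /v1/audio/speech request."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    body = {"model": "f5-tts", "input": text, "voice": voice,
            "response_format": response_format, "speed": speed}
    if language is not None:  # when omitted, the server derives it from the voice
        body["language"] = language
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(req, out_path):
    """Send the request, save the audio, return the X-Inference-Ms header value."""
    with urllib.request.urlopen(req) as resp:
        with open(out_path, "wb") as fh:
            fh.write(resp.read())
        return resp.headers.get("X-Inference-Ms")
```

A typical call would be `synthesize(build_speech_request("Hallo, das ist ein Test."), "out.mp3")` against a running server.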
`/v1/audio/voices` returns the voice catalog:

```json
{
  "voices": [
    {
      "id": "lyra",
      "slug": "f1_warm_wise_v1",
      "name": "Lyra",
      "character": "warm-wise",
      "language": "de",
      "gender": "female",
      "rank": 1,
      "tags": ["female", "wise", "warm"]
    }
  ]
}
```

The `voice` field accepts either the cosmic id (e.g., `lyra`) or the legacy slug (e.g., `f1_warm_wise_v1`). Slug aliases match the chipmates/agoracosmica Local Mode voice catalog, so client code targeting either server works without changes.
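One way to support both identifiers is a single lookup table keyed by id and slug. The sketch below assumes entries shaped like the catalog response. Case-insensitive matching and the fallback to `hyperion` for an empty value are assumptions made for illustration, and `build_index` / `resolve_voice` are hypothetical names.

```python
# Two entries copied from the voice tables; the real catalog has ten.
VOICES = [
    {"id": "lyra", "slug": "f1_warm_wise_v1"},
    {"id": "hyperion", "slug": "m1_warm_elder_v2"},
]


def build_index(voices):
    """Map every cosmic id and every legacy slug to its voice entry."""
    index = {}
    for voice in voices:
        index[voice["id"].lower()] = voice
        slug = voice.get("slug")  # the slug is optional
        if slug:
            index[slug.lower()] = voice
    return index


def resolve_voice(name, index, default="hyperion"):
    """Resolve a request's voice field; use the default when it is empty."""
    key = (name or default).strip().lower()
    if key not in index:
        raise KeyError(f"unknown voice: {name!r}")
    return index[key]
```

Because both keys point at the same entry, `resolve_voice("lyra", idx)` and `resolve_voice("f1_warm_wise_v1", idx)` return the same object.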
The native `/tts` endpoint is useful for terminal testing without crafting a JSON body. It returns WAV by default. Pass `response_format=mp3` for compressed output.

```bash
curl -G http://localhost:8000/tts \
  --data-urlencode 'text=Hallo, das ist ein Test.' \
  --data-urlencode 'voice=hyperion' \
  --data-urlencode 'response_format=mp3' \
  --output test.mp3
```

`POST /tts_clone` synthesizes text in a voice cloned from an uploaded reference:

```bash
curl -X POST http://localhost:8000/tts_clone \
  -F text="Text in the cloned voice" \
  -F ref_text="What is spoken in the reference audio" \
  -F ref_audio=@your_voice_sample.wav \
  --output clone.wav
```

Uploaded audio is deleted from disk and all F5-TTS module caches in a `finally` block after generation. No on-disk persistence of user audio. This is the GDPR path agoracosmica.org uses in production.
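The delete-in-`finally` pattern can be sketched as follows. This is an illustration, not the code in `serve.py`: `clone_once` and its `synthesize` callback are hypothetical, and clearing the F5-TTS module caches is left out because the cache names are not given here.

```python
import os
import tempfile


def clone_once(upload_bytes: bytes, synthesize):
    """Write the upload to a temp file, run synthesize(path), always delete the file."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(upload_bytes)
        return synthesize(path)
    finally:
        # Runs on success and on error, so the upload never outlives the request.
        if os.path.exists(path):
            os.remove(path)
```

The point of `finally` over a plain cleanup call is the error path: if synthesis raises, the reference audio is still removed before the exception propagates.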
Healthcheck response:

```json
{"status": "ok", "version": "0.1.0", "checkpoint": "hvoss", "voices": 10, "cached_voices": 10}
```

The server bundles ten German philosophical-archetype voices, created specifically for agoracosmica.org. Each voice has a cosmic name (the public API identifier) and a technical slug (a legacy alias matching the chipmates/agoracosmica Local Mode catalog). Reference audio and the spoken reference text live in `voices.json`.
Female voices, ranked from internal curation:
| Cosmic name | Character | Slug alias | Tags |
|---|---|---|---|
| `lyra` | warm-wise | `f1_warm_wise_v1` | female, wise, warm |
| `astra` | commanding | `f5_deep_commanding_v2` | female, commanding, deep |
| `vega` | wise | `f1_warm_wise_v2` | female, wise, warm |
| `andromeda` | intellectual | `f2_commanding_thinker_v1` | female, commanding, thinker |
| `ceres` | nurturing | `f1_warm_mentor_v2` | female, mentor, warm |
Male voices:
| Cosmic name | Character | Slug alias | Tags |
|---|---|---|---|
| `solaris` | narrator | `m3_rich_narrator_v3` | male, narrator, rich |
| `umbra` | baritone | `m5_rich_baritone_v1` | male, baritone, rich |
| `phoenix` | elder | `m1_warm_elder_v3` | male, elder, warm, deep |
| `hyperion` | elder (default) | `m1_warm_elder_v2` | male, elder, warm, deep |
| `corvus` | intellectual | `m3_intellectual_v2` | male, intellectual, thoughtful |
To add your own voice: drop your_voice.wav into refs/, add an entry to voices.json with the cosmic id, technical slug (optional), and the spoken reference text, then restart.
All configuration is done via environment variables (see `.env.example`):
| Variable | Default | Purpose |
|---|---|---|
| `F5_DEFAULT_CKPT` | `hvoss` | Checkpoint name. Known: `hvoss`, `aihpi`. Or an absolute path to your own. |
| `F5_DEFAULT_LANGUAGE` | `de` | `de` or `en`. SoMaJo handles both. Other languages fall through to regex. |
| `F5_SPEED` | `1.0` | Engine-native pace (~180 WPM DE). Lower is slower. |
| `CORS_ALLOW_ORIGINS` | `*` | CORS origins. Comma-separated for production scoping. |
| `OMP_NUM_THREADS` | `4` | Threads per instance. Keep low when running multiple instances. |
| `PORT` | `8000` | Server port. |
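As a rough illustration of how these variables and defaults fit together, here is a sketch of a settings loader. The `Settings` dataclass and `load_settings` function are hypothetical and not part of `serve.py`. Only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    ckpt: str
    language: str
    speed: float
    cors_origins: List[str]
    omp_threads: int
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    raw_origins = env.get("CORS_ALLOW_ORIGINS", "*")
    # Comma-separated list; surrounding whitespace and empty items are dropped.
    origins = [o.strip() for o in raw_origins.split(",") if o.strip()]
    return Settings(
        ckpt=env.get("F5_DEFAULT_CKPT", "hvoss"),
        language=env.get("F5_DEFAULT_LANGUAGE", "de"),
        speed=float(env.get("F5_SPEED", "1.0")),
        cors_origins=origins or ["*"],
        omp_threads=int(env.get("OMP_NUM_THREADS", "4")),
        port=int(env.get("PORT", "8000")),
    )
```

For production scoping, `CORS_ALLOW_ORIGINS="https://a.example, https://b.example"` would parse to a two-item origin list instead of the wildcard.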
This server runs the German voice path at agoracosmica.org, a non-profit philosophical-dialogue platform serving daily users since March 2026.
- Capacity: load-tested at N=400 concurrent BBA-shape sessions with zero server errors.
- Fleet pattern: dual-instance EU deployment, four F5 workers per host, sustained ~150 concurrent German sessions per shift.
- GDPR profile: the server does not log request text. The `/tts_clone` endpoint deletes uploaded audio from disk and F5 module caches in a `finally` block. No on-disk persistence by default. The legal jurisdiction of inference is determined by where you host the server.
F5-TTS is the speed tier in this fleet. Internal blind comparisons on German philosophical content favor Qwen3-TTS for quality, so the agoracosmica.org primary path routes through Qwen and F5 handles overflow capacity and fast-fallback. F5 alone is excellent for conversational German and high-throughput batch synthesis. If you are building a brand-voice experience for production users, A/B against alternatives.
The F5TTS_Base trap deserves its own section because it bites everyone who tries `pip install f5-tts` with a German checkpoint.
```python
# Correct: produces intelligible German speech
from f5_tts.api import F5TTS

model = F5TTS(model="F5TTS_Base", ckpt_file="hvoss_german.safetensors")

# Wrong: produces gibberish (garbled, fast, unintelligible)
model = F5TTS(ckpt_file="hvoss_german.safetensors")  # defaults to F5TTS_v1_Base
```

The `hvoss-techfak/F5-TTS-German` checkpoint was trained with the `F5TTS_Base` config, not the newer `F5TTS_v1_Base`. The two configs differ in `pe_attn_head` (1 vs null) and `text_mask_padding` (False vs True). Loading with the wrong config means wrong positional encoding. The model loses spatial awareness and outputs garbled audio.
This server passes model="F5TTS_Base" explicitly in serve.py:get_model. Any future checkpoint added to HF_CHECKPOINTS that was trained on F5TTS_v1_Base would need to set the right config.
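One way to keep the config choice next to the checkpoint is to store it in the registry entry. The sketch below is an assumption about shape only: the real `HF_CHECKPOINTS` in `serve.py` may look different, and `model_cfg` and `model_config_for` are hypothetical names.

```python
# Hypothetical shape; the real HF_CHECKPOINTS in serve.py may differ.
HF_CHECKPOINTS = {
    "hvoss": {"model_cfg": "F5TTS_Base"},
}


def model_config_for(ckpt: str, byo_default: str = "F5TTS_Base") -> str:
    """Return the F5-TTS config name to load a checkpoint with."""
    entry = HF_CHECKPOINTS.get(ckpt)
    if entry is None:
        # Bring-your-own path: assume F5TTS_Base unless the caller says otherwise.
        return byo_default
    return entry["model_cfg"]
```

The result would then be passed as the `model` argument, as in `F5TTS(model=model_config_for(name), ckpt_file=path)`, so a checkpoint trained on `F5TTS_v1_Base` only needs a different registry value.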
F5-TTS 1.1.5's built-in EPSS (Empirically Pruned Step Sampling, Interspeech 2025) schedule only activates at NFE 7. At 8 it silently falls back to uniform linspace and you lose roughly 14% throughput. This server pins NFE 7 by default and warns if you override it.
EPSS is upstream. Anyone who pip install f5-tts already has it. The contribution here is configuring it correctly and documenting the footgun. If you set nfe_step=8 thinking "more is better," the server logs:
```text
WARNING: nfe_step=8 disables EPSS schedule. EPSS = 7-step at zero quality loss.
See README "EPSS: configured correctly, not implemented".
```
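The guard behind such a warning can be very small. The following sketch is illustrative, not the server's code: `check_nfe` and `extra_step_cost` are hypothetical helpers, the log text is a paraphrase, and the cost estimate assumes request time scales linearly with the number of solver steps.

```python
import logging

logger = logging.getLogger("f5-server.epss")

EPSS_NFE = 7  # per the text above, the step count at which EPSS is active


def check_nfe(nfe_step: int) -> bool:
    """Return True if EPSS stays active; log a warning otherwise."""
    if nfe_step == EPSS_NFE:
        return True
    logger.warning("nfe_step=%d disables EPSS schedule.", nfe_step)
    return False


def extra_step_cost(nfe_step: int, baseline: int = EPSS_NFE) -> float:
    """Relative increase in solver steps versus the baseline."""
    return nfe_step / baseline - 1.0
```

Under the linear-cost assumption, going from 7 to 8 steps adds 8/7 - 1, or about 14%, more solver work per request, which is in line with the figure quoted above.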
The CORS config means you can expose this server on homelab.local:8000, point a browser-based client (running on a separate device such as an iPad, work laptop, or phone) at it via https://yourapp.com settings, and the audio never leaves your LAN. Same pattern as chipmates/agoracosmica Local Mode: hosted UI, self-hosted inference, zero cloud TTS dependency.
For Tailscale users: the server binds 0.0.0.0 by default. Any device on your tailnet can reach it once you forward port 8000.
- CPU-bound at ~60%, GPU-bound at ~40%. Adding instances scales better than adding threads. Four instances × four cores beats one instance × sixteen threads (verified 5.3 vs 4.7 req/s on a 16-core box).
- Voice tensor cache at startup: every entry in `voices.json` gets its reference audio preprocessed (silence trim, normalize) and converted to a tensor, then held in memory. Even the first inference request for a known voice skips ~150 ms of disk and pydub work.
- CUDA cache eviction (`torch.cuda.empty_cache()`) after each inference mitigates a ~200 MB/instance/day VRAM creep that the PyTorch caching allocator can produce across many varying-shape ODE solver runs. The overhead is a few milliseconds, and it removes the need for weekly restarts.
- No `torch.compile`: tested on F5's DiT, it makes inference slower (130 ms to 230 ms). The ODE solver calls the transformer 14 times per request, so the compilation wrapper overhead exceeds the kernel fusion gains.
- MPS support: stock `f5-tts` supports Apple Silicon via PyTorch MPS, though we have not benchmarked it. CPU-only inference is slower than real time (RTF > 1) in community reports, so it is not viable for live response.
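The CUDA cache eviction described in the list above can be wrapped around any inference call. This is a sketch under assumptions: `run_inference` is a hypothetical helper, and the guarded import only exists so the example also runs on machines without PyTorch.

```python
try:
    import torch  # optional here so the sketch also runs without PyTorch
except ImportError:
    torch = None


def run_inference(infer, *args, **kwargs):
    """Call infer(...) and release cached CUDA blocks afterwards, even on error."""
    try:
        return infer(*args, **kwargs)
    finally:
        if torch is not None and torch.cuda.is_available():
            # Returns unused cached blocks to the driver; live tensors are untouched.
            torch.cuda.empty_cache()
```

Putting the call in `finally` means a failed request releases its cached blocks too, which matters when the creep comes from many differently shaped runs.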
- Crossfade fix as an upstream PR to `SWivid/F5-TTS` (language-agnostic, strict improvement).
- SoMaJo chunker as an opt-in upstream PR (DE + EN today, needs a CJK fallback for upstream compatibility).
- Multi-checkpoint switching at runtime (currently one checkpoint per instance; the API accepts `ckpt`, but model swap-out is not optimized).
- Bearer-token auth (currently none; production users front it with nginx + token).
- Prometheus `/metrics` endpoint (production uses one, but it needs cleanup for OSS).
Issues and pull requests are welcome. The two upstream contributions planned for SWivid/F5-TTS (crossfade, SoMaJo chunker) live in serve.py as inline shims for now. Happy to coordinate if you want to land them upstream first.
- Code (this repo): MIT, see LICENSE.
- Bundled voice references (`refs/*.wav`): MIT, created specifically for this project.
- Model checkpoints auto-downloaded by docker-compose: licensed separately by their creators. All currently-supported checkpoints are CC-BY-NC-4.0. See docs/checkpoint-licensing.md.
- SWivid et al. for the F5-TTS architecture and base implementation.
- hvoss-techfak for the German F5-TTS checkpoint training.
- aihpi for an alternative German F5-TTS checkpoint training.
- Mozilla Common Voice for the open German speech dataset used in upstream German checkpoint training.
- SoMaJo for the German + English sentence tokenizer.
- The Interspeech 2025 EPSS paper for the step-sampling optimization that F5-TTS 1.1.5 ships.
The 10 reference voices in refs/ were synthesized iteratively using Qwen3-TTS, then curated to represent archetypal speaking styles. None are recordings of real people or derived from third-party voice samples.
Built and stress-tested in production for agoracosmica.org. Open-sourced for any team that needs an OpenAI-shaped voice cloning server without the cloud.