f5-server

OpenAI-compatible F5-TTS server. Production-ready, checkpoint-agnostic, GDPR-clean. 10 archetype voices included.

Runs in production at agoracosmica.org, a German non-profit philosophical-dialogue platform. The server code is MIT. Public F5-TTS checkpoints are CC-BY-NC-4.0, see Checkpoint licensing.

Quickstart

git clone https://github.com/chipmates/f5-server
cd f5-server
docker compose up -d
# First start downloads ~1.35 GB checkpoint from HuggingFace, then runs forever.

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Die Philosophie lehrt uns, dass wahre Weisheit nicht im Wissen liegt.","voice":"hyperion"}' \
  --output speech.mp3

That's the full setup. The OpenAI Python SDK works too, see Using with the OpenAI SDK.

Checkpoint licensing: read this first

The server code in this repo is MIT-licensed. The model checkpoints are not.

Every publicly released F5-TTS weight file we know of (the upstream SWivid/F5-TTS base and every German fine-tune including hvoss-techfak, aihpi, marduk-ra, cduvenhorst) is released under CC-BY-NC-4.0, which prohibits commercial use. This is an upstream constraint inherited from the Emilia training dataset, not something specific to this repo.

Artifact	License	Commercial use?
Code in this repo (`serve.py`, `Dockerfile`, etc.)	MIT	✅ Yes
`SWivid/F5-TTS` (upstream base, English)	CC-BY-NC-4.0	❌ No
`hvoss-techfak/F5-TTS-German` (this server's default)	CC-BY-NC-4.0	❌ No
`aihpi/F5-TTS-German`	CC-BY-NC-4.0	❌ No
Your own checkpoint trained on MIT/Apache/CC0 data	(your choice)	✅ Yes

The server is checkpoint-agnostic. Point F5_DEFAULT_CKPT at any F5-TTS-compatible .safetensors file trained with the F5TTS_Base config (see The F5TTS_Base trap). The default hvoss config exists so docker-compose works out of the box for non-commercial use.

On checkpoint load, the server logs a license notice for any checkpoint in the registry marked NC. There is no UI banner, by design. Bring-your-own checkpoints do not trigger the notice.
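
A minimal sketch of how a load-time notice like this could be wired up. It assumes a registry that records a license string per known checkpoint. The names `CHECKPOINT_LICENSES`, `license_notice`, and `on_checkpoint_load` are illustrative and not taken from serve.py.

```python
import logging
from typing import Optional

log = logging.getLogger("f5-server")

# Hypothetical registry: known checkpoint names mapped to their license.
CHECKPOINT_LICENSES = {
    "hvoss": "CC-BY-NC-4.0",
    "aihpi": "CC-BY-NC-4.0",
}


def license_notice(ckpt: str) -> Optional[str]:
    """Return a notice for registry checkpoints marked NC, else None."""
    license_id = CHECKPOINT_LICENSES.get(ckpt)
    if license_id is None or "-NC-" not in license_id:
        # Bring-your-own path or a permissive license: no notice.
        return None
    return f"Checkpoint '{ckpt}' is licensed {license_id}: non-commercial use only."


def on_checkpoint_load(ckpt: str) -> None:
    """Log the notice once, at load time. No UI banner is involved."""
    notice = license_notice(ckpt)
    if notice:
        log.warning(notice)
```

A path such as `/models/my_own.safetensors` is not in the registry, so it produces no notice, which matches the bring-your-own behavior described above.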

For full detail and commercial-use guidance, see docs/checkpoint-licensing.md.

What you get

OpenAI /v1/audio/speech endpoint: drop in as a base URL on the OpenAI Python SDK and it works. JSON in, audio out.
10 German archetype voices bundled in refs/ (~4 MB total). Male and female × elder, mentor, baritone, commanding, narrator, intellectual variants.
GDPR-clean voice cloning: POST /tts_clone uploads audio, generates output, deletes the upload from disk and module caches in a finally block. No persistence.
Docker Compose with first-run HuggingFace checkpoint auto-download. up -d and curl.
Native F5 endpoints (/tts, /tts_clone) retained for curl-friendly testing.
CORS support, env-configurable (default * for hobby use, scope down for production).
Multi-instance scale-out via docker-compose.scale.yml, the same pattern that runs the agoracosmica.org fleet.
Healthchecks, structured logging, graceful checkpoint loading.

What's different from upstream `SWivid/F5-TTS`

Not a code fork. The server installs stock pip install f5-tts==1.1.5 and wraps it with the following additions:

Addition	What it does	Upstream?
SoMaJo sentence chunker (DE + EN)	Replaces the upstream regex chunker, which breaks on `Dr.`, `z. B.`, `3. Januar`, `Prof.` and similar tokens, producing 1-2 char chunks that F5 renders as garbage.	Not yet (PR planned)
Equal-power crossfade	`cos / sin` curves replace the upstream linear fades, eliminating the ~3 dB seam dip audible between chunks.	Not yet (PR planned)
Voice tensor cache	Voice references preprocessed at startup and held in memory. Skips disk and `pydub` preprocessing on every request.	No
GDPR ephemeral `/tts_clone`	Uploaded audio deleted from disk and F5 module caches in a `finally` block.	No
OpenAI-shape adapter	`POST /v1/audio/speech` works with the OpenAI Python SDK.	No
Auto HF checkpoint download	First-run pulls the checkpoint from HuggingFace. No manual setup.	No
`F5TTS_Base` config-trap warning	The `hvoss-techfak/F5-TTS-German` checkpoint must be loaded with `F5TTS_Base`, not the pip default `F5TTS_v1_Base`. Wrong config produces gibberish.	No (silent footgun upstream)
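
To illustrate the crossfade row, here is a pure-Python sketch of linear versus equal-power fades. It is not the code in serve.py: the function names are illustrative, and the 3 dB figure assumes the two chunks are uncorrelated at the seam, so their powers (not amplitudes) add.

```python
import math


def fade_gains(n, equal_power=True):
    """Return (fade_out, fade_in) gain lists of length n for a crossfade."""
    fade_out, fade_in = [], []
    for i in range(n):
        t = i / (n - 1) if n > 1 else 1.0  # 0.0 .. 1.0 across the overlap
        if equal_power:
            fade_out.append(math.cos(t * math.pi / 2))
            fade_in.append(math.sin(t * math.pi / 2))
        else:
            fade_out.append(1.0 - t)
            fade_in.append(t)
    return fade_out, fade_in


def crossfade(a, b, overlap, equal_power=True):
    """Join sample lists a and b, blending the last/first `overlap` samples."""
    overlap = max(0, min(overlap, len(a), len(b)))
    g_out, g_in = fade_gains(overlap, equal_power)
    head = list(a[:len(a) - overlap])
    blended = [a[len(a) - overlap + i] * g_out[i] + b[i] * g_in[i]
               for i in range(overlap)]
    return head + blended + list(b[overlap:])


def midpoint_power_db(equal_power):
    """Summed power of two uncorrelated unit signals at the fade midpoint."""
    g_out, g_in = fade_gains(3, equal_power)  # index 1 is the midpoint
    return 10 * math.log10(g_out[1] ** 2 + g_in[1] ** 2)
```

At the midpoint the linear gains are 0.5 and 0.5, so the summed power is 0.25 + 0.25 = 0.5, about 3 dB down. The cos / sin gains satisfy cos² + sin² = 1 at every point, so the summed power stays constant across the seam.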

Using with the OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",  # No auth by default. Front with nginx + bearer for production.
)

with client.audio.speech.with_streaming_response.create(
    model="f5-tts",     # Ignored. The server uses F5_DEFAULT_CKPT.
    voice="hyperion",   # See /v1/audio/voices for the full list.
    input="Die Philosophie lehrt uns, dass wahre Weisheit nicht im Wissen liegt.",
    response_format="mp3",
    speed=1.0,
) as resp:
    resp.stream_to_file("out.mp3")

Works identically with the JS / TS SDK, LangChain, LiteLLM, and any OpenAI-compatible wrapper.

API

`POST /v1/audio/speech`: OpenAI-shape

{
  "model": "f5-tts",           // Ignored. For SDK compatibility.
  "input": "Text to synthesize",
  "voice": "hyperion",         // Cosmic name or legacy slug. See /v1/audio/voices.
  "response_format": "mp3",    // wav | mp3 | opus | flac | aac | pcm
  "speed": 1.0,                // 0.5 to 2.0
  "language": "de"             // de | en. Auto from voice if omitted.
}

Returns audio bytes with the right Content-Type. The X-Inference-Ms header reports inference time.
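
For clients that do not use the OpenAI SDK, the request can be built with the standard library alone. The sketch below is illustrative: the helper names are hypothetical, and the client-side validation of `response_format` and `speed` mirrors the ranges listed above rather than any documented server behavior.

```python
import json
import urllib.request

ALLOWED_FORMATS = {"wav", "mp3", "opus", "flac", "aac", "pcm"}


def build_speech_request(text, voice="hyperion", response_format="mp3",
                         speed=1.0, base_url="http://localhost:8000"):
    """Build (but do not send) a POST /v1/audio/speech request."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    body = {
        "model": "f5-tts",  # ignored by the server, kept for SDK parity
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request, write the audio to out_path, return X-Inference-Ms."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
        return resp.headers.get("X-Inference-Ms")
```

Calling `synthesize("Hallo, das ist ein Test.", "out.mp3", voice="lyra")` against a running server writes the audio file and returns the inference time reported in the header.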

`GET /v1/audio/voices`: list available voices

{
  "voices": [
    {
      "id": "lyra",
      "slug": "f1_warm_wise_v1",
      "name": "Lyra",
      "character": "warm-wise",
      "language": "de",
      "gender": "female",
      "rank": 1,
      "tags": ["female", "wise", "warm"]
    }
  ]
}

The voice field accepts either the cosmic id (e.g., lyra) or the legacy slug (e.g., f1_warm_wise_v1). Slug aliases match the chipmates/agoracosmica Local Mode voice catalog, so client code targeting either server works without changes.
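
The id-or-slug lookup can be sketched as a single alias index. This is an assumption about the approach, not the code in serve.py. The function names are hypothetical, and raising on an unknown voice is an illustrative choice (the server may instead fall back to a default).

```python
VOICES = [
    {"id": "lyra", "slug": "f1_warm_wise_v1"},
    {"id": "hyperion", "slug": "m1_warm_elder_v2"},
]


def build_voice_index(voices):
    """Map every cosmic id and every legacy slug to the cosmic id."""
    index = {}
    for voice in voices:
        index[voice["id"]] = voice["id"]
        if voice.get("slug"):  # the slug is optional
            index[voice["slug"]] = voice["id"]
    return index


def resolve_voice(name, index):
    """Return the cosmic id for a cosmic id or slug; raise if unknown."""
    key = name.strip().lower()
    if key not in index:
        raise ValueError(f"unknown voice: {name!r}")
    return index[key]
```

With this index, `resolve_voice("f1_warm_wise_v1", index)` and `resolve_voice("lyra", index)` both resolve to `lyra`, so clients written against either catalog hit the same voice.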

`GET /tts`: native F5 GET (curl-friendly)

Useful for terminal testing without crafting a JSON body. Returns WAV by default. Pass response_format=mp3 for compressed output.

curl -G http://localhost:8000/tts \
  --data-urlencode 'text=Hallo, das ist ein Test.' \
  --data-urlencode 'voice=hyperion' \
  --data-urlencode 'response_format=mp3' \
  --output test.mp3

`POST /tts_clone`: voice cloning from arbitrary audio (GDPR-clean)

curl -X POST http://localhost:8000/tts_clone \
  -F text="Text in the cloned voice" \
  -F ref_text="What is spoken in the reference audio" \
  -F ref_audio=@your_voice_sample.wav \
  --output clone.wav

Uploaded audio is deleted from disk and all F5-TTS module caches in a finally block after generation. No on-disk persistence of user audio. This is the GDPR path agoracosmica.org uses in production.
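
The delete-in-finally pattern can be sketched with the standard library. This is a simplified illustration, not the code in serve.py: `with_ephemeral_upload` and the optional `purge_caches` callback are hypothetical names, and the callback stands in for whatever clears the F5 module caches.

```python
import os
import tempfile


def with_ephemeral_upload(audio_bytes, synthesize, purge_caches=None):
    """Write the upload to a temp file, run synthesize(path), always clean up."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio_bytes)
        return synthesize(path)
    finally:
        # Runs on success and on error, so the upload never outlives the request.
        if os.path.exists(path):
            os.remove(path)
        if purge_caches is not None:
            purge_caches()  # hypothetical hook for in-memory caches
```

Because the cleanup sits in `finally`, a failed generation removes the upload just as a successful one does.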

`GET /health`

{"status": "ok", "version": "0.1.0", "checkpoint": "hvoss", "voices": 10, "cached_voices": 10}
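
A readiness check can be built on this response. The sketch below assumes that `cached_voices` equal to `voices` means every voice reference has been preprocessed. That reading is inferred from the field names, and `is_ready` is a hypothetical helper.

```python
import json


def is_ready(health_body):
    """Return True when status is ok and every voice is cached."""
    health = json.loads(health_body)
    return (
        health.get("status") == "ok"
        and health.get("cached_voices") == health.get("voices")
    )
```

A deploy script could poll `/health` and wait for `is_ready` to return True before routing traffic to a new instance.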

Voices included

Ten German philosophical-archetype voices, created specifically for agoracosmica.org. Each voice has a cosmic name (the public API identifier) and a technical slug (a legacy alias matching the chipmates/agoracosmica Local Mode catalog). Reference audio and the spoken reference text live in voices.json.

Female voices, ranked from internal curation:

Cosmic name	Character	Slug alias	Tags
`lyra`	warm-wise	`f1_warm_wise_v1`	female, wise, warm
`astra`	commanding	`f5_deep_commanding_v2`	female, commanding, deep
`vega`	wise	`f1_warm_wise_v2`	female, wise, warm
`andromeda`	intellectual	`f2_commanding_thinker_v1`	female, commanding, thinker
`ceres`	nurturing	`f1_warm_mentor_v2`	female, mentor, warm

Male voices:

Cosmic name	Character	Slug alias	Tags
`solaris`	narrator	`m3_rich_narrator_v3`	male, narrator, rich
`umbra`	baritone	`m5_rich_baritone_v1`	male, baritone, rich
`phoenix`	elder	`m1_warm_elder_v3`	male, elder, warm, deep
`hyperion`	elder (default)	`m1_warm_elder_v2`	male, elder, warm, deep
`corvus`	intellectual	`m3_intellectual_v2`	male, intellectual, thoughtful

To add your own voice: drop your_voice.wav into refs/, add an entry to voices.json with the cosmic id, technical slug (optional), and the spoken reference text, then restart.

Configuration

All via environment variables (see .env.example):

Variable	Default	Purpose
`F5_DEFAULT_CKPT`	`hvoss`	Checkpoint name. Known: `hvoss`, `aihpi`. Or an absolute path to your own.
`F5_DEFAULT_LANGUAGE`	`de`	`de` or `en`. SoMaJo handles both. Other languages fall through to regex.
`F5_SPEED`	`1.0`	Engine-native pace (~180 WPM DE). Lower is slower.
`CORS_ALLOW_ORIGINS`	`*`	CORS origins. Comma-separated for production scoping.
`OMP_NUM_THREADS`	`4`	Threads per instance. Keep low when running multiple instances.
`PORT`	`8000`	Server port.
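
The table above maps naturally onto a small settings loader. This is a sketch under the defaults listed in the table. The `Settings` class and `load_settings` function are illustrative names, not serve.py's actual structure.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class Settings:
    ckpt: str
    language: str
    speed: float
    cors_origins: List[str]
    omp_threads: int
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    origins = env.get("CORS_ALLOW_ORIGINS", "*")
    return Settings(
        ckpt=env.get("F5_DEFAULT_CKPT", "hvoss"),
        language=env.get("F5_DEFAULT_LANGUAGE", "de"),
        speed=float(env.get("F5_SPEED", "1.0")),
        # Comma-separated list; surrounding whitespace is ignored.
        cors_origins=[o.strip() for o in origins.split(",") if o.strip()],
        omp_threads=int(env.get("OMP_NUM_THREADS", "4")),
        port=int(env.get("PORT", "8000")),
    )
```

For example, `CORS_ALLOW_ORIGINS="https://a.example, https://b.example"` yields a two-element origin list, while an unset variable yields `["*"]`.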

Production pedigree

This server runs the German voice path at agoracosmica.org, a non-profit philosophical-dialogue platform serving daily users since March 2026.

Capacity: load-tested at N=400 concurrent BBA-shape sessions with zero server errors.
Fleet pattern: dual-instance EU deployment, four F5 workers per host, sustained ~150 concurrent German sessions per shift.
GDPR profile: the server does not log request text. The /tts_clone endpoint deletes uploaded audio from disk and F5 module caches in a finally block. No on-disk persistence by default. The legal jurisdiction of inference is determined by where you host the server.

F5-TTS is the speed tier in this fleet. Internal blind comparisons on German philosophical content favor Qwen3-TTS for quality, so the agoracosmica.org primary path routes through Qwen and F5 handles overflow capacity and fast-fallback. F5 alone is excellent for conversational German and high-throughput batch synthesis. If you are building a brand-voice experience for production users, A/B against alternatives.

The `F5TTS_Base` trap

This deserves its own section because it bites everyone who tries pip install f5-tts with a German checkpoint.

# Correct: produces intelligible German speech
from f5_tts.api import F5TTS
model = F5TTS(model="F5TTS_Base", ckpt_file="hvoss_german.safetensors")

# Wrong: produces gibberish (garbled, fast, unintelligible)
model = F5TTS(ckpt_file="hvoss_german.safetensors")  # defaults to F5TTS_v1_Base

The hvoss-techfak/F5-TTS-German checkpoint was trained with the F5TTS_Base config, not the newer F5TTS_v1_Base. The two configs differ in pe_attn_head (1 vs null) and text_mask_padding (False vs True). Loading the checkpoint with the wrong config applies positional-encoding and padding-mask settings the weights were not trained with, so the model outputs garbled audio.

This server passes model="F5TTS_Base" explicitly in serve.py:get_model. Any future checkpoint added to HF_CHECKPOINTS that was trained on F5TTS_v1_Base would need to set the right config.
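
One way to keep the config choice explicit per checkpoint is to store it next to the registry entry. The sketch below is an assumption about how that could look. The real shape of HF_CHECKPOINTS in serve.py may differ, and the `model_cfg` key, the `future_v1` entry, and both helper functions are hypothetical.

```python
# Hypothetical registry shape; serve.py's HF_CHECKPOINTS may be structured differently.
HF_CHECKPOINTS = {
    "hvoss": {"model_cfg": "F5TTS_Base"},
    "future_v1": {"model_cfg": "F5TTS_v1_Base"},  # illustrative entry only
}

# Bring-your-own checkpoints are expected to use the F5TTS_Base config.
SERVER_DEFAULT_CFG = "F5TTS_Base"


def model_cfg_for(ckpt):
    """Return the config name to pass as F5TTS(model=...)."""
    return HF_CHECKPOINTS.get(ckpt, {}).get("model_cfg", SERVER_DEFAULT_CFG)


def f5tts_kwargs(ckpt, ckpt_file):
    """Build keyword arguments so the config is never left to the pip default."""
    return {"model": model_cfg_for(ckpt), "ckpt_file": ckpt_file}
```

The result can be passed as `F5TTS(**f5tts_kwargs("hvoss", "hvoss_german.safetensors"))`, which always names the config instead of relying on the library default.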

EPSS: configured correctly, not implemented

F5-TTS 1.1.5's built-in EPSS (Empirically Pruned Step Sampling, Interspeech 2025) only activates at NFE 7. At 8 it silently falls back to uniform linspace and you lose roughly 14% throughput. This server pins NFE 7 by default and warns if you override it.

EPSS is upstream. Anyone who pip install f5-tts already has it. The contribution here is configuring it correctly and documenting the footgun. If you set nfe_step=8 thinking "more is better," the server logs:

WARNING: nfe_step=8 disables EPSS schedule. EPSS = 7-step at zero quality loss.
         See README "EPSS: configured correctly, not implemented".
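
The override check can be sketched as follows. The function names are illustrative and the logic is a simplification of whatever serve.py does. The ratio helper only counts solver steps per request, which is one plausible reading of the roughly 14% figure above.

```python
import logging

log = logging.getLogger("f5-server")

EPSS_NFE = 7  # the step count this server pins by default


def check_nfe(nfe_step):
    """Return True if the pinned EPSS step count is kept; warn otherwise."""
    if nfe_step == EPSS_NFE:
        return True
    log.warning(
        "nfe_step=%d disables EPSS schedule. EPSS = 7-step at zero quality loss.",
        nfe_step,
    )
    return False


def extra_steps_ratio(nfe_step, baseline=EPSS_NFE):
    """Relative increase in solver steps per request versus the baseline."""
    return nfe_step / baseline - 1.0
```

Going from 7 to 8 steps gives 8 / 7 - 1 ≈ 0.143, so each request does about 14% more solver work.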

LAN / homelab deployment

With CORS enabled, you can expose this server on homelab.local:8000 and point a browser-based client at it from a separate device (an iPad, work laptop, or phone), for example through the settings page of a hosted app such as https://yourapp.com. The audio never leaves your LAN. This is the same pattern as chipmates/agoracosmica Local Mode: hosted UI, self-hosted inference, zero cloud TTS dependency.

For Tailscale users: the server binds 0.0.0.0 by default. Any device on your tailnet can reach it once you forward port 8000.

Architecture notes

CPU-bound at ~60%, GPU-bound at ~40%. Adding instances scales better than adding threads. Four instances × four cores beats one instance × sixteen threads (verified 5.3 vs 4.7 req/s on a 16-core box).
Voice tensor cache at startup: every entry in voices.json gets its reference audio preprocessed (silence trim, normalize) and converted to a tensor, then held in memory. Requests for a known voice, including the first, skip ~150 ms of disk and pydub work.
CUDA cache eviction (torch.cuda.empty_cache()) after each inference mitigates a ~200 MB/instance/day VRAM creep that the PyTorch caching allocator can produce across many varying-shape ODE solver runs. Few-ms overhead, eliminates the weekly-restart need.
No torch.compile: tested on F5's DiT, makes inference slower (130 ms to 230 ms). The ODE solver calls the transformer 14 times per request, so the compilation wrapper overhead exceeds the kernel fusion gains.
MPS support: stock f5-tts supports Apple Silicon via PyTorch MPS, though we have not benchmarked it. CPU-only inference is slower than real time (RTF > 1) in community reports, so it is not viable for live response.
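
The instances-versus-threads trade-off in the first note comes down to simple arithmetic. The helpers below are hypothetical and only restate the numbers given above.

```python
def threads_per_instance(cores, instances):
    """Split cores evenly across instances (the value for OMP_NUM_THREADS)."""
    if instances < 1 or cores < instances:
        raise ValueError("need at least one core per instance")
    return cores // instances


def relative_gain(new_rate, old_rate):
    """Fractional throughput gain of new_rate over old_rate."""
    return new_rate / old_rate - 1.0
```

On the 16-core box, `threads_per_instance(16, 4)` gives 4, which matches the `OMP_NUM_THREADS` default, and `relative_gain(5.3, 4.7)` is about 0.13, a throughput gain of roughly 13% for four instances over one.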

Roadmap

Crossfade fix as an upstream PR to SWivid/F5-TTS (language-agnostic, strict improvement).
SoMaJo chunker as an opt-in upstream PR (DE + EN today, needs a CJK fallback for upstream compatibility).
Multi-checkpoint switch at runtime (currently one checkpoint per instance; the API accepts ckpt, but model swap-out is not optimized).
Bearer-token auth (currently none; production users front the server with nginx + token).
Prometheus /metrics endpoint (production uses one, needs cleanup for OSS).

Contributing

Issues and pull requests are welcome. The two upstream contributions planned for SWivid/F5-TTS (crossfade, SoMaJo chunker) live in serve.py as inline shims for now. Happy to coordinate if you want to land them upstream first.

License

Code (this repo): MIT, see LICENSE.
Bundled voice references (refs/*.wav): MIT, synthesized specifically for this project (see Acknowledgements).
Model checkpoints auto-downloaded by docker-compose: licensed separately by their creators. All currently-supported checkpoints are CC-BY-NC-4.0. See docs/checkpoint-licensing.md.

Acknowledgements

SWivid et al. for the F5-TTS architecture and base implementation.
hvoss-techfak for the German F5-TTS checkpoint training.
aihpi for an alternative German F5-TTS checkpoint training.
Mozilla Common Voice for the open German speech dataset used in upstream German checkpoint training.
SoMaJo for the German + English sentence tokenizer.
The Interspeech 2025 EPSS paper for the step-sampling optimization that F5-TTS 1.1.5 ships.

The 10 reference voices in refs/ were synthesized iteratively using Qwen3-TTS, then curated to represent archetypal speaking styles. None are recordings of real people or derived from third-party voice samples.

Built and stress-tested in production for agoracosmica.org. Open-sourced for any team that needs an OpenAI-shaped voice cloning server without the cloud.
