f5-server

OpenAI-compatible F5-TTS server. Production-ready, checkpoint-agnostic, GDPR-clean. 10 archetype voices included.

Runs in production at agoracosmica.org, a German non-profit philosophical-dialogue platform. The server code is MIT. Public F5-TTS checkpoints are CC-BY-NC-4.0, see Checkpoint licensing.

Quickstart

git clone https://github.com/chipmates/f5-server
cd f5-server
docker compose up -d
# First start downloads ~1.35 GB checkpoint from HuggingFace, then runs forever.

curl -X POST http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input":"Die Philosophie lehrt uns, dass wahre Weisheit nicht im Wissen liegt.","voice":"hyperion"}' \
  --output speech.mp3

That's the full setup. The OpenAI Python SDK works too, see Using with the OpenAI SDK.


Checkpoint licensing: read this first

The server code in this repo is MIT-licensed. The model checkpoints are not.

Every publicly released F5-TTS weight file we know of (the upstream SWivid/F5-TTS base and every German fine-tune including hvoss-techfak, aihpi, marduk-ra, cduvenhorst) is released under CC-BY-NC-4.0, which prohibits commercial use. This is an upstream constraint inherited from the Emilia training dataset, not something specific to this repo.

| Artifact | License | Commercial use? |
| --- | --- | --- |
| Code in this repo (serve.py, Dockerfile, etc.) | MIT | ✅ Yes |
| SWivid/F5-TTS (upstream base, English) | CC-BY-NC-4.0 | ❌ No |
| hvoss-techfak/F5-TTS-German (this server's default) | CC-BY-NC-4.0 | ❌ No |
| aihpi/F5-TTS-German | CC-BY-NC-4.0 | ❌ No |
| Your own checkpoint trained on MIT/Apache/CC0 data | (your choice) | ✅ Yes |

The server is checkpoint-agnostic. Point F5_DEFAULT_CKPT at any F5-TTS-compatible .safetensors file trained with the F5TTS_Base config (see The F5TTS_Base trap). The default hvoss config exists so docker-compose works out of the box for non-commercial use.

On checkpoint load, the server logs a license notice for any checkpoint in the registry marked NC. There is no UI banner, by design. Bring-your-own checkpoints do not trigger the notice.

For full detail and commercial-use guidance, see docs/checkpoint-licensing.md.


What you get

  • OpenAI /v1/audio/speech endpoint: drop in as a base URL on the OpenAI Python SDK and it works. JSON in, audio out.
  • 10 German archetype voices bundled in refs/ (~4 MB total). Male and female × elder, mentor, baritone, commanding, narrator, intellectual variants.
  • GDPR-clean voice cloning: POST /tts_clone uploads audio, generates output, deletes the upload from disk and module caches in a finally block. No persistence.
  • Docker Compose with first-run HuggingFace checkpoint auto-download. Run docker compose up -d, then curl.
  • Native F5 endpoints (/tts, /tts_clone) retained for curl-friendly testing.
  • CORS support, env-configurable (default * for hobby use, scope down for production).
  • Multi-instance scale-out via docker-compose.scale.yml, the same pattern that runs the agoracosmica.org fleet.
  • Healthchecks, structured logging, graceful checkpoint loading.

What's different from upstream SWivid/F5-TTS

Not a code fork. The server installs stock f5-tts==1.1.5 from pip and wraps it with the following additions:

| Addition | What it does | Upstream? |
| --- | --- | --- |
| SoMaJo sentence chunker (DE + EN) | Replaces the upstream regex chunker, which breaks on Dr., z. B., 3. Januar, Prof. and similar tokens, producing 1-2 char chunks that F5 renders as garbage. | Not yet (PR planned) |
| Equal-power crossfade | cos / sin curves replace the upstream linear fades, eliminating the ~3 dB seam dip audible between chunks. | Not yet (PR planned) |
| Voice tensor cache | Voice references preprocessed at startup and held in memory. Skips disk and pydub preprocessing on every request. | No |
| GDPR ephemeral /tts_clone | Uploaded audio deleted from disk and F5 module caches in a finally block. | No |
| OpenAI-shape adapter | POST /v1/audio/speech works with the OpenAI Python SDK. | No |
| Auto HF checkpoint download | First-run pulls the checkpoint from HuggingFace. No manual setup. | No |
| F5TTS_Base config-trap warning | The hvoss-techfak/F5-TTS-German checkpoint must be loaded with F5TTS_Base, not the pip default F5TTS_v1_Base. Wrong config produces gibberish. | No (silent footgun upstream) |
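
The ~3 dB figure in the crossfade row follows from the fade-curve arithmetic. The sketch below is a minimal pure-Python illustration, assuming the two chunks are uncorrelated at the seam so that their powers add. The function names are illustrative and are not taken from serve.py.

```python
import math


def linear_gains(t: float) -> tuple[float, float]:
    """Fade-out and fade-in gains of a linear crossfade at position t in [0, 1]."""
    return 1.0 - t, t


def equal_power_gains(t: float) -> tuple[float, float]:
    """Fade-out and fade-in gains on cos / sin curves (equal-power crossfade)."""
    return math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)


def combined_level_db(gains: tuple[float, float]) -> float:
    """Level of the mix relative to a single chunk.

    Assumes uncorrelated signals, so the combined power is g_out**2 + g_in**2.
    """
    g_out, g_in = gains
    return 10 * math.log10(g_out**2 + g_in**2)


def crossfade(tail: list[float], head: list[float]) -> list[float]:
    """Blend the last samples of one chunk into the first samples of the next."""
    if len(tail) != len(head):
        raise ValueError("tail and head must have the same length")
    n = len(tail)
    mixed = []
    for i in range(n):
        g_out, g_in = equal_power_gains((i + 0.5) / n)
        mixed.append(tail[i] * g_out + head[i] * g_in)
    return mixed


mid_linear = combined_level_db(linear_gains(0.5))      # roughly -3.01 dB
mid_equal = combined_level_db(equal_power_gains(0.5))  # roughly 0 dB
```

At the midpoint the linear fade sums two half-amplitude signals, which halves the power; the cos / sin pair keeps cos² + sin² = 1 across the whole fade.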

Using with the OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",  # No auth by default. Front with nginx + bearer for production.
)

with client.audio.speech.with_streaming_response.create(
    model="f5-tts",     # Ignored. The server uses F5_DEFAULT_CKPT.
    voice="hyperion",   # See /v1/audio/voices for the full list.
    input="Die Philosophie lehrt uns, dass wahre Weisheit nicht im Wissen liegt.",
    response_format="mp3",
    speed=1.0,
) as resp:
    resp.stream_to_file("out.mp3")

Works identically with the JS / TS SDK, LangChain, LiteLLM, and any OpenAI-compatible wrapper.

API

POST /v1/audio/speech: OpenAI-shape

{
  "model": "f5-tts",           // Ignored. For SDK compatibility.
  "input": "Text to synthesize",
  "voice": "hyperion",         // Cosmic name or legacy slug. See /v1/audio/voices.
  "response_format": "mp3",    // wav | mp3 | opus | flac | aac | pcm
  "speed": 1.0,                // 0.5 to 2.0
  "language": "de"             // de | en. Auto from voice if omitted.
}

Returns audio bytes with the right Content-Type. The X-Inference-Ms header reports inference time.
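
For callers that do not use the OpenAI SDK, the same request can be assembled with the Python standard library. This is a sketch under assumptions: build_speech_request and synthesize are hypothetical helper names, and the client-side checks simply mirror the ranges listed above rather than reproducing the server's own validation.

```python
import json
import urllib.request

ALLOWED_FORMATS = {"wav", "mp3", "opus", "flac", "aac", "pcm"}


def build_speech_request(text, voice="hyperion", response_format="mp3",
                         speed=1.0, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    if not text:
        raise ValueError("input text must not be empty")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    payload = {
        "model": "f5-tts",  # ignored by the server, kept for SDK compatibility
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request, out_path):
    """Send the request, save the audio, and return the X-Inference-Ms header (or None)."""
    with urllib.request.urlopen(request) as resp, open(out_path, "wb") as fh:
        fh.write(resp.read())
        return resp.headers.get("X-Inference-Ms")
```

Calling synthesize(build_speech_request("Hallo, das ist ein Test."), "out.mp3") against a running server would write the audio to out.mp3.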

GET /v1/audio/voices: list available voices

{
  "voices": [
    {
      "id": "lyra",
      "slug": "f1_warm_wise_v1",
      "name": "Lyra",
      "character": "warm-wise",
      "language": "de",
      "gender": "female",
      "rank": 1,
      "tags": ["female", "wise", "warm"]
    }
  ]
}

The voice field accepts either the cosmic id (e.g., lyra) or the legacy slug (e.g., f1_warm_wise_v1). Slug aliases match the chipmates/agoracosmica Local Mode voice catalog, so client code targeting either server works without changes.
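
One way to picture the id / slug aliasing is a single lookup table that maps both names to the cosmic id. The sketch below is illustrative only: build_alias_index and resolve_voice are hypothetical names, exact-match lookup is an assumption, and the real lookup in serve.py may differ.

```python
# Two entries in the shape returned by /v1/audio/voices (other fields omitted).
VOICES = [
    {"id": "lyra", "slug": "f1_warm_wise_v1"},
    {"id": "hyperion", "slug": "m1_warm_elder_v2"},
]


def build_alias_index(voices):
    """Map every cosmic id and every legacy slug to the cosmic id."""
    index = {}
    for voice in voices:
        index[voice["id"]] = voice["id"]
        slug = voice.get("slug")  # the slug is optional
        if slug:
            index[slug] = voice["id"]
    return index


def resolve_voice(name, index):
    """Return the cosmic id for a cosmic id or legacy slug."""
    try:
        return index[name]
    except KeyError:
        raise ValueError(f"unknown voice: {name!r}") from None


ALIASES = build_alias_index(VOICES)
```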

GET /tts: native F5 GET (curl-friendly)

Useful for terminal testing without crafting a JSON body. Returns WAV by default. Pass response_format=mp3 for compressed output.

curl -G http://localhost:8000/tts \
  --data-urlencode 'text=Hallo, das ist ein Test.' \
  --data-urlencode 'voice=hyperion' \
  --data-urlencode 'response_format=mp3' \
  --output test.mp3

POST /tts_clone: voice cloning from arbitrary audio (GDPR-clean)

curl -X POST http://localhost:8000/tts_clone \
  -F text="Text in the cloned voice" \
  -F ref_text="What is spoken in the reference audio" \
  -F ref_audio=@your_voice_sample.wav \
  --output clone.wav

Uploaded audio is deleted from disk and all F5-TTS module caches in a finally block after generation. No on-disk persistence of user audio. This is the GDPR path agoracosmica.org uses in production.
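
The delete-in-finally pattern can be sketched with the standard library. This is an illustration and not the code in serve.py: synthesize_with_ephemeral_upload is a hypothetical name, the synthesize callback stands in for the actual F5 call, and the step that clears F5 module caches is not shown.

```python
import os
import tempfile


def synthesize_with_ephemeral_upload(upload_bytes, synthesize):
    """Write the upload to a temp file, run synthesize(path), always delete the file."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(upload_bytes)
        return synthesize(path)
    finally:
        # Runs on success and on error alike, so the upload
        # never outlives the request that carried it.
        if os.path.exists(path):
            os.remove(path)
```

Placing the cleanup in finally rather than after the return covers the failure path too: an exception during generation still removes the reference audio.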

GET /health

{"status": "ok", "version": "0.1.0", "checkpoint": "hvoss", "voices": 10, "cached_voices": 10}
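
Because the first start downloads a checkpoint, a client may want to poll /health before sending traffic. A minimal sketch follows; reading cached_voices == voices as "every voice is preprocessed" is an assumption about the payload, and both function names are hypothetical.

```python
import json
import time
import urllib.request


def is_ready(health_body: str) -> bool:
    """Interpret a /health response body as ready or not ready."""
    payload = json.loads(health_body)
    return (
        payload.get("status") == "ok"
        and payload.get("voices", 0) > 0
        and payload.get("cached_voices") == payload.get("voices")
    )


def wait_until_ready(url="http://localhost:8000/health",
                     timeout_s=300.0, interval_s=2.0) -> bool:
    """Poll the health endpoint until it reports ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_ready(resp.read().decode("utf-8")):
                    return True
        except (OSError, ValueError):
            pass  # server not up yet, or body not valid JSON yet
        time.sleep(interval_s)
    return False
```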

Voices included

Ten German philosophical-archetype voices, created specifically for agoracosmica.org. Each voice has a cosmic name (the public API identifier) and a technical slug (a legacy alias matching the chipmates/agoracosmica Local Mode catalog). Reference audio and the spoken reference text live in voices.json.

Female voices, ranked from internal curation:

| Cosmic name | Character | Slug alias | Tags |
| --- | --- | --- | --- |
| lyra | warm-wise | f1_warm_wise_v1 | female, wise, warm |
| astra | commanding | f5_deep_commanding_v2 | female, commanding, deep |
| vega | wise | f1_warm_wise_v2 | female, wise, warm |
| andromeda | intellectual | f2_commanding_thinker_v1 | female, commanding, thinker |
| ceres | nurturing | f1_warm_mentor_v2 | female, mentor, warm |

Male voices:

| Cosmic name | Character | Slug alias | Tags |
| --- | --- | --- | --- |
| solaris | narrator | m3_rich_narrator_v3 | male, narrator, rich |
| umbra | baritone | m5_rich_baritone_v1 | male, baritone, rich |
| phoenix | elder | m1_warm_elder_v3 | male, elder, warm, deep |
| hyperion | elder (default) | m1_warm_elder_v2 | male, elder, warm, deep |
| corvus | intellectual | m3_intellectual_v2 | male, intellectual, thoughtful |

To add your own voice: drop your_voice.wav into refs/, add an entry to voices.json with the cosmic id, technical slug (optional), and the spoken reference text, then restart.

Configuration

All via environment variables (see .env.example):

| Variable | Default | Purpose |
| --- | --- | --- |
| F5_DEFAULT_CKPT | hvoss | Checkpoint name. Known: hvoss, aihpi. Or an absolute path to your own. |
| F5_DEFAULT_LANGUAGE | de | de or en. SoMaJo handles both. Other languages fall through to regex. |
| F5_SPEED | 1.0 | Engine-native pace (~180 WPM DE). Lower is slower. |
| CORS_ALLOW_ORIGINS | * | CORS origins. Comma-separated for production scoping. |
| OMP_NUM_THREADS | 4 | Threads per instance. Keep low when running multiple instances. |
| PORT | 8000 | Server port. |
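
To make the defaults concrete, here is a sketch of how these variables could be read and typed in Python. load_config and the returned key names are hypothetical, and the actual parsing in serve.py may differ; only the variable names and defaults come from the table above.

```python
import os

DEFAULTS = {
    "F5_DEFAULT_CKPT": "hvoss",
    "F5_DEFAULT_LANGUAGE": "de",
    "F5_SPEED": "1.0",
    "CORS_ALLOW_ORIGINS": "*",
    "OMP_NUM_THREADS": "4",
    "PORT": "8000",
}


def load_config(env=None):
    """Read the documented variables from env, falling back to the defaults."""
    env = os.environ if env is None else env
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    # Comma-separated origins become a list; "*" stays a single wildcard entry.
    origins = [o.strip() for o in raw["CORS_ALLOW_ORIGINS"].split(",") if o.strip()]
    return {
        "ckpt": raw["F5_DEFAULT_CKPT"],
        "language": raw["F5_DEFAULT_LANGUAGE"],
        "speed": float(raw["F5_SPEED"]),
        "cors_origins": origins,
        "omp_threads": int(raw["OMP_NUM_THREADS"]),
        "port": int(raw["PORT"]),
    }
```

For production scoping, CORS_ALLOW_ORIGINS="https://a.example, https://b.example" would yield a two-entry origin list instead of the wildcard.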

Production pedigree

This server runs the German voice path at agoracosmica.org, a non-profit philosophical-dialogue platform serving daily users since March 2026.

  • Capacity: load-tested at N=400 concurrent BBA-shape sessions with zero server errors.
  • Fleet pattern: dual-instance EU deployment, four F5 workers per host, sustained ~150 concurrent German sessions per shift.
  • GDPR profile: the server does not log request text. The /tts_clone endpoint deletes uploaded audio from disk and F5 module caches in a finally block. No on-disk persistence by default. The legal jurisdiction of inference is determined by where you host the server.

F5-TTS is the speed tier in this fleet. Internal blind comparisons on German philosophical content favor Qwen3-TTS for quality, so the agoracosmica.org primary path routes through Qwen and F5 handles overflow capacity and fast-fallback. F5 alone is excellent for conversational German and high-throughput batch synthesis. If you are building a brand-voice experience for production users, A/B against alternatives.

The F5TTS_Base trap

This deserves its own section because it bites everyone who tries pip install f5-tts with a German checkpoint.

# Correct: produces intelligible German speech
from f5_tts.api import F5TTS
model = F5TTS(model="F5TTS_Base", ckpt_file="hvoss_german.safetensors")

# Wrong: produces gibberish (garbled, fast, unintelligible)
model = F5TTS(ckpt_file="hvoss_german.safetensors")  # defaults to F5TTS_v1_Base

The hvoss-techfak/F5-TTS-German checkpoint was trained with the F5TTS_Base config, not the newer F5TTS_v1_Base. The two configs differ in pe_attn_head (1 vs null) and text_mask_padding (False vs True). Loading with the wrong config applies a positional encoding and text padding scheme the weights were not trained with, so the model loses track of position in the sequence and outputs garbled audio.

This server passes model="F5TTS_Base" explicitly in serve.py:get_model. Any future checkpoint added to HF_CHECKPOINTS that was trained on F5TTS_v1_Base would need to set the right config.
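
One way to keep this trap from recurring is to record the model config next to each checkpoint and refuse to fall back to the library default. The sketch below is hypothetical: CHECKPOINT_CONFIGS and config_for are illustrative names, and the real structure of HF_CHECKPOINTS in serve.py is not shown here.

```python
# Checkpoint name -> model config it was trained with.
# Only the hvoss pairing is stated in this README; add others once verified.
CHECKPOINT_CONFIGS = {
    "hvoss": "F5TTS_Base",
}


def config_for(ckpt_name: str) -> str:
    """Return the model config for a checkpoint, never an implicit default."""
    try:
        return CHECKPOINT_CONFIGS[ckpt_name]
    except KeyError:
        raise KeyError(
            f"no model config registered for checkpoint {ckpt_name!r}; "
            "refusing to fall back to the library default"
        ) from None


# Intended use, following the call shown above:
#   F5TTS(model=config_for("hvoss"), ckpt_file="hvoss_german.safetensors")
```

Failing loudly on an unregistered checkpoint trades a little convenience for never producing silent gibberish.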

EPSS: configured correctly, not implemented

F5-TTS 1.1.5 ships EPSS (Empirically Pruned Step Sampling, Interspeech 2025), but the pruned schedule only activates at NFE 7. At NFE 8 it silently falls back to a uniform linspace schedule and you lose roughly 14% throughput. This server pins NFE 7 by default and warns if you override it.

EPSS is upstream. Anyone who pip install f5-tts already has it. The contribution here is configuring it correctly and documenting the footgun. If you set nfe_step=8 thinking "more is better," the server logs:

WARNING: nfe_step=8 disables EPSS schedule. EPSS = 7-step at zero quality loss.
         See README "EPSS: configured correctly, not implemented".
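
A guard of this kind can be sketched in a few lines. check_nfe_step and the logger name are hypothetical, and the "anything other than 7" condition mirrors the description in this section rather than the upstream schedule table; the message text is the one quoted above.

```python
import logging

logger = logging.getLogger("f5-server")  # logger name is an assumption

EPSS_NFE_STEP = 7  # the step count this README pins by default


def check_nfe_step(nfe_step: int) -> bool:
    """Return True if the pinned EPSS step count is in use, else log a warning."""
    if nfe_step == EPSS_NFE_STEP:
        return True
    logger.warning(
        "nfe_step=%d disables EPSS schedule. EPSS = 7-step at zero quality loss.",
        nfe_step,
    )
    return False
```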

LAN / homelab deployment

With CORS enabled, you can expose this server at homelab.local:8000 and point a browser-based client at it from a separate device (an iPad, work laptop, or phone) through the settings of a hosted app such as https://yourapp.com. The audio never leaves your LAN. Same pattern as chipmates/agoracosmica Local Mode: hosted UI, self-hosted inference, zero cloud TTS dependency.

For Tailscale users: the server binds 0.0.0.0 by default. Any device on your tailnet can reach it once you forward port 8000.

Architecture notes

  • CPU-bound at ~60%, GPU-bound at ~40%. Adding instances scales better than adding threads. Four instances × four cores beats one instance × sixteen threads (verified 5.3 vs 4.7 req/s on a 16-core box).
  • Voice tensor cache at startup: every entry in voices.json gets its reference audio preprocessed (silence trim, normalize) and converted to a tensor, then held in memory. Every inference request for a known voice, including the first, skips ~150 ms of disk and pydub work.
  • CUDA cache eviction (torch.cuda.empty_cache()) after each inference mitigates a ~200 MB/instance/day VRAM creep that the PyTorch caching allocator can produce across many varying-shape ODE solver runs. Few-ms overhead, eliminates the weekly-restart need.
  • No torch.compile: tested on F5's DiT, makes inference slower (130 ms to 230 ms). The ODE solver calls the transformer 14 times per request, so the compilation wrapper overhead exceeds the kernel fusion gains.
  • MPS support: stock f5-tts supports Apple Silicon via PyTorch MPS, though we have not benchmarked it. CPU-only inference is slower than real time (RTF > 1) in community reports, so it is not viable for live response.
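
The CUDA cache eviction note above can be sketched as a thin wrapper around the inference call. This is an illustration, not the code in serve.py: run_inference is a hypothetical name, and the optional import only exists so the snippet also runs on machines without PyTorch.

```python
try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def run_inference(infer_fn, *args, **kwargs):
    """Call infer_fn and release cached CUDA blocks afterwards, even on error."""
    try:
        return infer_fn(*args, **kwargs)
    finally:
        if torch is not None and torch.cuda.is_available():
            # Returns unused cached blocks to the driver. It does not free
            # tensors that are still referenced.
            torch.cuda.empty_cache()
```

Running the eviction in finally keeps the allocator trimmed after failed requests as well as successful ones.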

Roadmap

  • Crossfade fix as an upstream PR to SWivid/F5-TTS (language-agnostic, strict improvement).
  • SoMaJo chunker as an opt-in upstream PR (DE + EN today, needs a CJK fallback for upstream compatibility).
  • Multi-checkpoint switch at runtime (currently one checkpoint per instance; the API accepts ckpt, but model swap-out is not optimized).
  • Bearer-token auth (currently none; production users front it with nginx + token).
  • Prometheus /metrics endpoint (production uses one, needs cleanup for OSS).

Contributing

Issues and pull requests are welcome. The two upstream contributions planned for SWivid/F5-TTS (crossfade, SoMaJo chunker) live in serve.py as inline shims for now. Happy to coordinate if you want to land them upstream first.

License

  • Code (this repo): MIT, see LICENSE.
  • Bundled voice references (refs/*.wav): MIT, created specifically for this project.
  • Model checkpoints auto-downloaded by docker-compose: licensed separately by their creators. All currently-supported checkpoints are CC-BY-NC-4.0. See docs/checkpoint-licensing.md.

Acknowledgements

  • SWivid et al. for the F5-TTS architecture and base implementation.
  • hvoss-techfak for the German F5-TTS checkpoint training.
  • aihpi for an alternative German F5-TTS checkpoint training.
  • Mozilla Common Voice for the open German speech dataset used in upstream German checkpoint training.
  • SoMaJo for the German + English sentence tokenizer.
  • The Interspeech 2025 EPSS paper for the step-sampling optimization that F5-TTS 1.1.5 ships.

The 10 reference voices in refs/ were synthesized iteratively using Qwen3-TTS, then curated to represent archetypal speaking styles. None are recordings of real people or derived from third-party voice samples.

Built and stress-tested in production for agoracosmica.org. Open-sourced for any team that needs an OpenAI-shaped voice cloning server without the cloud.
