Extract lyrics from any audio file, automatically.
WatESez is a FastAPI service that takes an audio file and gives you back the lyrics. You upload a song, and it handles everything behind the scenes: isolating the vocals, transcribing them to text, and saving the result so you can grab it anytime later.
It also generates an acoustic fingerprint for every track you upload. That means if you send the same song twice, it'll recognize it and skip the work. No wasted processing.
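
The skip-on-repeat behaviour can be pictured with a small sketch. This is not WatESez's actual code: `LyricsCache` and `fake_transcribe` are hypothetical names, the dictionary stands in for PostgreSQL, and SHA-256 stands in for Chromaprint (a real acoustic fingerprint is derived from the audio content, while a hash only matches byte-identical files).

```python
import hashlib


class LyricsCache:
    """Toy in-memory stand-in for the database, keyed by fingerprint."""

    def __init__(self):
        self._lyrics = {}
        self.jobs_run = 0  # counts how many times real work was done

    def fingerprint(self, audio_bytes: bytes) -> str:
        # ASSUMPTION: SHA-256 replaces Chromaprint purely for illustration.
        return hashlib.sha256(audio_bytes).hexdigest()

    def register(self, audio_bytes: bytes, transcribe) -> str:
        fp = self.fingerprint(audio_bytes)
        if fp not in self._lyrics:
            # Only unseen fingerprints trigger the expensive pipeline.
            self._lyrics[fp] = transcribe(audio_bytes)
            self.jobs_run += 1
        return fp

    def get(self, fp: str):
        return self._lyrics.get(fp)


def fake_transcribe(audio_bytes: bytes) -> str:
    # Hypothetical placeholder for vocal isolation + transcription.
    return "la la la"


cache = LyricsCache()
first = cache.register(b"fake audio bytes", fake_transcribe)
second = cache.register(b"fake audio bytes", fake_transcribe)
```

Both calls return the same fingerprint, and the transcription step runs only once.
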

    flowchart LR
        A[Upload Audio] --> B[Queue & Fingerprint]
        B --> C[Remove Noise / Isolate Vocals]
        C --> D[Transcribe with Whisper]
        D --> E[Store in Database]
        E --> F[Retrieve by Fingerprint]

Here's what happens step by step:
- You send an audio file to `POST /lyrics/register`. It gets queued up right away and you get a `202 Accepted` back instantly. No waiting around.
- An acoustic fingerprint is generated using Chromaprint. This is how WatESez knows if it's seen this song before.
- The instrumental gets stripped out with `audio-separator`, leaving only the vocals.
- `faster-whisper` listens to those isolated vocals and transcribes them into time-stamped lyrics, with language detection included.
- Everything gets saved to PostgreSQL. You can pull the lyrics back anytime with `GET /lyrics/{fingerprint}`.

A background worker handles the queue asynchronously, so the API stays snappy even when multiple files are processing at once.
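
A rough sketch of that pattern, assuming an `asyncio.Queue` drained by a single worker task. The names `process`, `worker`, and `main`, the queue size, and the file names are all hypothetical, and the real service may be organised differently.

```python
import asyncio


async def process(name: str) -> str:
    # Placeholder for the slow steps (vocal isolation, transcription).
    await asyncio.sleep(0.01)
    return f"lyrics for {name}"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    # Pull jobs forever; the caller cancels this task when the queue is drained.
    while True:
        name = await queue.get()
        try:
            results[name] = await process(name)
        finally:
            queue.task_done()


async def main() -> dict:
    queue = asyncio.Queue(maxsize=10)  # hypothetical size limit
    results = {}
    task = asyncio.create_task(worker(queue, results))

    # Enqueueing returns immediately, which is what lets an endpoint
    # answer right away while the work continues in the background.
    for name in ("a.mp3", "b.mp3", "c.mp3"):
        queue.put_nowait(name)

    await queue.join()  # wait until every queued job is marked done
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return results


results = asyncio.run(main())
```

Here `put_nowait` raises `asyncio.QueueFull` once the limit is reached, which is one way a bounded queue can push back on callers.
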
You'll need:

- Python 3.13 or newer
- A running PostgreSQL instance
- FFmpeg in your PATH

To install and run it locally:

- Clone the repo

      git clone https://github.com/ErwanHeschung/WatESez.git
      cd WatESez

- Install dependencies

  This project uses uv to manage packages.

      uv sync

- Set up your config

  Check `app/configs/settings.py` for the available settings. You'll want to point it at your PostgreSQL database at minimum.

- Run migrations

      alembic upgrade head

- Start the server

      uv run python -m app.main

  The API runs on `http://localhost:8100`. You can also check out the interactive docs at `/docs`.

Register an audio file:

    curl -X POST http://localhost:8100/lyrics/register \
      -F "file=@mysong.mp3"

Grab the lyrics once processing is done:

    curl http://localhost:8100/lyrics/{fingerprint}

WatESez is production-ready with a fully dockerized setup including PostgreSQL, FFmpeg, and Chromaprint for audio fingerprinting.

- Copy the example environment file:

      cp .env.example .env

- CUSTOMIZE FOR PRODUCTION:
  - Change all passwords in `.env` (especially `DB_PASSWORD`)
  - Consider using a secrets manager or Docker secrets in production
  - Adjust `WHISPER_MODEL` and `SEPARATE_MODEL` if needed for your use case
  - Set appropriate values for `AUDIO_QUEUE_MAX_SIZE` based on your hardware

- Start the stack:

      docker compose up -d

  This will:
  - Start PostgreSQL on port 5433 (5432 inside container)
  - Build and run the WatESez app on port 8100
  - Automatically run database migrations on startup
  - Apply healthchecks to ensure service readiness

- The API will be available at `http://localhost:8100`. Interactive docs at `/docs`.

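
To make the customization step more concrete, here is a minimal sketch of reading and validating those variables. The variable names come from the list above, but the `Settings` class, `load_settings`, and the validation rules are assumptions for illustration and may not match what `app/configs/settings.py` actually does.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    db_password: str
    whisper_model: str
    separate_model: str
    audio_queue_max_size: int


def _require(env, key: str) -> str:
    # Fail early instead of starting with a missing or empty value.
    value = env.get(key, "")
    if not value:
        raise ValueError(f"{key} must be set")
    return value


def load_settings(env=None) -> Settings:
    # Accept any mapping so the loader is easy to exercise without
    # touching the real process environment.
    env = os.environ if env is None else env
    max_size = int(_require(env, "AUDIO_QUEUE_MAX_SIZE"))
    if max_size <= 0:
        raise ValueError("AUDIO_QUEUE_MAX_SIZE must be a positive integer")
    return Settings(
        db_password=_require(env, "DB_PASSWORD"),
        whisper_model=_require(env, "WHISPER_MODEL"),
        separate_model=_require(env, "SEPARATE_MODEL"),
        audio_queue_max_size=max_size,
    )


# Example values are placeholders, not recommended settings.
settings = load_settings({
    "DB_PASSWORD": "change-me",
    "WHISPER_MODEL": "small",
    "SEPARATE_MODEL": "example-model",
    "AUDIO_QUEUE_MAX_SIZE": "8",
})
```

Rejecting an empty `DB_PASSWORD` at startup is one simple guard against deploying with a forgotten credential.
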
If you want to rebuild after code changes:

    docker compose up -d --build

To view logs:

    docker compose logs -f

To stop and remove containers:

    docker compose down

You may see these warnings in logs - they are safe to ignore:

- `locale: not found` / `no usable system locales were found` (PostgreSQL in Alpine container)
- `GPU device discovery failed: .../sys/class/drm/card0/device/vendor` (ONNX Runtime in container without GPU)

These do not affect functionality - the app will use CPU fallbacks automatically.

- Default credentials in `.env.example` are intentionally weak - you must change them for any non-trivial deployment
- The healthcheck endpoint (`/health`) ensures the app is ready before marking containers healthy
- Audio files persist between container restarts via the storage volume mount
- Model files persist via the `ai_models_cache` volume to avoid repeated downloads
- Security: The app runs in the container; `SERVICE_HOST=0.0.0.0` is correct for Docker (binds to all interfaces)

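
If a script needs to wait for the stack before sending requests, it could poll the `/health` endpoint. The sketch below assumes the endpoint answers with HTTP 200 once the app is ready, which the notes above do not spell out. `http_probe` and `wait_until_healthy` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    # ASSUMPTION: a 200 response means "ready".
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all count as "not ready".
        return False


def wait_until_healthy(url: str, timeout: float = 60.0,
                       interval: float = 1.0, probe=http_probe) -> bool:
    """Poll until the probe succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller might run `wait_until_healthy("http://localhost:8100/health")` right after `docker compose up -d` and only upload files once it returns `True`. The `probe` parameter is there so the loop can be exercised without a running server.
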
| Component | Technology |
|---|---|
| Framework | FastAPI |
| Database | PostgreSQL + SQLAlchemy |
| Migrations | Alembic |
| Noise Removal | audio-separator |
| Speech-to-Text | faster-whisper |
| Fingerprinting | Chromaprint |
| Package Manager | uv |

- Audio Fingerprint Similarity: Implement perceptual hashing and similarity detection to identify covers, remixes, and different versions of the same song beyond exact fingerprint matches
- Manual Lyrics Editing: Add `PATCH /lyrics/{fingerprint}` endpoint to allow users to correct transcription errors or add metadata like song titles and artists
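
As a starting point for the similarity idea, one common approach is to compare raw Chromaprint fingerprints (sequences of 32-bit integers) by counting differing bits. The sketch below is only an illustration of that idea, not a planned implementation: `similarity` is a hypothetical helper, it compares fingerprints position by position, and it does not try different time offsets as a robust matcher would.

```python
def similarity(fp_a: list[int], fp_b: list[int]) -> float:
    """Return the fraction of matching bits over the overlapping part (0.0 to 1.0)."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 0.0
    # XOR leaves a 1 wherever the two 32-bit values disagree.
    differing = sum(
        bin((a ^ b) & 0xFFFFFFFF).count("1")
        for a, b in zip(fp_a, fp_b)
    )
    return 1.0 - differing / (32 * n)


identical = similarity([0x1234, 0xABCD], [0x1234, 0xABCD])
close = similarity([0b1111], [0b0000])  # 4 of 32 bits differ
```

A threshold on this score could then decide whether two uploads count as versions of the same song. Choosing that threshold would need real data.
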
