Real-time medical speech translation running entirely on your Mac. No cloud, no API keys, no data leaves your device.
Speak English into your microphone, hear the Spanish translation spoken back. You get a written transcript of both sides for free.
Three open-source models run sequentially on Apple Silicon through MLX:
| Stage | Model | Size | What it does |
|---|---|---|---|
| Speech-to-text | Voxtral Realtime 4B | 2.4 GB | Transcribes English speech |
| Translation | TranslateGemma 4B | 2.3 GB | Translates English to Spanish |
| Text-to-speech | Kokoro 82M | 330 MB | Speaks the Spanish translation |
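
The three stages in the table form a simple chain, each consuming the previous stage's output. A minimal sketch of that control flow follows. The stage functions (`fake_stt`, `fake_translate`, `fake_tts`) and the `run_turn` helper are hypothetical stand-ins so the example runs without any models; they are not the project's actual functions or any model API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnResult:
    english: str  # transcript of what was said
    spanish: str  # translated text
    audio: bytes  # synthesized speech for the translation


def run_turn(
    utterance: bytes,
    transcribe: Callable[[bytes], str],
    translate_text: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> TurnResult:
    """Run the three stages one after another on a single utterance."""
    english = transcribe(utterance)
    spanish = translate_text(english)
    audio = synthesize(spanish)
    return TurnResult(english, spanish, audio)


# Stand-in stages (assumptions for illustration only).
def fake_stt(audio: bytes) -> str:
    return "Where does it hurt?"


def fake_translate(text: str) -> str:
    return {"Where does it hurt?": "¿Dónde le duele?"}.get(text, text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


result = run_turn(b"\x00\x01", fake_stt, fake_translate, fake_tts)
print(result.english, "->", result.spanish)
```

Because the stages run sequentially, the transcript of both sides falls out of the intermediate values (`english` and `spanish`) at no extra cost.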
Turn detection uses Silero VAD (2 MB ONNX model on CPU) to detect when you stop talking and trigger the pipeline.
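
The turn-detection idea can be illustrated without the model itself. The sketch below assumes the VAD yields one speech probability per audio frame and treats the turn as finished after a run of consecutive non-speech frames. The threshold (0.5), the silence window (3 frames), and the function name `find_end_of_turn` are illustrative assumptions, not values taken from this project or from Silero VAD.

```python
from typing import Iterable, Optional


def find_end_of_turn(
    speech_probs: Iterable[float],
    threshold: float = 0.5,
    min_silence_frames: int = 3,
) -> Optional[int]:
    """Return the index of the frame at which the turn is considered over.

    The turn ends once speech has been heard and is then followed by
    `min_silence_frames` consecutive frames below `threshold`.
    Returns None if that never happens.
    """
    heard_speech = False
    silent_run = 0
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            # Speech frame: remember we heard something, reset the silence counter.
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            # Silence after speech: count toward the end-of-turn window.
            silent_run += 1
            if silent_run >= min_silence_frames:
                return i
    return None


probs = [0.1, 0.2, 0.9, 0.8, 0.95, 0.3, 0.7, 0.2, 0.1, 0.05, 0.1]
end = find_end_of_turn(probs)
print(end)  # 9: the brief dip at index 5 is too short to end the turn
```

Requiring several silent frames in a row keeps short pauses mid-sentence from triggering the pipeline early, at the cost of a small delay before translation starts.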
Total memory footprint is ~8 GB. Runs on any Apple Silicon Mac with 16 GB+ unified memory.
Requires macOS on Apple Silicon (M1/M2/M3/M4) and Python 3.11+.
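
A quick way to confirm these requirements before installing anything is a small standard-library check. This is a sketch, not part of the project; `check_platform` is a hypothetical helper, and it does not verify the amount of unified memory.

```python
import platform
import sys


def check_platform(system: str, machine: str, version: tuple) -> list:
    """Return a list of human-readable problems; an empty list means supported."""
    problems = []
    if system != "Darwin":
        problems.append(f"macOS required, found {system}")
    if machine != "arm64":
        problems.append(f"Apple Silicon (arm64) required, found {machine}")
    if tuple(version[:2]) < (3, 11):
        problems.append(f"Python 3.11+ required, found {version[0]}.{version[1]}")
    return problems


issues = check_platform(platform.system(), platform.machine(), sys.version_info)
for issue in issues:
    print("unsupported:", issue)
```

Note that a Python interpreter running under Rosetta reports `x86_64` even on an Apple Silicon Mac, so a failure on the second check may point to the interpreter build rather than the hardware.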

    # Install espeak (needed for Spanish text-to-speech phonemization)
    brew install espeak-ng
    # Install spaCy English model
    uv pip install en_core_web_sm@https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl
    # Install dependencies
    uv sync

Models download automatically from Hugging Face on first run (~5 GB total).
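
If the first run fails, it can help to confirm that the two manually installed prerequisites are visible from the environment. The following is a rough preflight sketch using only the standard library; `missing_prerequisites` is a hypothetical helper, and it assumes the `espeak-ng` executable being on `PATH` is a reasonable proxy for a working install.

```python
import importlib.util
import shutil


def missing_prerequisites() -> list:
    """Return descriptions of prerequisites that could not be found."""
    missing = []
    # Homebrew installs an `espeak-ng` executable; look for it on PATH.
    if shutil.which("espeak-ng") is None:
        missing.append("espeak-ng (brew install espeak-ng)")
    # The spaCy model is an ordinary importable package once installed.
    if importlib.util.find_spec("en_core_web_sm") is None:
        missing.append("en_core_web_sm (spaCy English model)")
    return missing


for item in missing_prerequisites():
    print("missing:", item)
```

Run it inside the project environment (for example with `uv run python`), since the spaCy model is installed there rather than system-wide.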
Live mic (default): speak English, hear Spanish.

    uv run python pipeline.py

From file: translate an audio file.

    uv run python pipeline.py --file recording.wav

Press Ctrl+C to stop.
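
The two modes above amount to one optional flag. As an illustration only, a command-line interface of that shape could be parsed as follows; this is an assumption about structure, not the actual contents of `pipeline.py`, and `parse_args` and `describe_mode` are hypothetical names.

```python
import argparse
from typing import Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the single optional --file flag; no flag means live microphone."""
    parser = argparse.ArgumentParser(prog="pipeline.py")
    parser.add_argument(
        "--file",
        default=None,
        help="translate this audio file instead of listening to the microphone",
    )
    return parser.parse_args(argv)


def describe_mode(args: argparse.Namespace) -> str:
    """Summarize which input source the parsed arguments select."""
    return f"file:{args.file}" if args.file else "live-mic"


print(describe_mode(parse_args([])))
print(describe_mode(parse_args(["--file", "recording.wav"])))
```

In a live-mic loop, Ctrl+C raises `KeyboardInterrupt` in Python, so a long-running loop would typically catch that exception to shut down cleanly.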
The pipeline currently translates English to Spanish. To change the target language:
- Update the translation prompt in `translate()` (change `target_lang_code` and the instruction text)
- Update the TTS voice and language code in the model constants (`TTS_VOICE`, `lang_code`)
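
Since the translation and TTS settings have to change together, one option is to group them per language. The sketch below is hypothetical: the `TargetLanguage` class, the `build_instruction` helper, and the instruction wording are assumptions, not the project's code or TranslateGemma's actual prompt format. The Kokoro codes and voice names shown (`e`/`ef_dora`, `f`/`ff_siwis`) are believed to match Kokoro's published voice list but should be verified against it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetLanguage:
    name: str              # used in the instruction text
    target_lang_code: str  # language code given to the translation prompt
    lang_code: str         # Kokoro language code (assumed values below)
    tts_voice: str         # Kokoro voice name (assumed values below)


SPANISH = TargetLanguage("Spanish", "es", "e", "ef_dora")
FRENCH = TargetLanguage("French", "fr", "f", "ff_siwis")


def build_instruction(lang: TargetLanguage, text: str) -> str:
    """Hypothetical instruction text; the real prompt lives in translate()."""
    return (
        f"Translate the following English text to {lang.name} "
        f"({lang.target_lang_code}):\n{text}"
    )


print(build_instruction(FRENCH, "Where does it hurt?"))
```

Keeping the four values in one place makes it harder to switch the translation target while leaving the TTS voice on the old language.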
Kokoro supports English, Spanish, French, Hindi, Italian, Portuguese, Japanese, and Mandarin. TranslateGemma supports many more language pairs.
Limitations:
- Runs on Apple Silicon only (MLX requirement)
- One speaker at a time (no overlapping speech)
- English to Spanish only (easily configurable)
- Not a medical device
License: MIT