OpenAI-compatible API server for Microsoft's VibeVoice ASR model with intelligent batching and speaker diarization.
- OpenAI-compatible API - works with OpenAI's Python client
- Intelligent batching - 2-4x throughput improvement
- Speaker diarization - identifies who spoke when
- Long-form audio - supports files up to 60 minutes
- Structured response with timestamps and speaker information
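The intelligent batching above amounts to collecting incoming requests until a batch fills or a short timeout expires, then running them through the model together. A minimal sketch of that collection loop (illustrative only, not the server's actual code; `collect_batch` is a hypothetical helper whose parameters map to the `BATCH_SIZE`/`BATCH_TIMEOUT` settings described below):

```python
import queue
import time

def collect_batch(q: queue.Queue, batch_size: int = 4, batch_timeout: float = 0.1) -> list:
    """Collect up to batch_size items, waiting at most batch_timeout
    seconds after the first item arrives."""
    batch = [q.get()]  # block until at least one request is queued
    deadline = time.monotonic() + batch_timeout
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

# Demo: three queued requests, batch_size 4 -> one batch of 3
q = queue.Queue()
for name in ("a.wav", "b.wav", "c.wav"):
    q.put(name)
print(collect_batch(q))  # ['a.wav', 'b.wav', 'c.wav']
```

Raising `batch_size` improves throughput at the cost of memory; lowering `batch_timeout` reduces latency at the cost of smaller batches, which is the tradeoff the configuration section below exposes.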
# Clone repository
git clone https://github.com/voxcom-us/vibevoice-asr-api.git
cd vibevoice-asr-api
# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install VibeVoice
pip install git+https://github.com/microsoft/VibeVoice.git

# Copy the example configuration
cp .env.example .env
# Edit .env to customize settings

The server automatically loads .env if it exists.

# Start the server
python server.py

# Check server health
curl http://localhost:8000/health

Response:
{
"status": "healthy",
"model_loaded": true,
"queue_depth": 0,
"device": "cuda",
"device_info": {
"type": "cuda",
"memory_allocated_gb": 14.2,
"memory_reserved_gb": 15.0
}
}

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav"

python client_example.py audio.wav

from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="http://localhost:8000/v1"
)
with open("audio.wav", "rb") as f:
transcript = client.audio.transcriptions.create(
model="vibevoice-asr",
file=f
)
print(transcript.text)
print(transcript.segments)  # With timestamps and speaker IDs

{
"output": "assistant\n[{\"Start\":0,\"End\":4.03,\"Content\":\"[Silence]\"},{\"Start\":4.03,\"End\":6.22,\"Speaker\":0,\"Content\":\"Hello.\"}]\n",
"text": "Hello.",
"segments": [
{
"start_time": 0,
"end_time": 4.03,
"text": "[Silence]"
},
{
"start_time": 4.03,
"end_time": 6.22,
"speaker_id": 0,
"text": "Hello."
}
],
"generation_time": 1.40
}

Fields:
- output: Raw model output with full structured data
- text: Clean transcription text (filters out [Silence], [Music], [Noise])
- segments: Array of segments with timestamps and optional speaker IDs
- generation_time: Processing time in seconds
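The segments array above can be rendered client-side as a speaker-labeled transcript. A small sketch (the `format_transcript` helper is illustrative, not part of the API; the `NON_SPEECH` set mirrors the tags the server filters out of `text`):

```python
NON_SPEECH = {"[Silence]", "[Music]", "[Noise]"}

def format_transcript(segments: list[dict]) -> str:
    """Render segments as '[start-end] Speaker N: text', skipping non-speech tags."""
    lines = []
    for seg in segments:
        if seg["text"] in NON_SPEECH:
            continue
        speaker = seg.get("speaker_id", "?")
        lines.append(
            f"[{seg['start_time']:.2f}-{seg['end_time']:.2f}] Speaker {speaker}: {seg['text']}"
        )
    return "\n".join(lines)

# Demo with the example response above
segments = [
    {"start_time": 0, "end_time": 4.03, "text": "[Silence]"},
    {"start_time": 4.03, "end_time": 6.22, "speaker_id": 0, "text": "Hello."},
]
print(format_transcript(segments))  # [4.03-6.22] Speaker 0: Hello.
```

Note that `speaker_id` is optional per segment (non-speech segments omit it), so the helper falls back to `"?"` rather than assuming the key exists.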
All configuration is done via .env file. Copy .env.example to .env and customize:
# Model
MODEL_PATH=microsoft/VibeVoice-ASR
DEVICE=cuda # cuda, mps, cpu, or auto (auto-detects if not set)
ATTN_IMPLEMENTATION=sdpa # sdpa (recommended), eager, flash_attention_2
# Batch Processing
BATCH_SIZE=4 # Higher = better throughput, more memory
BATCH_TIMEOUT=0.1 # Lower = lower latency, worse throughput
# Security & Limits
MAX_FILE_SIZE_MB=100
REQUEST_TIMEOUT_SECONDS=300
CORS_ORIGINS=*
# Server
HOST=0.0.0.0
PORT=8000

# High throughput
export BATCH_SIZE=8
export BATCH_TIMEOUT=0.2

# Low latency
export BATCH_SIZE=1
export BATCH_TIMEOUT=0.01

# Low memory
export BATCH_SIZE=2
export MAX_FILE_SIZE_MB=50

Model not loading: Wait 30-60 seconds after startup, then check the /health endpoint.
Out of memory: Reduce BATCH_SIZE to 1-2 or set DEVICE=cpu
Slow performance: Use ATTN_IMPLEMENTATION=sdpa and ensure GPU is available (CUDA or MPS)
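Long files can still hit the request timeout under load, so transient failures (timeouts, OOM-induced 5xx responses) are worth retrying client-side. A generic retry wrapper with exponential backoff, sketched here as a hypothetical helper (not part of the server or the OpenAI client):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(); on failure, retry with exponential backoff (1x, 2x, 4x base_delay)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Demo: a flaky call that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("timeout")
    return "ok"

print(with_retries(flaky, attempts=3, base_delay=0.01))  # ok
```

In practice you would wrap the `client.audio.transcriptions.create(...)` call from the example above, and narrow the `except` clause to the error types you actually want to retry.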
Monitor /health endpoint for:
- Model loaded status
- Device type (cuda, mps, or cpu)
- Queue depth (scale when > 5)
- GPU memory utilization for CUDA (alert at > 90%)
Structured logs with request IDs:
2026-01-21 15:30:45 | INFO | [a1b2c3d4] New transcription request: audio.wav
2026-01-21 15:30:46 | INFO | Processing batch of 3 request(s)
2026-01-21 15:30:48 | INFO | [a1b2c3d4] Completed in 3.12s
See LICENSE file.
Built on Microsoft's VibeVoice ASR model.