OpenAI-compatible API server for Microsoft's VibeVoice ASR model with intelligent batching and speaker diarization.
- OpenAI-compatible API - works with OpenAI's Python client
- Intelligent batching - 2-4x throughput improvement
- Speaker diarization - identifies who spoke when
- Long-form audio - supports files up to 60 minutes
- Structured response with timestamps and speaker information
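The intelligent batching above amounts to collecting incoming requests until a batch fills or a short timeout expires, then running them through the model together. A minimal sketch of that collection loop (illustrative only, not the server's actual code; `collect_batch` is a hypothetical helper whose parameters map to the `BATCH_SIZE`/`BATCH_TIMEOUT` settings described below):

```python
import queue
import time

def collect_batch(q: queue.Queue, batch_size: int = 4, batch_timeout: float = 0.1) -> list:
    """Collect up to batch_size items, waiting at most batch_timeout
    seconds after the first item arrives."""
    batch = [q.get()]  # block until at least one request is queued
    deadline = time.monotonic() + batch_timeout
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

# Demo: three queued requests, batch_size 4 -> one batch of 3
q = queue.Queue()
for name in ("a.wav", "b.wav", "c.wav"):
    q.put(name)
print(collect_batch(q))  # ['a.wav', 'b.wav', 'c.wav']
```

Raising `batch_size` improves throughput at the cost of memory; lowering `batch_timeout` reduces latency at the cost of smaller batches, which is the tradeoff the configuration section below exposes.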
# Clone repository
git clone https://github.com/voxcom-us/vibevoice-asr-api.git
cd vibevoice-asr-api
# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install VibeVoice
pip install git+https://github.com/microsoft/VibeVoice.git

# Copy the example configuration
cp .env.example .env
# Edit .env to customize settings

The server automatically loads .env if it exists.

# Start the server
python server.py

# Check server health
curl http://localhost:8000/health

Response:
{
"status": "healthy",
"model_loaded": true,
"queue_depth": 0,
"device": "cuda",
"device_info": {
"type": "cuda",
"memory_allocated_gb": 14.2,
"memory_reserved_gb": 15.0
}
}

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav"

python client_example.py audio.wav

from openai import OpenAI
client = OpenAI(
api_key="not-needed",
base_url="http://localhost:8000/v1"
)
with open("audio.wav", "rb") as f:
transcript = client.audio.transcriptions.create(
model="vibevoice-asr",
file=f
)
print(transcript.text)
print(transcript.segments)  # With timestamps and speaker IDs

{
"output": "assistant\n[{\"Start\":0,\"End\":4.03,\"Content\":\"[Silence]\"},{\"Start\":4.03,\"End\":6.22,\"Speaker\":0,\"Content\":\"Hello.\"}]\n",
"text": "Hello.",
"segments": [
{
"start_time": 0,
"end_time": 4.03,
"text": "[Silence]"
},
{
"start_time": 4.03,
"end_time": 6.22,
"speaker_id": 0,
"text": "Hello."
}
],
"generation_time": 1.40
}

Fields:
- output: Raw model output with full structured data
- text: Clean transcription text (filters out [Silence], [Music], [Noise])
- segments: Array of segments with timestamps and optional speaker IDs
- generation_time: Processing time in seconds
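The segments array above can be rendered client-side as a speaker-labeled transcript. A small sketch (the `format_transcript` helper is illustrative, not part of the API; the `NON_SPEECH` set mirrors the tags the server filters out of `text`):

```python
NON_SPEECH = {"[Silence]", "[Music]", "[Noise]"}

def format_transcript(segments: list[dict]) -> str:
    """Render segments as '[start-end] Speaker N: text', skipping non-speech tags."""
    lines = []
    for seg in segments:
        if seg["text"] in NON_SPEECH:
            continue
        speaker = seg.get("speaker_id", "?")
        lines.append(
            f"[{seg['start_time']:.2f}-{seg['end_time']:.2f}] Speaker {speaker}: {seg['text']}"
        )
    return "\n".join(lines)

# Demo with the example response above
segments = [
    {"start_time": 0, "end_time": 4.03, "text": "[Silence]"},
    {"start_time": 4.03, "end_time": 6.22, "speaker_id": 0, "text": "Hello."},
]
print(format_transcript(segments))  # [4.03-6.22] Speaker 0: Hello.
```

Note that `speaker_id` is optional per segment (non-speech segments omit it), so the helper falls back to `"?"` rather than assuming the key exists.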
All configuration is done via .env file. Copy .env.example to .env and customize:
# Model
MODEL_PATH=microsoft/VibeVoice-ASR
DEVICE=cuda # cuda, mps, cpu, or auto (auto-detects if not set)
ATTN_IMPLEMENTATION=sdpa # sdpa (recommended), eager, flash_attention_2
# Batch Processing
BATCH_SIZE=4 # Higher = better throughput, more memory
BATCH_TIMEOUT=0.1 # Lower = lower latency, worse throughput
# Security & Limits
MAX_FILE_SIZE_MB=100
REQUEST_TIMEOUT_SECONDS=300
CORS_ORIGINS=*
# Server
HOST=0.0.0.0
PORT=8000

# High throughput
export BATCH_SIZE=8
export BATCH_TIMEOUT=0.2

# Low latency
export BATCH_SIZE=1
export BATCH_TIMEOUT=0.01

# Low memory
export BATCH_SIZE=2
export MAX_FILE_SIZE_MB=50

Model not loading: Wait 30-60 seconds after startup, then check the /health endpoint.
Out of memory: Reduce BATCH_SIZE to 1-2 or set DEVICE=cpu
Slow performance: Use ATTN_IMPLEMENTATION=sdpa and ensure GPU is available (CUDA or MPS)
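Long files can still hit the request timeout under load, so transient failures (timeouts, OOM-induced 5xx responses) are worth retrying client-side. A generic retry wrapper with exponential backoff, sketched here as a hypothetical helper (not part of the server or the OpenAI client):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(); on failure, retry with exponential backoff (1x, 2x, 4x base_delay)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Demo: a flaky call that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("timeout")
    return "ok"

print(with_retries(flaky, attempts=3, base_delay=0.01))  # ok
```

In practice you would wrap the `client.audio.transcriptions.create(...)` call from the example above, and narrow the `except` clause to the error types you actually want to retry.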
Monitor /health endpoint for:
- Model loaded status
- Device type (cuda, mps, or cpu)
- Queue depth (scale when > 5)
- GPU memory utilization for CUDA (alert at > 90%)
Structured logs with request IDs:
2026-01-21 15:30:45 | INFO | [a1b2c3d4] New transcription request: audio.wav
2026-01-21 15:30:46 | INFO | Processing batch of 3 request(s)
2026-01-21 15:30:48 | INFO | [a1b2c3d4] Completed in 3.12s
See LICENSE file.
Built on Microsoft's VibeVoice ASR model.