voxcom-us/vibevoice-asr-api

VibeVoice ASR API

OpenAI-compatible API server for Microsoft's VibeVoice ASR model with intelligent batching and speaker diarization.

Features

  • OpenAI-compatible API - works with OpenAI's Python client
  • Intelligent batching - 2-4x throughput improvement
  • Speaker diarization - identifies who spoke when
  • Supports audio files up to 60 minutes long
  • Structured response with timestamps and speaker information

Setup

1. Install

# Clone repository
git clone https://github.com/voxcom-us/vibevoice-asr-api.git
cd vibevoice-asr-api

# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install VibeVoice
pip install git+https://github.com/microsoft/VibeVoice.git

2. Configure (Optional)

cp .env.example .env
# Edit .env to customize settings

The server automatically loads .env if it exists.

3. Run Server

python server.py

Usage

Health Check

curl http://localhost:8000/health

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "queue_depth": 0,
  "device": "cuda",
  "device_info": {
    "type": "cuda",
    "memory_allocated_gb": 14.2,
    "memory_reserved_gb": 15.0
  }
}
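Because the model takes a while to load, a deployment script may want to wait for `model_loaded` before sending audio. A minimal sketch using only the Python standard library; the helper names `fetch_health` and `wait_until_ready`, the retry count, and the delay are illustrative assumptions, not part of this repository:

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:8000"):
    """GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(fetch=fetch_health, attempts=30, delay=2.0):
    """Poll until the health payload reports model_loaded, or give up.

    `fetch` is injectable so the loop can be exercised without a server.
    """
    for _ in range(attempts):
        try:
            if fetch().get("model_loaded"):
                return True
        except OSError:
            # Covers connection refused and urllib's URLError:
            # the server is not accepting requests yet, so retry.
            pass
        time.sleep(delay)
    return False
```

Calling `wait_until_ready()` with the defaults polls for about a minute before returning `False`.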

Transcribe with curl

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.wav"

Transcribe with Python Client

python client_example.py audio.wav

Using OpenAI Client

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8000/v1"
)

with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="vibevoice-asr",
        file=f
    )

print(transcript.text)
print(transcript.segments)  # With timestamps and speaker IDs

Response Format

{
  "output": "assistant\n[{\"Start\":0,\"End\":4.03,\"Content\":\"[Silence]\"},{\"Start\":4.03,\"End\":6.22,\"Speaker\":0,\"Content\":\"Hello.\"}]\n",
  "text": "Hello.",
  "segments": [
    {
      "start_time": 0,
      "end_time": 4.03,
      "text": "[Silence]"
    },
    {
      "start_time": 4.03,
      "end_time": 6.22,
      "speaker_id": 0,
      "text": "Hello."
    }
  ],
  "generation_time": 1.40
}

Fields:

  • output: Raw model output with full structured data
  • text: Clean transcription text (filters out [Silence], [Music], [Noise])
  • segments: Array of segments with timestamps and optional speaker IDs
  • generation_time: Processing time in seconds
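The relationship between `output`, `segments`, and `text` can be reproduced on the client side. The sketch below is inferred from the example response above and is not the server's actual parser; the function name `parse_output` is hypothetical, and joining speech segments with a single space is an assumption:

```python
import json

# Markers the README says are filtered out of the clean text.
NON_SPEECH = {"[Silence]", "[Music]", "[Noise]"}


def parse_output(raw):
    """Turn the raw `output` string into (text, segments)."""
    # The JSON array starts at the first '[' (after the "assistant" prefix).
    items = json.loads(raw[raw.index("["):])
    segments = []
    for item in items:
        seg = {"start_time": item["Start"], "end_time": item["End"]}
        if "Speaker" in item:  # non-speech segments carry no speaker
            seg["speaker_id"] = item["Speaker"]
        seg["text"] = item["Content"]
        segments.append(seg)
    # Clean text keeps only spoken content.
    text = " ".join(s["text"] for s in segments if s["text"] not in NON_SPEECH)
    return text, segments
```

Applied to the `output` string in the example response, this yields the same `text` and `segments` values shown there.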

Configuration

All configuration is done through environment variables, which the server loads from .env if the file exists. Copy .env.example to .env and customize:

# Model
MODEL_PATH=microsoft/VibeVoice-ASR
DEVICE=cuda                    # cuda, mps, cpu, or auto (auto-detects if not set)
ATTN_IMPLEMENTATION=sdpa       # sdpa (recommended), eager, flash_attention_2

# Batch Processing
BATCH_SIZE=4                   # Higher = better throughput, more memory
BATCH_TIMEOUT=0.1              # Lower = lower latency, worse throughput

# Security & Limits
MAX_FILE_SIZE_MB=100
REQUEST_TIMEOUT_SECONDS=300
CORS_ORIGINS=*

# Server
HOST=0.0.0.0
PORT=8000
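Environment values arrive as strings, so numeric settings need explicit conversion. A minimal sketch of reading a few of these variables into typed settings; the `Settings` class and `load_settings` function are hypothetical, and the fallback values simply mirror the example above rather than confirmed server defaults:

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    model_path: str
    device: str
    batch_size: int
    batch_timeout: float
    max_file_size_mb: int
    port: int


def load_settings(env=os.environ):
    """Build Settings from an environment-like mapping of strings."""
    return Settings(
        model_path=env.get("MODEL_PATH", "microsoft/VibeVoice-ASR"),
        device=env.get("DEVICE", "auto"),  # assumed fallback when unset
        batch_size=int(env.get("BATCH_SIZE", "4")),
        batch_timeout=float(env.get("BATCH_TIMEOUT", "0.1")),
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "100")),
        port=int(env.get("PORT", "8000")),
    )
```

Passing a plain dict, for example `load_settings({"BATCH_SIZE": "8"})`, makes the parsing easy to check without touching the real environment.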

Performance Tuning

High Throughput

export BATCH_SIZE=8
export BATCH_TIMEOUT=0.2

Low Latency

export BATCH_SIZE=1
export BATCH_TIMEOUT=0.01

Memory Constrained

export BATCH_SIZE=2
export MAX_FILE_SIZE_MB=50
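To see why these two settings trade latency against throughput, it helps to picture how a batch is assembled. The following is a sketch of one common size-or-timeout batching pattern, not the server's actual implementation; `collect_batch` is a hypothetical name:

```python
import asyncio


async def collect_batch(queue, batch_size=4, batch_timeout=0.1):
    """Wait for one request, then gather more until the batch is full
    or batch_timeout seconds have passed since the first one arrived."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + batch_timeout
    while len(batch) < batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # timeout expired with a partial batch
    return batch
```

Under this pattern, a larger `BATCH_SIZE` lets more requests share one model call, while a larger `BATCH_TIMEOUT` makes a lone request wait longer before it is processed. With `BATCH_SIZE=1` the loop never waits at all.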

Troubleshooting

Model not loading: Wait 30-60 seconds after startup, then check that the /health endpoint reports "model_loaded": true

Out of memory: Reduce BATCH_SIZE to 1-2 or set DEVICE=cpu

Slow performance: Use ATTN_IMPLEMENTATION=sdpa and ensure a GPU is available (CUDA or MPS)

Monitoring

Monitor the /health endpoint for:

  • Model loaded status
  • Device type (cuda, mps, or cpu)
  • Queue depth (scale when > 5)
  • GPU memory utilization for CUDA (alert at > 90%)
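The thresholds above can be checked against a decoded /health payload. This is a sketch under two assumptions: the health response does not report total GPU memory, so the caller supplies it as `gpu_total_gb`, and utilization is computed from `memory_reserved_gb`. The function name `health_alerts` is hypothetical:

```python
def health_alerts(health, gpu_total_gb=None, max_queue_depth=5, max_gpu_fraction=0.9):
    """Return a list of alert strings for a decoded /health payload."""
    alerts = []
    if not health.get("model_loaded"):
        alerts.append("model not loaded")
    if health.get("queue_depth", 0) > max_queue_depth:
        alerts.append("queue depth above threshold")
    info = health.get("device_info", {})
    # GPU check only applies to CUDA and only when total memory is known.
    if gpu_total_gb and info.get("type") == "cuda":
        used = info.get("memory_reserved_gb", 0.0)
        if used / gpu_total_gb > max_gpu_fraction:
            alerts.append("GPU memory above threshold")
    return alerts
```

An empty list means every check passed; anything else can be forwarded to whatever alerting system is in use.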

Structured logs with request IDs:

2026-01-21 15:30:45 | INFO     | [a1b2c3d4] New transcription request: audio.wav
2026-01-21 15:30:46 | INFO     | Processing batch of 3 request(s)
2026-01-21 15:30:48 | INFO     | [a1b2c3d4] Completed in 3.12s
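Log lines in this shape are easy to split into fields for aggregation. A sketch with a regular expression inferred from the three sample lines above; the pattern, the assumption that request IDs are lowercase hex, and the name `parse_log_line` are illustrative:

```python
import re

# timestamp | LEVEL (padded) | optional [request_id] message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \| "
    r"(?P<level>\w+)\s*\| "
    r"(?:\[(?P<request_id>[0-9a-f]+)\] )?"
    r"(?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None
```

Lines without a request ID, such as the batch message, parse with `request_id` set to `None`.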

License

See LICENSE file.

Acknowledgments

Built on Microsoft's VibeVoice ASR model.
