
feat(voice): native voice agent support with Pipecat ASR/LLM/TTS #184

Open
itsarbit wants to merge 11 commits into main from feat/voice-agent-support

@itsarbit

Contributor

Summary

Adds a native voice agent type so arksim can evaluate voice agents through their real speech stack. v1 supports Pipecat: arksim voices the simulated user with local Kokoro TTS, feeds that audio into the agent's own ASR -> LLM -> TTS pipeline, captures the spoken reply, and transcribes it with local faster-whisper so the existing evaluator can score it. Tool calls are captured and tagged with a pipecat source.

What's included

  • New agent_type: voice with a framework discriminator (pipecat implemented, livekit reserved) and pluggable tts/stt providers, defaulting to local Kokoro + faster-whisper (no API keys, CI-reproducible).
  • arksim/speech/: TTS/STT provider abstractions and a registry, mirroring the arksim/llms/ pattern.
  • VoiceAgent dispatcher and factory wiring. User agent code stays framework-pure via a zero-arg agent_factory callable (for example ./agent.py:build).
  • arksim/integrations/pipecat.py: drives the Pipecat pipeline with VAD-bracketed 16 kHz audio injection and captures the agent's TTS audio on the TTS frame boundary.
  • ToolCallSource.PIPECAT (and LIVEKIT reserved), feeding the tool-call evaluation work (PLA-106).
  • Example at examples/integrations/pipecat-voice/, plus a key-free smoke_local.py.
  • Optional extras: arksim[voice] (Kokoro + faster-whisper). The agent's own pipecat services install separately.
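As an illustration of the agent_factory item above, here is a minimal sketch of how a reference such as ./agent.py:build could be resolved to a zero-arg callable. The helper name load_agent_factory and its error handling are assumptions made for this example; the PR's actual loader may differ.

```python
import importlib.util
from pathlib import Path
from typing import Any, Callable


def load_agent_factory(spec: str) -> Callable[[], Any]:
    """Resolve '<file.py>:<callable>' to a callable (hypothetical helper)."""
    # Split on the last colon so Windows drive letters in the path survive.
    path_str, sep, attr = spec.rpartition(":")
    if not sep or not path_str or not attr:
        raise ValueError(f"expected '<file.py>:<callable>', got {spec!r}")

    path = Path(path_str).resolve()
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load a module from {path}")

    # Execute the user's file as a standalone module.
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)

    factory = getattr(module, attr)
    if not callable(factory):
        raise TypeError(f"{attr!r} in {path} is not callable")
    return factory
```

Calling load_agent_factory("./agent.py:build")() would then return whatever the user's build function constructs, so the user's file never has to import anything from arksim.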

How it works

simulated user TEXT -> [arksim Kokoro TTS] -> AUDIO -> agent ASR -> LLM -> agent TTS -> AUDIO -> [arksim faster-whisper STT] -> TEXT -> evaluator

The agent's real STT, LLM, and TTS run; arksim supplies the simulated-user voice and transcribes the reply. The simulator and evaluator are unchanged.
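The flow above can be sketched as a single turn with stand-in components. Everything here is hypothetical and for illustration only: Audio, TTSProvider, STTProvider, run_turn, and the fake classes are not arksim's actual types, and the fakes just pass text through as bytes instead of producing real speech.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Audio:
    """Hypothetical PCM container; not an arksim type."""
    pcm: bytes
    sample_rate: int


class TTSProvider(Protocol):
    def synthesize(self, text: str) -> Audio: ...


class STTProvider(Protocol):
    def transcribe(self, audio: Audio) -> str: ...


def run_turn(
    user_text: str,
    tts: TTSProvider,
    drive_agent: Callable[[Audio], Audio],
    stt: STTProvider,
) -> str:
    """One simulated turn: text -> audio -> agent -> audio -> text."""
    user_audio = tts.synthesize(user_text)   # simulator-side voice
    reply_audio = drive_agent(user_audio)    # agent's own ASR/LLM/TTS
    return stt.transcribe(reply_audio)       # transcript for the evaluator


class FakeTTS:
    """Stand-in that encodes text as bytes instead of synthesizing speech."""
    def synthesize(self, text: str) -> Audio:
        return Audio(text.encode("utf-8"), 16000)


class FakeSTT:
    """Stand-in that decodes the bytes produced by FakeTTS."""
    def transcribe(self, audio: Audio) -> str:
        return audio.pcm.decode("utf-8")


def echo_agent(audio: Audio) -> Audio:
    """Stand-in agent that repeats what it heard."""
    heard = audio.pcm.decode("utf-8")
    return Audio(f"you said: {heard}".encode("utf-8"), audio.sample_rate)


reply = run_turn("hello", FakeTTS(), echo_agent, FakeSTT())
print(reply)  # you said: hello
```

Because the simulator and evaluator only ever see text on either side of run_turn, they can stay unchanged while the middle of the loop carries audio.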

How to test

Key-free, fully local (real audio loop, deterministic brain):

pip install -e ".[voice]"
python examples/integrations/pipecat-voice/smoke_local.py

Full LLM-backed run (needs OPENAI_API_KEY and, on Apple Silicon, pip install 'pipecat-ai[openai,mlx-whisper,silero]'):

arksim simulate-evaluate examples/integrations/pipecat-voice/config.yaml

Verified end to end against a real gpt-4o-mini Pipecat agent: the loop runs, transcripts are evaluated, and an HTML report is produced.

Notes and scope

  • The agent's real STT/LLM/TTS are exercised; arksim intentionally does not model the exchange as a text transport here.
  • pipecat and kokoro require Python 3.11+, so voice tests carry a 3.11+ skip marker (the 3.10 CI leg skips them). Driver and provider modules are import-guarded so collection never fails without the extras.
  • Deferred follow-ups: LiveKit driver, accent/noise/volume perturbation (a no-op _perturb() seam is reserved in the driver), and voice-specific metrics (latency, WER, interruption handling). This PR evaluates conversational content through the real speech stack; it is not yet full voice QA.
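The import-guard and skip-marker note above could look roughly like the following. This is a sketch under assumptions: optional_import and voice_skip_reason are hypothetical helpers, and the import names pipecat and kokoro are assumed to match the installed packages.

```python
import importlib
import sys
from types import ModuleType
from typing import Optional, Sequence

VOICE_MIN_PYTHON = (3, 11)


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the module if it can be imported, else None (never raises ImportError)."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


def voice_skip_reason(modules: Sequence[str] = ("pipecat", "kokoro")) -> Optional[str]:
    """Return a reason to skip voice tests, or None if they can run."""
    if sys.version_info < VOICE_MIN_PYTHON:
        return "voice tests require Python 3.11+"
    missing = [m for m in modules if optional_import(m) is None]
    if missing:
        return "missing optional extras: " + ", ".join(missing)
    return None


# In a test module, the reason could feed a standard pytest marker, e.g.:
#   pytestmark = pytest.mark.skipif(
#       voice_skip_reason() is not None, reason=voice_skip_reason() or ""
#   )
```

Guarding at import time and skipping at collection time are separate concerns: the first keeps modules importable without the extras, the second keeps the test report honest about what was not run.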

Test plan

  • pytest tests/unit/: 914 passed, 9 skipped locally. Includes voice config, dispatcher, factory, registry, resample, the echo driver loop, and a model-free VAD-gated driver test that guards the segmented-STT contract.
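For the resample coverage mentioned above, a minimal sketch of rate conversion to the 16 kHz the driver injects might look like this. The function resample_linear is hypothetical, the 24 kHz source rate in the usage line is an assumption, and plain linear interpolation with no anti-aliasing filter is a simplification; the PR's actual resampler may use a different method.

```python
from typing import List


def resample_linear(samples: List[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample mono audio by linear interpolation (no anti-aliasing filter)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples or src_rate == dst_rate:
        return list(samples)

    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1

    out: List[float] = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)      # left neighbour, clamped to the buffer
        hi = min(lo + 1, last)        # right neighbour, clamped to the buffer
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# Assumed example: six samples at 24 kHz become four samples at 16 kHz.
print(resample_linear([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 24000, 16000))
# [0.0, 1.5, 3.0, 4.5]
```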

@itsarbit requested a review from a team as a code owner on June 26, 2026 21:44
@codecov

codecov Bot commented Jun 26, 2026
