feat(voice): native voice agent support with Pipecat ASR/LLM/TTS #184
Open
itsarbit wants to merge 11 commits into
Summary
Adds a native `voice` agent type so arksim can evaluate voice agents through their real speech stack. v1 supports Pipecat: arksim voices the simulated user with local Kokoro TTS, drives the agent's own ASR to LLM to TTS pipeline with audio, captures the spoken reply, and transcribes it with local faster-whisper for the existing evaluator. Tool calls are captured and tagged with a `pipecat` source.
What's included
- `agent_type: voice` with a `framework` discriminator (`pipecat` implemented, `livekit` reserved) and pluggable `tts`/`stt` providers, defaulting to local Kokoro + faster-whisper (no API keys, CI reproducible).
- `arksim/speech/`: TTS/STT provider abstractions and a registry, mirroring the `arksim/llms/` pattern.
- `VoiceAgent` dispatcher and factory wiring. User agent code stays framework pure via a zero-arg `agent_factory` callable (for example `./agent.py:build`).
- `arksim/integrations/pipecat.py`: drives the Pipecat pipeline with VAD bracketed 16 kHz audio injection and captures the agent's TTS audio on the TTS frame boundary.
- `ToolCallSource.PIPECAT` (and `LIVEKIT` reserved), feeding the tool-call evaluation work (PLA-106).
- `examples/integrations/pipecat-voice/`, plus a key-free `smoke_local.py`.
- `arksim[voice]` (Kokoro + faster-whisper). The agent's own pipecat services install separately.
How it works
The agent's real STT, LLM, and TTS run; arksim supplies the simulated-user voice and transcribes the reply. The simulator and evaluator are unchanged.
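To make the flow above concrete, here is a minimal sketch of one simulated turn. It is an illustration only, not the code in this PR: `TTSProvider`, `STTProvider`, `run_turn`, and the fakes are hypothetical names, and the agent is modelled as a plain "audio in, audio out" callable rather than a real Pipecat pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class TTSProvider(Protocol):
    """Hypothetical interface: turns simulated-user text into audio bytes."""

    def synthesize(self, text: str) -> bytes: ...


class STTProvider(Protocol):
    """Hypothetical interface: turns captured agent audio back into text."""

    def transcribe(self, audio: bytes) -> str: ...


# Assumption: the agent under test is reduced to "audio in, audio out".
VoiceAgentFn = Callable[[bytes], bytes]


@dataclass
class Turn:
    user_text: str
    agent_text: str


def run_turn(
    user_text: str, tts: TTSProvider, agent: VoiceAgentFn, stt: STTProvider
) -> Turn:
    """Voice the user, drive the agent with audio, transcribe the spoken reply."""
    user_audio = tts.synthesize(user_text)    # simulated-user voice
    agent_audio = agent(user_audio)           # agent's own STT, LLM, and TTS run here
    agent_text = stt.transcribe(agent_audio)  # text handed to the evaluator
    return Turn(user_text=user_text, agent_text=agent_text)


class FakeTTS:
    """Model-free stand-in: "audio" is just the UTF-8 bytes of the text."""

    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


class FakeSTT:
    """Model-free stand-in: decodes the bytes produced by FakeTTS."""

    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


def echo_agent(audio: bytes) -> bytes:
    """Deterministic agent that repeats what it heard."""
    return b"you said: " + audio


if __name__ == "__main__":
    turn = run_turn("hello", FakeTTS(), echo_agent, FakeSTT())
    print(turn.agent_text)
```

Because the simulator and evaluator only ever see `user_text` and `agent_text`, a loop shaped like this leaves both unchanged, which is the property the description relies on.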
How to test
Key-free, fully local (real audio loop, deterministic brain):
Full LLM-backed run (needs `OPENAI_API_KEY` and, on Apple Silicon, `pip install 'pipecat-ai[openai,mlx-whisper,silero]'`):
Verified end to end against a real gpt-4o-mini Pipecat agent: the loop runs, transcripts are evaluated, and an HTML report is produced.
Notes and scope
`_perturb()` seam is reserved in the driver), and voice-specific metrics (latency, WER, interruption handling). This PR evaluates conversational content through the real speech stack; it is not yet full voice QA.
Test plan
`pytest tests/unit/`: 914 passed, 9 skipped locally. Includes voice config, dispatcher, factory, registry, resample, the echo driver loop, and a model-free VAD-gated driver test that guards the segmented-STT contract.
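For readers unfamiliar with the segmented-STT contract, the sketch below shows the kind of property a model-free VAD-gated test can check: audio is only transcribed when it arrives between a speech-start and a speech-stop marker. Everything here is an assumption for illustration. The event classes are plain stand-ins (not Pipecat frame classes), the 20 ms frame size and 16-bit mono PCM layout are assumed, and `bracket` and `FakeSegmentedSTT` are hypothetical names that do not come from this PR.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union

SAMPLE_RATE = 16_000   # 16 kHz, as in the PR description
BYTES_PER_SAMPLE = 2   # assumption: 16-bit mono PCM
FRAME_MS = 20          # assumption: 20 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 640 bytes


@dataclass(frozen=True)
class SpeechStart:
    """Stand-in for a 'user started speaking' marker."""


@dataclass(frozen=True)
class SpeechStop:
    """Stand-in for a 'user stopped speaking' marker."""


@dataclass(frozen=True)
class AudioChunk:
    pcm: bytes


Event = Union[SpeechStart, AudioChunk, SpeechStop]


def bracket(pcm: bytes) -> Iterator[Event]:
    """Wrap raw PCM in start/stop markers, split into fixed-size frames."""
    yield SpeechStart()
    for offset in range(0, len(pcm), FRAME_BYTES):
        yield AudioChunk(pcm[offset:offset + FRAME_BYTES])
    yield SpeechStop()


@dataclass
class FakeSegmentedSTT:
    """Model-free STT stand-in: emits one segment per start/stop bracket."""

    segments: List[bytes] = field(default_factory=list)
    _buffer: bytearray = field(default_factory=bytearray)
    _speaking: bool = False

    def feed(self, event: Event) -> None:
        if isinstance(event, SpeechStart):
            self._speaking = True
            self._buffer.clear()
        elif isinstance(event, AudioChunk):
            if self._speaking:  # audio outside a bracket is dropped
                self._buffer.extend(event.pcm)
        elif isinstance(event, SpeechStop):
            if self._speaking:
                self.segments.append(bytes(self._buffer))
            self._speaking = False


if __name__ == "__main__":
    stt = FakeSegmentedSTT()
    for event in bracket(bytes(1500)):
        stt.feed(event)
    print(len(stt.segments))
```

A driver that injects audio without the surrounding markers would leave `segments` empty in this model, which is the failure mode a test of this shape is meant to catch.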