devdrop-gc/voice-agent

🤖 Real-Time Voice Agent (LiveKit)

A real-time voice agent that joins a LiveKit room and interacts with users via audio. The agent listens to the user's speech, converts it to text, and responds back with synthesized audio.

✨ Features

  • 🎙️ Speech-to-Text using Deepgram (nova-3 model)
  • 🔊 Text-to-Speech using Cartesia
  • 🧠 Voice Activity Detection (VAD) using Silero VAD (local model, no API key required)
  • 🚫 No overlap — agent never speaks while user is speaking; stops immediately if interrupted
  • ⏱️ Silence handling — plays a reminder if no user speech for 20+ seconds

⚙️ How It Works

  1. Agent joins a LiveKit room and greets the user
  2. Silero VAD continuously monitors audio to detect when the user is speaking
  3. When the user finishes speaking, Deepgram STT transcribes the audio to text
  4. The agent responds with "You said: <text>" converted to audio via Cartesia TTS
  5. If the user speaks while the agent is talking, LiveKit's built-in interruption handling stops the agent immediately
  6. If no speech is detected for 20 seconds, the agent plays a reminder prompt
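
As a rough illustration of steps 1, 3, and 4, the sketch below models the echo behaviour in plain Python, with no LiveKit, Deepgram, or Cartesia calls. `EchoAgent`, `GREETING`, and the `speak` callback are hypothetical names standing in for the real session and TTS output. The greeting wording and the choice to ignore empty transcripts are assumptions, not taken from the repository.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed greeting text; the README does not quote the real one.
GREETING = "Hello! Say something and I will repeat it back."


@dataclass
class EchoAgent:
    # Stand-in for the TTS output: receives the text the agent would speak.
    speak: Callable[[str], None]

    def on_join(self) -> None:
        """Step 1: greet the user after joining the room."""
        self.speak(GREETING)

    def on_user_turn(self, transcript: str) -> None:
        """Steps 3-4: echo the final transcript of a finished user turn."""
        text = transcript.strip()
        if not text:
            return  # assumption: nothing to echo for an empty transcript
        self.speak(f"You said: {text}")
```

Passing `speak=print` is enough to try it from a terminal. In the real agent, the transcript would come from Deepgram STT and `speak` would hand the text to Cartesia TTS.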

🚫 No-Overlap / Interruption Handling

The LiveKit Agents SDK handles interruption automatically via the VAD pipeline:

  • Silero VAD detects speech start/end in real time
  • If the user starts speaking while the agent is outputting audio, the SDK cancels the agent's current speech immediately
  • The agent keeps listening throughout and responds only after the user finishes their turn
  • This behavior is covered in the LiveKit Agents SDK documentation on AgentSession interruption behavior
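
The SDK does this internally, so no custom code is needed. Purely to illustrate the idea, the sketch below shows one way barge-in cancellation can be modelled with `asyncio` tasks. `SpeechOutput` and its methods are hypothetical and are not part of the LiveKit API; the `asyncio.sleep` call is a placeholder for audio playback.

```python
import asyncio
from typing import List, Optional


class SpeechOutput:
    """Tracks the agent's current utterance so it can be cut off on barge-in."""

    def __init__(self) -> None:
        self._current: Optional[asyncio.Task] = None
        self.spoken: List[str] = []  # utterances that played to completion

    async def _play(self, text: str) -> None:
        # Placeholder for streaming TTS audio: longer text takes longer to play.
        await asyncio.sleep(0.05 * len(text.split()))
        self.spoken.append(text)

    def say(self, text: str) -> asyncio.Task:
        """Start speaking, replacing any utterance still in progress."""
        self.interrupt()
        self._current = asyncio.create_task(self._play(text))
        return self._current

    def interrupt(self) -> bool:
        """Cancel playback if the agent is mid-utterance.

        Returns True when something was actually cut off.
        """
        if self._current is not None and not self._current.done():
            self._current.cancel()
            return True
        return False
```

In this model, a VAD "speech started" event would call `interrupt()`, and the reply to the new user turn would be queued with `say()` once the turn ends. Both methods must be called from inside a running event loop.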

⏱️ Silence Handling

  • A 20-second countdown timer (asyncio.sleep(20)) starts after every user interaction or agent greeting
  • If the timer completes without being reset, the agent says "Are you still there? Feel free to say something."
  • The timer resets on every new user turn, preventing repeated reminders during active conversation
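
A resettable timer of this kind can be sketched with a cancellable `asyncio` task. This is a minimal illustration under stated assumptions, not the code in `agent.py`: `SilenceTimer`, `reset`, `stop`, and the `on_timeout` callback are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class SilenceTimer:
    """Calls `on_timeout` after every `timeout` seconds of silence until reset or stopped."""

    def __init__(self, timeout: float, on_timeout: Callable[[], Awaitable[None]]) -> None:
        self._timeout = timeout
        self._on_timeout = on_timeout
        self._task: Optional[asyncio.Task] = None

    def reset(self) -> None:
        """Restart the countdown; call after the greeting and on every user turn."""
        self.stop()
        self._task = asyncio.create_task(self._run())

    def stop(self) -> None:
        """Cancel the countdown, for example when the session ends."""
        if self._task is not None:
            self._task.cancel()
            self._task = None

    async def _run(self) -> None:
        # Keeps looping, so the reminder repeats while the user stays silent.
        while True:
            await asyncio.sleep(self._timeout)
            await self._on_timeout()
```

With `timeout=20` and an `on_timeout` coroutine that speaks the reminder, this matches the behaviour described above. `reset()` has to be called from inside a running event loop because it creates a task.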

🛠️ Setup Instructions

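Prerequisites

  • Python 3 with pip and the venv module
  • A LiveKit Cloud project (URL, API key, and API secret)
  • A Deepgram API key
  • A Cartesia API key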

Installation

  1. Clone the repository:

    git clone https://github.com/YOUR_USERNAME/voice-agent.git
    cd voice-agent
  2. Create and activate a virtual environment:

    python -m venv venv
    # On Mac/Linux:
    source venv/bin/activate
    # On Windows:
    venv\Scripts\activate
  3. Install dependencies:

    pip install livekit-agents livekit-plugins-deepgram livekit-plugins-cartesia livekit-plugins-silero python-dotenv
  4. Create a .env file in the project root:

    LIVEKIT_URL=wss://your-project.livekit.cloud
    LIVEKIT_API_KEY=your_livekit_api_key
    LIVEKIT_API_SECRET=your_livekit_api_secret
    DEEPGRAM_API_KEY=your_deepgram_api_key
    CARTESIA_API_KEY=your_cartesia_api_key
    

🚀 How to Run

    python agent.py dev

The agent will start and wait for someone to join a LiveKit room.

🧪 Testing

Use the LiveKit Agents Playground to connect and talk to the agent:

  1. Go to https://agents-playground.livekit.io
  2. Enter your LIVEKIT_URL, LIVEKIT_API_KEY, and LIVEKIT_API_SECRET
  3. Click Connect and allow microphone access
  4. Speak — the agent will reply with "You said: <your words>"

🔑 Required Environment Variables

| Variable | Description |
| --- | --- |
| LIVEKIT_URL | Your LiveKit Cloud WebSocket URL |
| LIVEKIT_API_KEY | LiveKit project API key |
| LIVEKIT_API_SECRET | LiveKit project API secret |
| DEEPGRAM_API_KEY | Deepgram API key for speech-to-text |
| CARTESIA_API_KEY | Cartesia API key for text-to-speech |
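
A missing key usually only surfaces once the first STT or TTS request fails, so it can help to check the variables at startup. The sketch below is one possible check using only the standard library; `REQUIRED_VARS` and `missing_vars` are hypothetical names, and the repository may handle this differently or not at all.

```python
import os
from typing import List, Mapping

# Variable names taken from the list above.
REQUIRED_VARS = (
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "DEEPGRAM_API_KEY",
    "CARTESIA_API_KEY",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Call `missing_vars()` after the .env file has been loaded (for example with python-dotenv's `load_dotenv()`), and exit with a message listing the returned names if the list is not empty.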

📦 SDK Used

  • livekit-agents v1.4.3 — core agent framework
  • livekit-plugins-deepgram — Deepgram STT integration
  • livekit-plugins-cartesia — Cartesia TTS integration
  • livekit-plugins-silero — Silero VAD (local, no API key needed)

🌐 External Services

| Service | Purpose | Free Tier |
| --- | --- | --- |
| LiveKit Cloud | Real-time audio room infrastructure | Yes |
| Deepgram | Speech-to-text (nova-3 model) | $200 free credit |
| Cartesia | Text-to-speech | Yes |

⚠️ Known Limitations

  • No UI — testing requires the LiveKit Agents Playground
  • The silence reminder repeats every 20 seconds for as long as the user stays silent; it never stops on its own
  • Agent is English-only (Deepgram configured for en-US)
  • Requires a stable internet connection for Deepgram and Cartesia API calls
