@pranavjoshi001 pranavjoshi001 commented Dec 12, 2025

Changelog Entry

TBD

Description

This PR introduces Speech-to-Speech (S2S) functionality in Web Chat, enabling real-time voice conversations with bots. The implementation includes audio recording via AudioWorklet, audio playback with buffer queueing, and speech state management. This foundation supports upcoming MMRT (Multi-Modal Real-Time), ABS (Azure Bot Service), and CCV2 integration changes.

Activity structure - microsoft/Agents#377
Note - This PR uses value instead of payload until the proposal is merged.

Design

The Speech-to-Speech feature is built on three main components:

  1. Voice Activities Hook (useVoiceActivities.ts) - Filters and provides voice-specific activities from the Redux store's voiceActivities slice
  2. SpeechToSpeech Provider (SpeechToSpeechComposer.tsx) - A React context provider that manages:
    • Audio recording via useRecorder.ts hook using Web Audio API with AudioWorklet (CSP compliant)
    • Audio playback via useAudioPlayer.ts hook with proper queueing and timing
    • Speech state management (idle, listening, user_speaking, processing, bot_speaking)
    • Integration with DirectLine for sending audio chunks and handling voice events
  3. useSpeechToSpeech Hook - Provides recording, setRecording, and speechState for consumer UI components
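
The "queueing and timing" part of playback can be pictured as gapless scheduling: each decoded chunk starts where the previous one ends, or immediately if the queue has run dry. The block below is a minimal sketch under that assumption, not the code in useAudioPlayer.ts. PlaybackScheduler and its methods are hypothetical names, and the clock is passed in explicitly so the logic runs without an AudioContext.

```typescript
// Hypothetical gapless scheduler; not the actual useAudioPlayer.ts implementation.
class PlaybackScheduler {
  // Time (in seconds) at which the last queued chunk finishes playing.
  private nextStartTime = 0;

  // Returns the start time for a chunk of `sampleCount` samples.
  // `now` stands in for AudioContext.currentTime.
  schedule(now: number, sampleCount: number, sampleRate: number): number {
    // If the queue ran dry, start immediately; otherwise append to the queue.
    const startTime = Math.max(now, this.nextStartTime);
    this.nextStartTime = startTime + sampleCount / sampleRate;
    return startTime;
  }

  // Drops pending timing, for example when the user barges in over the bot.
  reset(): void {
    this.nextStartTime = 0;
  }
}
```

In a browser, the returned value would typically be passed as the `when` argument of AudioBufferSourceNode.start().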

Speech State Flow

idle → listening → user_speaking → processing → bot_speaking → listening
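
As a sketch of the flow above, the happy path can be written as a lookup table from each state to its successor. Only the five state names come from this PR; NEXT_STATE and walk are hypothetical helpers, and transitions outside the listed flow (such as barge-in or stopping the recording) are deliberately not modeled.

```typescript
// The five states named in this PR.
type SpeechState = 'idle' | 'listening' | 'user_speaking' | 'processing' | 'bot_speaking';

// Happy-path successor of each state, taken from the flow above.
const NEXT_STATE: Record<SpeechState, SpeechState> = {
  idle: 'listening',
  listening: 'user_speaking',
  user_speaking: 'processing',
  processing: 'bot_speaking',
  bot_speaking: 'listening'
};

// Walks the flow for a number of steps and returns every state visited.
function walk(start: SpeechState, steps: number): SpeechState[] {
  const visited: SpeechState[] = [start];
  let current = start;
  for (let i = 0; i < steps; i++) {
    current = NEXT_STATE[current];
    visited.push(current);
  }
  return visited;
}
```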

The provider is designed to work with the existing Web Chat architecture, consuming voice activities from the Redux store and posting audio data through the postActivity hook.

Performance Optimization

A new voiceActivities slice has been introduced in the Redux store to optimize voice activity handling:

  • Separate Storage: Voice activities (non-transcript) are stored in a dedicated slice instead of the main activities array
  • Fire-and-Forget: Voice events such as voice chunk deltas and bot state changes bypass the expensive sorting and grouping operations
  • Reduced Overhead: Keeps high-frequency voice events that need no rendering or replaying out of the main activities reducer
  • Selective Processing: Only voice transcript activities (which need to be rendered in chat) go through the standard activity pipeline
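
The split described above can be sketched as a single routing function. This is an illustration under stated assumptions, not the reducer in this PR: the activity shape is simplified, the "voice." name prefix is an assumption (only voice.transcript is named in this description), and SketchActivity, StoreSketch, and routeActivity are hypothetical names.

```typescript
// Hypothetical, simplified activity shape; the real schema is tracked in microsoft/Agents#377.
type SketchActivity = { name: string; value?: unknown };

type StoreSketch = {
  activities: SketchActivity[];      // rendered in chat; stand-in for the standard pipeline
  voiceActivities: SketchActivity[]; // high-frequency voice events, appended as-is
};

// Assumption: transcripts are named "voice.transcript" and every other name
// starting with "voice." is a voice event that is never rendered.
function routeActivity(state: StoreSketch, activity: SketchActivity): StoreSketch {
  const isVoice = activity.name.indexOf('voice.') === 0;
  const isTranscript = activity.name === 'voice.transcript';

  if (isVoice && !isTranscript) {
    // Append only: no sorting or grouping.
    return { ...state, voiceActivities: [...state.voiceActivities, activity] };
  }

  // Transcripts and non-voice activities take the normal path.
  return { ...state, activities: [...state.activities, activity] };
}
```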

Specific Changes

New Files Added:

Core Utilities (packages/core)

  • isVoiceActivity.ts - Type guard for voice/DTMF activities
  • isVoiceTranscriptActivity.ts - Type guard for transcript activities (voice.transcript)
  • getVoiceTranscriptRole.ts - Extract role (user/bot) from voice transcript
  • getVoiceTranscriptText.ts - Extract transcription text from voice activity
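
A rough sketch of how a transcript type guard and its accessors could fit together follows. The field names inside value (role, text) and all identifiers here are assumptions made for illustration; the real schema is tracked in microsoft/Agents#377 and the actual utilities may differ.

```typescript
// Hypothetical activity shape. Field names inside `value` are assumptions.
type ActivityLike = { type: string; name?: string; value?: unknown };

type TranscriptActivityLike = ActivityLike & {
  name: 'voice.transcript';
  value: { role: 'user' | 'bot'; text: string };
};

// Type guard: narrows to a transcript when name and value match the assumed shape.
function isTranscriptSketch(activity: ActivityLike): activity is TranscriptActivityLike {
  if (activity.name !== 'voice.transcript') return false;
  const value = activity.value;
  if (typeof value !== 'object' || value === null) return false;
  const { role, text } = value as { role?: unknown; text?: unknown };
  return (role === 'user' || role === 'bot') && typeof text === 'string';
}

// Accessors return undefined for anything that is not a well-formed transcript.
function transcriptRole(activity: ActivityLike): 'user' | 'bot' | undefined {
  return isTranscriptSketch(activity) ? activity.value.role : undefined;
}

function transcriptText(activity: ActivityLike): string | undefined {
  return isTranscriptSketch(activity) ? activity.value.text : undefined;
}
```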

Provider & Hooks (packages/api)

  • SpeechToSpeechComposer.tsx - Main S2S provider component (integrated into Composer)
  • useSpeechToSpeech.ts - Public hook to consume S2S context
  • useVoiceActivities.ts - Hook to select voice activities from store
  • useAudioPlayer.ts - Audio playback with buffer queueing (Base64 → Int16 → Float32)
  • useRecorder.ts - AudioWorklet-based recording (Float32 → Int16 → Base64)
  • SpeechState.ts - Speech state type definition
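
The two conversion chains listed for useRecorder.ts and useAudioPlayer.ts can be sketched as four small functions. This is a generic 16-bit PCM conversion written for illustration, not the code in those hooks; the function names are hypothetical, the byte order is whatever the platform uses for typed arrays (little-endian on common hardware), and btoa/atob are assumed to be available as globals.

```typescript
// Float32 samples in [-1, 1] -> 16-bit signed PCM.
function floatToInt16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale by 32768 and positive by 32767 to use the full Int16 range.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// 16-bit PCM -> Base64 string.
function int16ToBase64(pcm: Int16Array): string {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Base64 string -> 16-bit PCM.
function base64ToInt16(base64: string): Int16Array {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Int16Array(bytes.buffer, 0, bytes.length >> 1);
}

// 16-bit PCM -> Float32 samples in [-1, 1).
function int16ToFloat(pcm: Int16Array): Float32Array {
  const out = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) {
    out[i] = pcm[i] / 0x8000;
  }
  return out;
}
```

The round trip is lossy only in the first step, where Float32 samples are quantized to 16 bits; the Base64 encode and decode steps preserve the PCM bytes exactly.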

Redux Store (packages/core)

  • voiceActivities slice for storing non-transcript voice events

Test Coverage Added:

  • Unit tests for hooks and utilities
  • E2E HTML tests covering:
    • Full conversation flow
    • Multi-turn conversations
    • Barge-in/interruption handling
    • CSP compliance
    • Audio timing and synchronization

Next Steps:

  • Animation using voice intensity
  • Documentation and samples once the activity structure and back end (BE) are ready (both are still in the testing phase)
  • Minor UI improvements (timestamps, bot labeling, flair in incoming messages)
  • Improve the config logic in Composer (kept as-is for now to unblock further work while other approaches are explored)

Author Checklist:

  • I have added tests and executed them locally
  • I have updated CHANGELOG.md
  • I have updated documentation

Review Checklist

This section is for contributors to review your work.

  • Accessibility reviewed (tab order, content readability, alt text, color contrast)
  • Browser and platform compatibilities reviewed
  • CSS styles reviewed (minimal rules, no z-index)
  • Documents reviewed (docs, samples, live demo)
  • Internationalization reviewed (strings, unit formatting)
  • package.json and package-lock.json reviewed
  • Security reviewed (no data URIs, check for nonce leak)
  • Tests reviewed (coverage, legitimacy)

@pranavjoshi001 changed the title from "Feature/core s2s composer" to "Core speech to speech composer implementation (no-op code)" on Dec 12, 2025
@pranavjoshi001 changed the title from "Core speech to speech composer implementation (no-op code)" to "Core speech to speech implementation" on Jan 14, 2026
