Migrate to inworld-tts-2 + Soniox STT, add 54 languages #75
Merged
- TTS: switch realtime + REST calls to inworld-tts-2; pass BCP-47 via session.providerData.tts.language and the REST language field so cross-lingual voices stay grounded in the target locale.
- STT: switch to soniox/stt-rt-v4 with ISO 639-1 language hint via transcription.language. Soniox emits cumulative partial deltas, so replace the buffer per delta instead of appending (fixes "hi hi there hi there this..." UI bug).
- Languages: reshuffle LanguageConfig to make BCP-47 canonical (drop sttLanguageCode + ttsConfig.languageCode); add 54 Soniox-supported languages with alternating Sarah/Jason TTS-2 voices and templated personas. Existing 6 curated personas/voices/topics preserved.
- Frontend: bump welcome modal copy to "60+ languages" and let the language dropdown scroll (max-height: 60vh, overflow-y: auto).
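The append-versus-replace distinction in the STT bullet can be illustrated with a minimal sketch. The class and function names below (`PartialTranscriptBuffer`, `onDelta`, `commit`, `appendAll`) are hypothetical and are not taken from session-manager.ts; the only behavior assumed is what the commit message states, namely that each Soniox partial delta carries the full transcript so far.

```typescript
// Hypothetical names for illustration only; not the identifiers used in the PR.
class PartialTranscriptBuffer {
  private text = "";

  // Soniox partial deltas are cumulative: each one already contains the
  // whole transcript so far, so the buffer is overwritten, not extended.
  onDelta(delta: string): string {
    this.text = delta;
    return this.text;
  }

  // Returns the current text and resets the buffer for the next utterance.
  commit(): string {
    const finalText = this.text;
    this.text = "";
    return finalText;
  }
}

// For contrast: appending cumulative deltas reproduces the reported UI bug.
function appendAll(deltas: string[]): string {
  return deltas.join(" ");
}
```

Feeding the deltas `"hi"`, `"hi there"`, `"hi there this"` through `onDelta` leaves the buffer at `"hi there this"`, while `appendAll` on the same input yields `"hi hi there hi there this"`.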
Pull request overview
This PR expands the language-tutor app from a small curated set to a 60-language catalog while migrating speech integration to Inworld TTS 2 and Soniox realtime STT. It updates both realtime session wiring and REST pronunciation/export paths so language selection now drives provider-specific locale hints.
Changes:
- Switch realtime transcription to `soniox/stt-rt-v4` and update partial-transcript handling for Soniox’s cumulative deltas.
- Migrate TTS requests to `inworld-tts-2`, passing BCP-47 language hints for realtime and REST pronunciation/export flows.
- Restructure `LanguageConfig` around `code` + `bcp47`, add 54 language entries, and make small frontend UI updates for the larger language list.
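As a rough sketch of how one language entry could drive both provider hints, the example below builds a `session.update` payload from a config carrying `code` and `bcp47`. The paths `session.providerData.tts.language` and `transcription.language` come from the PR text; everything else is an assumption made for illustration, including the `LanguageConfigSketch` type, the `type` field, the `model` key, and the placement of `transcription` under `session`.

```typescript
// Assumed, simplified shape; the real LanguageConfig has more fields.
interface LanguageConfigSketch {
  code: string;  // STT language hint, e.g. "ja"
  bcp47: string; // canonical locale used for the TTS hint, e.g. "ja-JP"
}

// Hypothetical builder: nesting and key names beyond the two documented
// paths are guesses, not the provider's confirmed schema.
function buildSessionUpdate(lang: LanguageConfigSketch) {
  return {
    type: "session.update",
    session: {
      transcription: {
        model: "soniox/stt-rt-v4",
        language: lang.code,
      },
      providerData: {
        tts: { language: lang.bcp47 },
      },
    },
  };
}
```

Keeping `bcp47` canonical means the STT hint and the TTS hint come from one entry, instead of separate `sttLanguageCode` and `ttsConfig.languageCode` fields that could drift apart.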
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `frontend/src/styles/main.css` | Makes the new-chat language dropdown scrollable for the expanded language list. |
| `frontend/src/components/WelcomeModal.tsx` | Updates the welcome copy to advertise broader language support. |
| `backend/src/services/websocket-handler.ts` | Passes the selected language locale into websocket-triggered pronunciation requests. |
| `backend/src/services/session-manager.ts` | Switches realtime STT provider/model, updates partial-delta handling, and sends TTS locale hints in session.update. |
| `backend/src/services/inworld-llm.ts` | Updates REST pronunciation requests to TTS 2 and adds a language parameter. |
| `backend/src/helpers/tts-audio-generator.ts` | Updates batch Anki/export TTS generation to TTS 2 with per-language locale hints. |
| `backend/src/config/server.ts` | Changes the default configured TTS model string. |
| `backend/src/config/languages.ts` | Refactors language config shape and adds the new supported language catalog. |
| `backend/src/__tests__/session-manager.test.ts` | Adjusts realtime/session tests for Soniox cumulative deltas and new session payload fields. |
| `backend/src/__tests__/inworld-llm.test.ts` | Updates pronunciation tests for the new method signature and TTS 2 request body. |
| `backend/src/__tests__/config/languages.test.ts` | Updates config tests for the new bcp47 field. |
Teach the model to use TTS-2's three expressivity mechanisms (English steering tags, English non-verbal tags, target-language disfluencies), and use locale-native voices where the Inworld catalog has them.
- New `disfluencies` field on LanguageConfig with 2-4 fillers per locale (e.g. えーと/あの for ja, este/eh/pues for es) seeded across all 60 languages
- Expanded session.update instructions with a `# Voice & expressivity` section distinguishing English-only control tags ([speak warmly], [laugh]) from inline target-language disfluencies
- Strip bracketed control tags from streaming and final assistant transcripts so the UI shows clean text; held-back buffer handles tags spanning chunks
- Switch ar/nl/he/hi/ja/ko/pl/ru/zh to native voices (Nour, Katrien, Yael, Aarav, Hina, Hyunwoo, Szymon, Elena, Mei). Sarah/Jason remain fallbacks for the ~45 languages without natives in the catalog
- Polish only has male natives, so persona renamed Kasia → Szymon
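The tag-stripping step with a held-back buffer might look roughly like the sketch below. It is not the PR's implementation: `ControlTagStripper`, `push`, `flush`, and the `maxTagLength` cutoff are all hypothetical, and the sketch assumes control tags are plain `[...]` spans with no nesting.

```typescript
// Hypothetical streaming filter; names and the length cutoff are assumptions.
class ControlTagStripper {
  private held = "";
  private readonly maxTagLength: number;

  constructor(maxTagLength = 64) {
    this.maxTagLength = maxTagLength;
  }

  // Returns the displayable part of a chunk. An unclosed "[" at the end is
  // held back until a later chunk closes it, so tags split across chunks
  // never reach the UI.
  push(chunk: string): string {
    const text = this.held + chunk;
    this.held = "";
    let out = "";
    let i = 0;
    while (i < text.length) {
      const open = text.indexOf("[", i);
      if (open === -1) {
        out += text.slice(i);
        break;
      }
      out += text.slice(i, open);
      const close = text.indexOf("]", open);
      if (close === -1) {
        const tail = text.slice(open);
        // Too long to be a control tag: treat it as ordinary text.
        if (tail.length > this.maxTagLength) out += tail;
        else this.held = tail;
        break;
      }
      i = close + 1; // skip the complete tag
    }
    return out;
  }

  // Emits whatever is still held when the stream ends.
  flush(): string {
    const rest = this.held;
    this.held = "";
    return rest;
  }
}
```

Pushing `"Hola [la"` followed by `"ugh] amigo"` yields `"Hola "` and then `" amigo"`. Whitespace left behind by a removed tag is not normalized here; a real UI would likely collapse it.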
The previous "0–2 per turn, only when thinking" wording was too permissive and disfluencies were buried as the third bullet under steering and non-verbal tags. With responses already capped at 1–2 sentences, the model read it as optional and skipped them entirely. Lead the expressivity section with disfluencies, frame them as expected in MOST responses, and demote bracketed control tags to optional/rare. Also seed two concrete inline templates so the model has a usage pattern to mimic.
A more capable model should follow the new TTS-2 expressivity guidance (disfluencies, steering tags) more reliably. Updates all three call sites that were unified on the previous model so realtime conversation, flashcard/feedback generation, and memory operations stay aligned.
Live testing showed the model picking the first item in the list ("um")
turn after turn — the seeded list was being read as "use these specific
words" rather than "natural fillers like these".
- Reframe the list as non-exhaustive examples; explicitly welcome any
common ${name} filler the model knows
- Add a "VARY" rule: never reuse the same disfluency in consecutive
responses, don't lean on the most generic one every turn
- Drop the example template that hardcoded disfluencies[0], which was
reinforcing the lock-in
- Expand the curated 6 (en/es/fr/de/it/pt) and ja/ko/zh from 3-4 fillers
to 8-10 each so the model has more native variety to draw from
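One way to picture the reframed guidance is a small prompt builder that lists every seeded filler as an example and never singles out the first one. The function name and the exact wording below are illustrative assumptions, not the instruction text shipped in the PR; only the `${name}` interpolation, the non-exhaustive framing, and the VARY rule come from the commit message.

```typescript
// Hypothetical helper; the wording is a paraphrase for illustration only.
function buildDisfluencyGuidance(name: string, disfluencies: string[]): string {
  // All fillers are listed together so no single one is privileged.
  const examples = disfluencies.map((d) => `"${d}"`).join(", ");
  return [
    "# Voice & expressivity",
    `Use a natural ${name} filler in MOST responses.`,
    `Examples (non-exhaustive): ${examples}. Any other common ${name} filler is welcome.`,
    "VARY: never reuse the same disfluency in consecutive responses, and do not lean on the most generic one every turn.",
  ].join("\n");
}
```

Calling `buildDisfluencyGuidance("Spanish", ["este", "eh", "pues"])` produces a four-line section. No template in it references `disfluencies[0]`, which is the lock-in the commit removes.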
Adopts the inworld-golden-demo pattern: route TTS playback through a
two-peer local WebRTC loopback so Chrome's hardware AEC sees the
playback as an explicit reference signal. Without that signal the
echoCancellation: true constraint we were already passing has nothing
concrete to subtract, which is why the agent could hear itself when
the user listens through speakers.
Architecture:
- New AudioPipeline owns one shared 24kHz playback AudioContext and a
master output node. Both AudioPlayer instances (streaming voice + TTS
pronunciation) connect into it so a single WebRTC reference covers
all playback paths.
- New WebRtcLoopback (ported from golden demo) wires two RTCPeerConnections
locally via ICE; mic stream goes to client peer, playback stream to
server peer. The AEC-processed mic stream comes back out for capture.
- Loopback playback routes through a hidden <audio> element so Chrome
uses its standard output pipeline (required for the AEC reference).
Smaller wins bundled in:
- getUserMedia constraints add suppressLocalAudioPlayback: { ideal: true }
(Chrome-only hint) and sampleRate: 24000.
- Capture AudioContext is created at 24kHz so the worklet receives samples
at the target rate — drops the per-quantum linear-interp resampling.
- Worklet simplified accordingly: just buffer 2400 samples (100ms) and
post Int16 PCM. Drop the ScriptProcessorNode fallback (AudioWorklet is
universally supported on browsers we target).
- AudioPlayer fade duration bumped 15ms → 25ms; the longer ramp lets
Chrome's AEC re-converge cleanly on user interrupt without an audible
click. Documented inline.
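The simplified worklet logic, buffering 2400 samples and emitting Int16 PCM, can be sketched as plain TypeScript outside the `AudioWorkletProcessor` wrapper. `PcmChunker` and `floatToInt16` are hypothetical names, and the clamping and scaling convention is a common choice rather than something the commit specifies.

```typescript
const CHUNK_SAMPLES = 2400; // 100 ms at 24 kHz

// Hypothetical stand-in for the worklet's buffering logic.
class PcmChunker {
  private buffer = new Float32Array(CHUNK_SAMPLES);
  private filled = 0;

  // Accepts one block of float samples (a render quantum is typically 128)
  // and returns every chunk that became complete, already as Int16 PCM.
  push(input: Float32Array): Int16Array[] {
    const chunks: Int16Array[] = [];
    for (let i = 0; i < input.length; i++) {
      this.buffer[this.filled++] = input[i] ?? 0;
      if (this.filled === CHUNK_SAMPLES) {
        chunks.push(floatToInt16(this.buffer)); // copies, so reuse is safe
        this.filled = 0;
      }
    }
    return chunks;
  }
}

// Clamps to [-1, 1] and scales to the signed 16-bit range.
function floatToInt16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

Because the capture context already runs at 24 kHz, no resampling appears in this path. Inside a real worklet, each returned chunk would be handed to `port.postMessage`.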
Graceful degradation: WebRTC loopback only enables on Chrome/Edge. On
Firefox/iOS Safari the pipeline detects the lack of support and routes
playback directly to ctx.destination, returning the original mic stream
unchanged — same behavior as before this change. iOS handler path is
fully preserved.
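The degradation rule can be summarized as a routing decision. The sketch below is an assumption about how such a check could be written: the commit does not say how support is detected, so the user-agent test, the `hasRtcPeerConnection` flag, and the `AudioRoute` type are all hypothetical.

```typescript
// Hypothetical description of the two routes named in the commit message.
type AudioRoute =
  | { mode: "loopback"; playbackSink: "hidden-audio-element"; captureSource: "aec-processed-mic" }
  | { mode: "direct"; playbackSink: "ctx.destination"; captureSource: "original-mic" };

// Assumed detection: desktop/Android Chrome and Chromium Edge both carry a
// "Chrome/<version>" token, while Firefox and iOS browsers do not.
function chooseAudioRoute(userAgent: string, hasRtcPeerConnection: boolean): AudioRoute {
  const isChromium = /Chrome\/\d+/.test(userAgent);
  if (isChromium && hasRtcPeerConnection) {
    return {
      mode: "loopback",
      playbackSink: "hidden-audio-element",
      captureSource: "aec-processed-mic",
    };
  }
  // Same behavior as before the change: play straight to the context
  // destination and capture from the unmodified mic stream.
  return {
    mode: "direct",
    playbackSink: "ctx.destination",
    captureSource: "original-mic",
  };
}
```

Making the decision a pure function keeps the fallback path testable without a browser.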
- Stop hardcoding 'inworld-tts-2' in REST TTS paths (inworld-llm.ts pronounce + tts-audio-generator.ts batch export). Both now read modelId from langConfig.ttsConfig so future model migrations stay in one place.
- Match Sarah/Jason fallback voices to each persona's gender across all 25 non-native-voice languages (e.g. Pieter→Jason, Aino→Sarah, Iker→Jason). Native-voice languages were already correct.
- Indonesian nativeName: 'Indonesia' (country) → 'Bahasa Indonesia'.
- Filipino: rename code 'tl' → 'fil' so Soniox STT and TTS-2 BCP-47 hint reference the same language designator. Updated the LanguageConfig comment to acknowledge ISO 639-2 fallback when 639-1 doesn't apply (Filipino has no 639-1 code).
- Distinguish Basque/Catalan/Galician sidebar flags (all rendered as 🏴 because Spanish-region tag sequences aren't in standard emoji fonts): 🟥 / 🟨 / 🟦 matching each flag's dominant color. Welsh keeps the standard 🏴 dragon glyph.
- Run prettier across backend so the formatter-driven CI lint job passes.
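A hedged sketch of the first two bullets: the REST request body takes its model from the language config instead of a literal, and languages without a native voice fall back to Sarah or Jason by persona gender. The `personaGender` field, the `voiceId` and `text` keys, and the `pickVoice` helper are hypothetical. The native-voice map pairs the codes and voice names from the earlier commit message positionally, which is an assumption, and the six curated languages keep their own voices and are outside this sketch.

```typescript
type Gender = "female" | "male";

// Assumed, simplified shape for illustration.
interface LangConfigSketch {
  code: string;
  bcp47: string;
  personaGender: Gender; // hypothetical field
  ttsConfig: { modelId: string };
}

// Positional pairing of the listed codes and voices (an assumption).
const NATIVE_VOICES: Readonly<Record<string, string | undefined>> = {
  ar: "Nour", nl: "Katrien", he: "Yael", hi: "Aarav", ja: "Hina",
  ko: "Hyunwoo", pl: "Szymon", ru: "Elena", zh: "Mei",
};

function pickVoice(lang: LangConfigSketch): string {
  const native = NATIVE_VOICES[lang.code];
  if (native !== undefined) return native;
  // Fallback voices follow the persona's gender.
  return lang.personaGender === "male" ? "Jason" : "Sarah";
}

function buildPronounceBody(text: string, lang: LangConfigSketch) {
  return {
    text,
    modelId: lang.ttsConfig.modelId, // read from config, never hardcoded
    voiceId: pickVoice(lang),
    language: lang.bcp47,
  };
}
```

With the model ID living only in `ttsConfig`, a later model migration touches the language config and nothing in the REST call sites.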
SuomiKP31
approved these changes
May 5, 2026
Summary
- TTS: `inworld-tts-2` everywhere (realtime + REST). Language is supplied per-call: BCP-47 via `session.providerData.tts.language` for the realtime session, and the REST `language` field for Anki batch + flashcard pronunciation. Voices stay cross-lingual but grounded in the target locale.
- STT: `soniox/stt-rt-v4` with the ISO 639-1 language hint via `transcription.language`. Soniox emits cumulative partial deltas (each delta is the full transcript so far), so the buffer logic switched from append to replace and partials no longer render as `"hi hi there hi there this..."`.
- Languages: reshuffled `LanguageConfig` to make BCP-47 the canonical form (dropped `sttLanguageCode` and `ttsConfig.languageCode`). Added all Soniox-supported languages with alternating Sarah/Jason TTS-2 voices and templated personas. The 6 existing curated personas/voices/topics are preserved as-is.
- Frontend: welcome modal copy bumped to "60+ languages"; the language dropdown scrolls (`max-height: 60vh`, `overflow-y: auto`, `overscroll-behavior: contain`).

Test plan
- `cd backend && npx tsc --noEmit`
- `cd frontend && npx tsc --noEmit`
- `cd backend && npx vitest run` (all 60 tests pass)
- … `session.update` with new BCP-47 … `inworld_error_event` (TTS-2 / Soniox surface here)