diff --git a/api-reference/server/services/tts/elevenlabs.mdx b/api-reference/server/services/tts/elevenlabs.mdx
index 4bda91bb..1c2dc619 100644
--- a/api-reference/server/services/tts/elevenlabs.mdx
+++ b/api-reference/server/services/tts/elevenlabs.mdx
@@ -245,7 +245,7 @@ async with aiohttp.ClientSession() as session:
 - **Multilingual models required for `language`**: Setting `language` with a non-multilingual model (e.g. `eleven_turbo_v2_5`) has no effect. Use `eleven_multilingual_v2` or similar.
 - **WebSocket vs HTTP**: The WebSocket service supports word-level timestamps and interruption handling, making it significantly better for interactive conversations. The HTTP service is simpler but lacks these features.
 - **Text aggregation**: Sentence aggregation is enabled by default (`text_aggregation_mode=TextAggregationMode.SENTENCE`). Buffering until sentence boundaries produces more natural speech. Set `text_aggregation_mode=TextAggregationMode.TOKEN` to stream tokens directly for lower latency. The `auto_mode` parameter is automatically configured based on the aggregation mode for optimal quality.
-- **Word timestamp accuracy**: Word timestamps accurately reflect the spoken audio, not just the input text. When using pronunciation dictionaries or text normalization (`apply_text_normalization`), the service consumes ElevenLabs' normalized alignment data to ensure downstream consumers (captions, transcripts, context aggregation) match what the listener actually hears.
+- **Word timestamp accuracy**: Word timestamps reflect the original input text by default, preserving non-Latin scripts in transcripts and LLM context. When pronunciation dictionaries are configured via `pronunciation_dictionary_locators`, the service switches to ElevenLabs' normalized alignment to avoid duplicate words caused by dictionary substitutions. Text normalization (`apply_text_normalization`) does not affect which alignment field is used.
 
 ## Event Handlers
 
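
Reviewer note: the selection rule described in the added line can be sketched as below. This is only an illustration under stated assumptions, not the service's actual implementation. The helper name `select_alignment` is hypothetical, and the `alignment` / `normalizedAlignment` keys are assumed to be the field names carried in the ElevenLabs WebSocket message.

```python
from typing import Optional, Sequence


def select_alignment(
    message: dict,
    pronunciation_dictionary_locators: Optional[Sequence] = None,
) -> Optional[dict]:
    """Hypothetical helper: choose the alignment payload for word timestamps.

    Assumes `message` is a decoded WebSocket response that may carry both an
    "alignment" field (original input text) and a "normalizedAlignment" field.
    """
    if pronunciation_dictionary_locators:
        # Dictionaries are configured: use the normalized alignment so that
        # dictionary substitutions do not produce duplicate words.
        return message.get("normalizedAlignment")
    # Default: keep the original input text, which preserves non-Latin scripts.
    # apply_text_normalization is deliberately not an input to this decision.
    return message.get("alignment")
```

The text normalization setting does not appear in the signature, which mirrors the statement that `apply_text_normalization` does not affect which alignment field is used.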