SyncSpeak uses two external AI services: Sarvam AI (STT and TTS) and Groq (translation).
Sarvam AI dashboard (get your key): https://dashboard.sarvam.ai
Sarvam AI docs: https://docs.sarvam.ai
All Sarvam calls require the header:

    api-subscription-key: <your_sarvam_key>

The key is entered via the Settings modal in the app and saved to %APPDATA%\com.syncspeak.app\config.json. It is automatically re-injected into the Python engine on every launch. No .env file is needed.
Groq console (get your key): https://console.groq.com
Groq docs: https://console.groq.com/docs
Groq uses the standard Bearer token pattern handled by the groq Python SDK. The key is injected at runtime via the Settings modal (update_groq_key command) and stored in config.json by Tauri.
Endpoint: POST https://api.sarvam.ai/speech-to-text
Function: speech_to_text(wav_bytes) in translator.py
Used for: Transcribing Hindi/Hinglish mic audio
Multipart form data:
| Field | Value | Notes |
|---|---|---|
| file | WAV bytes | RIFF WAV, int16, 16 kHz, mono |
| model | "saarika:v2.5" | Current model; v2 is deprecated |
| language_code | "hi-IN" | Source language |

Response:

    {"transcript": "transcribed text here"}

| Status | Cause |
|---|---|
| 400 | Wrong model name, missing field, or malformed WAV |
| 401/403 | Invalid API key |
| 429 | Rate limit exceeded |
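
As a rough illustration of the multipart request described in this section, the sketch below assembles the arguments for a call made with the `requests` library. The helper name `build_stt_request`, the upload filename `audio.wav`, and the `audio/wav` content type are assumptions made for this example and are not taken from translator.py.

```python
from typing import Any, Dict

SARVAM_STT_URL = "https://api.sarvam.ai/speech-to-text"


def build_stt_request(wav_bytes: bytes, api_key: str) -> Dict[str, Any]:
    """Return keyword arguments suitable for requests.post(**kwargs)."""
    if wav_bytes[:4] != b"RIFF":
        # The endpoint expects a WAV container, not raw PCM samples.
        raise ValueError("expected RIFF WAV bytes, got raw or unknown data")
    return {
        "url": SARVAM_STT_URL,
        "headers": {"api-subscription-key": api_key},
        # Plain form fields from the request table.
        "data": {"model": "saarika:v2.5", "language_code": "hi-IN"},
        # Hypothetical filename and content type for the multipart file part.
        "files": {"file": ("audio.wav", wav_bytes, "audio/wav")},
    }
```

With `requests` installed, the call itself would look something like `requests.post(**build_stt_request(wav_bytes, key), timeout=30)`; the timeout value is also an assumption.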
The Python sidecar captures audio as int16 PCM. Before sending, _pcm_to_wav_bytes() (in sidecar_main.py) wraps the raw PCM in a standard 44-byte RIFF WAV header. Do not send raw PCM bytes without the header.
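
To make the 44-byte header concrete, here is a minimal sketch of wrapping int16 PCM in a canonical RIFF WAV header. It is not the actual `_pcm_to_wav_bytes()` from sidecar_main.py; the function name `pcm_to_wav_bytes` and its default arguments are assumptions chosen to match the 16 kHz mono format mentioned earlier.

```python
import struct


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Prefix raw little-endian int16 PCM with a 44-byte RIFF WAV header."""
    bits_per_sample = 16
    block_align = channels * bits_per_sample // 8   # bytes per sample frame
    byte_rate = sample_rate * block_align           # bytes per second of audio
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(pcm), b"WAVE",            # RIFF chunk: size excludes first 8 bytes
        b"fmt ", 16, 1, channels, sample_rate,      # fmt chunk: 16 bytes, format 1 = PCM
        byte_rate, block_align, bits_per_sample,
        b"data", len(pcm),                          # data chunk: payload length in bytes
    )
    return header + pcm
```

The standard library `wave` module produces an equivalent header, so it can be used to cross-check the output of a hand-written packer like this one.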
Endpoint: POST https://api.sarvam.ai/text-to-speech
Function: synthesize_speech(text, speaker) in translator.py
Used for: Synthesizing the translated English text
JSON body:
| Field | Value | Notes |
|---|---|---|
| inputs | ["text here"] | Array of strings |
| target_language_code | "en-IN" | English (Indian accent) |
| speaker | e.g. "shubh" | See voice list below |
| model | "bulbul:v3" | TTS model |
| speech_sample_rate | 22050 | Output sample rate (Hz) |
| temperature | 0.7 | Voice variation (0.0–1.0) |
| pace | 1.1 | Speed multiplier (1.0 = natural) |

Response:

    {"audios": ["<base64_encoded_wav_string>"]}

decode_base64_audio() in translator.py decodes this into a tuple of a float32 numpy array and the sample rate.
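
The request body and the decoding step can be sketched roughly as follows. This is not the code in translator.py: `build_tts_payload` and `decode_wav_base64` are hypothetical names, the sketch assumes the returned WAV is 16-bit PCM, and it returns a plain list of floats in [-1.0, 1.0) instead of a numpy array so that it runs with the standard library only.

```python
import base64
import io
import sys
import wave
from array import array
from typing import Any, Dict, List, Tuple


def build_tts_payload(text: str, speaker: str = "shubh") -> Dict[str, Any]:
    """JSON body using the field values listed for this endpoint."""
    return {
        "inputs": [text],
        "target_language_code": "en-IN",
        "speaker": speaker,
        "model": "bulbul:v3",
        "speech_sample_rate": 22050,
        "temperature": 0.7,
        "pace": 1.1,
    }


def decode_wav_base64(b64_wav: str) -> Tuple[List[float], int]:
    """Decode a base64 WAV string into (float samples, sample rate)."""
    raw = base64.b64decode(b64_wav)
    with wave.open(io.BytesIO(raw), "rb") as wf:
        if wf.getsampwidth() != 2:
            # Assumption: only 16-bit PCM is handled in this sketch.
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    samples = array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Scale int16 to floats in [-1.0, 1.0).
    return [s / 32768.0 for s in samples], sample_rate
```

A numpy-based version would replace the list comprehension with something like `np.frombuffer(frames, dtype="<i2").astype(np.float32) / 32768.0`.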
| ID | Gender | ID | Gender |
|---|---|---|---|
| shubh | Male | ritu | Female |
| sumit | Male | pooja | Female |
| amit | Male | simran | Female |
| manan | Male | kavya | Female |
| rahul | Male | priya | Female |
| ratan | Male | ishita | Female |
| - | - | shreya | Female |
| - | - | shruti | Female |
The female voice set is also tracked in FEMALE_SPEAKERS in translator.py.
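
For illustration, a set-based lookup over the voice IDs listed above might look like the sketch below. The source only confirms that a `FEMALE_SPEAKERS` collection exists; its exact type, the `MALE_SPEAKERS` set, and the `speaker_gender` helper are assumptions made for this example.

```python
# Voice IDs copied from the voice table; the container types are assumed.
MALE_SPEAKERS = frozenset({"shubh", "sumit", "amit", "manan", "rahul", "ratan"})
FEMALE_SPEAKERS = frozenset(
    {"ritu", "pooja", "simran", "kavya", "priya", "ishita", "shreya", "shruti"}
)


def speaker_gender(speaker_id: str) -> str:
    """Return "female" or "male" for a known voice ID (hypothetical helper)."""
    sid = speaker_id.lower()
    if sid in FEMALE_SPEAKERS:
        return "female"
    if sid in MALE_SPEAKERS:
        return "male"
    raise ValueError(f"unknown speaker: {speaker_id!r}")
```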
Endpoint: POST https://api.sarvam.ai/translate
Function: translate_text() and verify_api_key() in translator.py
Used for: API key verification only — not used in the main translation pipeline
The main pipeline uses Groq Llama for translation (see below). This endpoint is only called by verify_api_key() to confirm the Sarvam key is valid when the user first enters it.
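
The key check could be sketched along these lines. This is not the actual `verify_api_key()`: the helper names are hypothetical, and treating 401/403 as "invalid key" while raising on every other non-200 status is an assumption about how such a check might behave. The test string "नमस्ते" and the body fields follow what this document states for the endpoint.

```python
from typing import Any, Dict

SARVAM_TRANSLATE_URL = "https://api.sarvam.ai/translate"


def build_key_check_request(api_key: str) -> Dict[str, Any]:
    """Return keyword arguments suitable for requests.post(**kwargs)."""
    return {
        "url": SARVAM_TRANSLATE_URL,
        "headers": {"api-subscription-key": api_key},
        "json": {
            "source_language_code": "hi-IN",
            "target_language_code": "en-IN",
            "input": "नमस्ते",  # short test phrase keeps the call cheap
            "model": "mayura:v1",
        },
    }


def key_is_valid(status_code: int) -> bool:
    """Interpret the HTTP status of the check (assumed mapping)."""
    if status_code == 200:
        return True
    if status_code in (401, 403):
        return False
    # Rate limits or server errors say nothing about the key itself.
    raise RuntimeError(f"could not verify key (HTTP {status_code})")
```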
JSON body:

    {
      "source_language_code": "hi-IN",
      "target_language_code": "en-IN",
      "input": "text to translate",
      "model": "mayura:v1"
    }

SDK: groq Python package
Docs: https://console.groq.com/docs
Model: llama-3.3-70b-versatile
Used for: Translating Hindi/Hinglish transcripts to English with conversation context

    groq_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": system_prompt},  # includes rolling history
            {"role": "user", "content": hindi_text},       # current utterance
        ],
        temperature=0.1,  # low for consistent formal output
        max_tokens=300,
    )

A pure translation API translates each sentence independently. Llama receives the last 5 translated English sentences as context, allowing it to resolve pronouns like "woh" (he/she/they) to the actual person mentioned in prior sentences. This is critical for natural-sounding corporate meeting translation.
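
One way to keep the five-sentence rolling context is a bounded deque, sketched below. The class name `RollingContext`, its methods, and the wording used to embed the history into the system prompt are all assumptions for illustration; the real prompt lives in `_MEETING_PROMPT` and is not reproduced here.

```python
from collections import deque
from typing import Deque, Dict, List

HISTORY_SIZE = 5  # the last 5 translated English sentences are kept as context


class RollingContext:
    """Keeps recent translations and builds chat messages (hypothetical helper)."""

    def __init__(self, base_prompt: str, size: int = HISTORY_SIZE) -> None:
        self._base_prompt = base_prompt
        # deque(maxlen=...) silently drops the oldest entry when full.
        self._history: Deque[str] = deque(maxlen=size)

    def remember(self, english_sentence: str) -> None:
        self._history.append(english_sentence)

    def build_messages(self, hindi_text: str) -> List[Dict[str, str]]:
        system_prompt = self._base_prompt
        if self._history:
            lines = "\n".join(f"- {s}" for s in self._history)
            # Assumed wording; the history is appended to the system prompt.
            system_prompt += (
                "\n\nPrevious sentences (for pronoun resolution only):\n" + lines
            )
        return [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": hindi_text},
        ]
```

The returned list can be passed as `messages=` to the completion call, and each translated result is then fed back through `remember()` so that later utterances see it.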
The system prompt (_MEETING_PROMPT in sidecar_main.py) uses a two-step structure:
Step 1 — Translate: Handles pure Hindi, Hinglish (keep English parts, translate Hindi parts), and already-English input (pass through unchanged).
Step 2 — Contextualize: Replaces vague pronouns using rolling history, only when clearly identifiable.
Strict rules enforced in the prompt:
- Output only the final English text (no meta-commentary)
- Translation length must be proportional to input (no fabrication)
- History is for pronoun resolution only, never for adding new content
- Add "?" for interrogative patterns (kya, kaisa, kahan, kyun, kab, kaun)
- Preserve proper nouns, company names, and numbers exactly
The update_groq_key command verifies the Groq key by making a 1-token completion:

    test_client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "hi"}],
        max_tokens=1,
    )

This confirms both key validity and model access without costing significant tokens.
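
A hedged sketch of wrapping that probe in a boolean check is shown below. The function name `groq_key_works` is hypothetical, and treating every exception as a failed check is an assumption; the source does not describe how `update_groq_key` handles errors. The client is passed in so the function works with any object exposing `chat.completions.create`.

```python
from typing import Any


def groq_key_works(client: Any, model: str = "llama-3.3-70b-versatile") -> bool:
    """Return True if a 1-token completion succeeds with the given client."""
    try:
        client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "hi"}],
            max_tokens=1,  # keep the probe as cheap as possible
        )
    except Exception:
        # Assumption: bad key, missing model access, and network errors
        # are all reported as a failed check.
        return False
    return True
```

With the groq package installed, a call might look like `groq_key_works(Groq(api_key=key))` after `from groq import Groq`.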
All Sarvam REST calls go through call_api_with_retry() in translator.py:
- First attempt: if status 200 → return; if 4xx → raise immediately (no retry); if other error → fall through to retry
- Wait 100 ms
- Second attempt: raise_for_status() on any non-200
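
The retry steps above can be sketched as follows. This is an illustration, not the body of `call_api_with_retry()`: the name `call_with_one_retry`, the `send` callable, and the injectable `sleep` parameter are assumptions, and the sketch only considers HTTP status codes (it does not model network exceptions). The response object is expected to behave like `requests.Response`, whose `raise_for_status()` raises on 4xx and 5xx codes.

```python
import time
from typing import Any, Callable


def call_with_one_retry(
    send: Callable[[], Any],
    delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call send() once, retrying a single time on non-4xx failures."""
    response = send()
    if response.status_code == 200:
        return response
    if 400 <= response.status_code < 500:
        # Client errors will not succeed on retry, so surface them at once.
        response.raise_for_status()
    sleep(delay_s)  # 100 ms pause before the second attempt
    response = send()
    response.raise_for_status()  # second failure is final
    return response
```

Passing `sleep` as a parameter is a design choice for this sketch only; it lets tests run without actually waiting.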
The STT function (speech_to_text) bypasses this and makes a direct request, exposing the full Sarvam error response body in the exception message. This makes 400 errors debuggable from the app's TRACE log.
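
To show what "exposing the full error response body" can look like, here is a small sketch that turns a status code and raw body into either a transcript or an exception carrying the body text. The helper name `transcript_or_raise` and the exact message format are assumptions, not the behavior of `speech_to_text()` verbatim.

```python
import json


def transcript_or_raise(status_code: int, body_text: str) -> str:
    """Return the transcript on HTTP 200, else raise with the raw body."""
    if status_code != 200:
        # Including the body makes causes such as a wrong model name visible in logs.
        raise RuntimeError(f"Sarvam STT failed (HTTP {status_code}): {body_text}")
    return json.loads(body_text)["transcript"]
```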
| Operation | API | Approximate cost |
|---|---|---|
| STT (per utterance) | Sarvam | Billed per minute of audio |
| Translation | Groq | Billed per token (input + output) |
| TTS (per sentence) | Sarvam | Billed per character |
| API key verification | Sarvam | ~1 call to /translate with "नमस्ते" |
| Voice preview (cached) | None | Free — plays local WAV file |
| Voice preview (uncached) | Sarvam | 1 TTS call |