A Hindi-first voice PWA that helps underserved Indian citizens discover and apply for government schemes through natural conversation — no literacy required, no bureaucratic maze.
Over 90 crore Indian citizens are eligible for government welfare schemes but miss out, not because they aren't entitled, but because discovering schemes, checking eligibility, and completing applications requires navigating opaque English-language portals that assume a level of literacy and internet fluency many of them do not have.
Sarkari Saathi eliminates that barrier. A citizen speaks a question in Hindi — "Mujhe gas cylinder chahiye, kya karna hoga?" — and receives a spoken, jargon-free answer naming the right scheme (Ujjwala Yojana), citing the official source, and giving one clear next step (helpline or CSC address).
Architected a 5-stage async voice pipeline — ffmpeg → Sarvam Saaras STT → OpenAI + Pinecone RAG → Sarvam-m LLM → Bulbul TTS — that reduces government scheme discovery from hours of confusing form-filling to a single spoken question, delivering structured answers and eligibility checks in ~15 s end-to-end on Render's free tier. Reduced per-query wall-clock time by 5–15 s by running TTS synthesis and eligibility checking concurrently via asyncio.gather, cutting latency to max(TTS, eligibility) instead of their sum.
- Key Features
- Why Sarkari Saathi
- Tech Stack
- Project Layout
- Voice Pipeline & AI Routing
- NLP & Prompt Design
- API Endpoints
- Data Models
- Getting Started
- Known Limitations & v2 Ideas
- Contributing
- Author
- Engineered a real-time Hindi voice pipeline: records mic audio on mobile, normalizes via ffmpeg to 16 kHz mono WAV, transcribes with Sarvam Saaras (hi-IN), retrieves top-5 scheme chunks from Pinecone, reasons with Sarvam-m, and plays back spoken Hindi audio via Bulbul TTS, end-to-end in ~15 s on Render's free tier.
- Reduced per-query latency by 5–15 s by parallelizing TTS synthesis and eligibility checking via `asyncio.gather`: both stages are data-independent, so wall-clock time drops from the sequential sum to `max(TTS, eligibility)`.
- Shipped a zero-build-toolchain PWA: vanilla HTML/CSS/JS with a service worker and Web App Manifest, installable on low-end Android devices with no Node.js, bundler, or app store required.
- Optimized for 2G and flaky networks: a switchable 8 kHz TTS mode halves audio payload size for low-bandwidth connections; all TTS files are served with 24 h `Cache-Control: immutable` headers so the service worker can re-serve them without hitting the backend.
- Integrated a RAG corpus of 7 Central Government schemes: PDFs → structured JSON (eligibility rules, documents required, 5-step application guide, helpline) → OpenAI `text-embedding-3-small` → Pinecone index `voice-matters-schemes`.
- Productionized a hallucination guard: every LLM response is validated against the live scheme name list from Postgres; answers referencing unindexed schemes are soft-refused with helpline 14434, never fabricated.
- Automated jargon sanitization: a regex substitution table enforces conversational Hindi post-generation (process → kaam, verification → jaanch) even if the LLM slips, ensuring Bulbul TTS reads naturally to an 8th-class-pass listener.
- Deployed declaratively on Render via a `render.yaml` Blueprint: Dockerized FastAPI backend in the Singapore region (minimizes Sarvam API round-trip latency) plus a static frontend, with zero-downtime auto-deploy on every push to `main`.
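The audio normalization step mentioned above (any recording → 16 kHz mono WAV) can be sketched as follows. This is a minimal illustration, not the project's actual `audio.py`: the function names are hypothetical, and only the standard ffmpeg options `-i`, `-ar`, `-ac`, and `-y` are assumed.

```python
import subprocess
from pathlib import Path


def build_normalize_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command that converts input audio to 16 kHz mono.

    ffmpeg infers the WAV container from the .wav extension of `dst`.
    """
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", str(src),  # input recording (for example, a browser upload)
        "-ar", "16000",  # resample to 16 kHz
        "-ac", "1",      # downmix to a single (mono) channel
        str(dst),
    ]


def normalize(src: Path, dst: Path) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_normalize_cmd(src, dst), check=True, capture_output=True)
```

Keeping command construction separate from execution makes the arguments easy to inspect or unit-test without ffmpeg installed.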
| Who | Why this works |
|---|---|
| Rural citizens (primary) | Hindi voice input removes the literacy and English barrier entirely; no app download needed — installable as a PWA on any Android. |
| NGO and CSC field workers | Walk a beneficiary through eligibility and document checklists on a basic Android phone, even on a 2G connection. |
| Policy researchers | Structured telemetry (per-stage latency, retrieval hit rate, confidence, feedback votes) surfaces exactly where scheme discovery breaks down at scale. |
| Civic tech builders | The voice + RAG architecture is language- and corpus-agnostic — reusable for any regional language × government knowledge base. |
Frontend
Backend
AI / ML
Infrastructure
Full file tree:
```
Voice-Matters/
├── backend/  # FastAPI application
│   ├── main.py  # App entry point, CORS, static mounts, startup hook
│   ├── requirements.txt  # Python deps (no PyTorch; avoids 2 GB Render build bloat)
│   ├── Dockerfile  # Python 3.11 + ffmpeg image
│   ├── render.yaml  # Render Blueprint: 2-service declarative deploy
│   │
│   ├── api/v1/
│   │   ├── conversation.py  # Voice / chat endpoints, feedback, action tracking
│   │   ├── schemes.py  # Scheme detail, explain (cached), apply-steps
│   │   └── admin.py  # Admin routes (scheme management)
│   │
│   ├── clients/
│   │   ├── sarvam_client.py  # STT (Saaras v2.5), LLM (sarvam-m), TTS (Bulbul v2) + retry pool
│   │   ├── openai_client.py  # text-embedding-3-small
│   │   ├── pinecone_client.py  # Pinecone index upsert + similarity query
│   │   └── local_embedder.py  # sentence-transformers fallback (dev / EMBED_PROVIDER=local)
│   │
│   ├── services/
│   │   ├── voice_pipeline.py  # 5-stage orchestrator: norm→STT→RAG→LLM→[TTS ‖ eligibility]
│   │   ├── answer_service.py  # LLM answer + hallucination guard + jargon sanitizer
│   │   ├── eligibility_service.py  # LLM fact extraction: maps query to eligibility rules
│   │   ├── rag_service.py  # OpenAI embed → Pinecone top-k retrieve
│   │   ├── conversation_service.py  # Conversation / message CRUD, feedback recording
│   │   └── audio.py  # ffmpeg normalization (WAV 16 kHz mono)
│   │
│   ├── models/
│   │   ├── conversation.py  # Conversation, Message, UserAction, SchemeMeta
│   │   ├── feedback.py  # Feedback (rating / vote / chip tags)
│   │   ├── telemetry.py  # Per-pipeline-run timing JSONB
│   │   ├── scheme_explain_cache.py  # (scheme_id, length) → explanation text + audio URL
│   │   └── db.py  # asyncpg engine, SessionLocal, Neon URL normalization
│   │
│   ├── prompts/
│   │   └── system_prompts.py  # SYSTEM_PROMPT_HINDI, RESPONSE_TEMPLATE, FEW_SHOT_EXAMPLES
│   │
│   ├── data/
│   │   └── scheme_corpus.py  # In-memory loader for processed scheme JSON
│   │
│   ├── scripts/
│   │   └── ingest_schemes.py  # PDF → JSON → embedding → Pinecone upsert pipeline
│   │
│   ├── alembic/  # DB migrations (3 versions)
│   └── tests/
│       ├── persona_scripts.py  # Persona-based voice test scripts
│       ├── run_regression.py  # End-to-end regression runner
│       └── test_persona.py  # Pytest persona tests
│
├── web/  # Static PWA (zero build toolchain)
│   ├── index.html  # Main app: voice/chat UI, scheme cards, feedback
│   ├── admin.html  # Admin dashboard
│   ├── manifest.json  # Web App Manifest (installable on Android)
│   ├── sw.js  # Service Worker (offline caching strategy)
│   └── icons/  # App icons (192 px, 512 px)
│
├── scheme-corpus/
│   └── schemes/
│       ├── raw/  # Source PDFs: immutable, .gitignored
│       └── processed/  # 7 normalized Central Govt scheme JSONs
│           ├── pmjdy.json  # PM Jan Dhan Yojana
│           ├── pmuy.json  # PM Ujjwala Yojana 2.0
│           ├── pmmy.json  # PM Mudra Yojana
│           ├── pmjjby.json  # PM Jeevan Jyoti Bima Yojana
│           ├── kcc.json  # Kisan Credit Card
│           ├── day-nrlm.json  # DAY-NRLM (rural livelihoods)
│           └── mmsby.json  # MMSBY
│
├── docs/
│   ├── architecture.md
│   ├── api-contracts.md
│   └── build-log.md  # Prompt-by-prompt build history
│
└── .github/workflows/
    └── ci.yml  # ruff lint on every push
```
```mermaid
sequenceDiagram
    participant User as 📱 User (Android)
    participant PWA as Web PWA
    participant API as FastAPI Backend
    participant Sarvam as Sarvam AI
    participant OAI as OpenAI
    participant Pine as Pinecone
    participant DB as PostgreSQL
    User->>PWA: Records voice (Hindi)
    PWA->>API: POST /api/v1/conversation/{id}/voice (multipart audio)
    Note over API: Stage 1 - ffmpeg: normalize to WAV 16kHz mono
    API->>Sarvam: Stage 2 - Saaras v2.5 STT (hi-IN)
    Sarvam-->>API: Hindi transcript
    API->>OAI: Stage 3a - text-embedding-3-small
    OAI-->>API: 1536-dim vector
    API->>Pine: Stage 3b - top-5 similarity search (voice-matters-schemes)
    Pine-->>API: Retrieved scheme chunks
    API->>Sarvam: Stage 4 - sarvam-m LLM (RAG context + system prompt + few-shots)
    Sarvam-->>API: Hindi response text (chain-of-thought stripped)
    par Stage 5a - TTS
        API->>Sarvam: Bulbul v2 TTS (anushka, 22050 Hz or 8000 Hz low)
        Sarvam-->>API: MP3 audio bytes
        API->>API: Write /static/audio/{uuid}.mp3 (24h cache)
    and Stage 5b - Eligibility (parallel)
        API->>Sarvam: sarvam-m eligibility fact extraction
        Sarvam-->>API: Eligibility rule results
    end
    API->>DB: Persist conversation turns + telemetry (per-stage ms)
    API-->>PWA: JSON envelope (transcript, response_text, audio_url, top_3_schemes, eligibility)
    PWA->>User: Plays spoken Hindi audio + renders scheme cards
```
Parallel optimization: Stages 5a (TTS) and 5b (eligibility) share no data and run concurrently via `asyncio.gather`. This saves 5–15 s per query: wall-clock time drops from `TTS_ms + eligibility_ms` to `max(TTS_ms, eligibility_ms)`.
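The effect of running the two final stages concurrently can be sketched with stand-in coroutines. Both `synthesize_tts` and `check_eligibility` are hypothetical placeholders that only sleep; they are not the project's real Sarvam client calls.

```python
import asyncio
import time


async def synthesize_tts(text: str) -> str:
    """Hypothetical stand-in for the TTS request; sleeps instead of calling an API."""
    await asyncio.sleep(0.2)
    return "/static/audio/example.mp3"


async def check_eligibility(query: str) -> dict:
    """Hypothetical stand-in for the eligibility extraction request."""
    await asyncio.sleep(0.2)
    return {"checked": True}


async def final_stage(answer_text: str, query: str) -> tuple[str, dict, float]:
    """Run both independent stages concurrently and report elapsed seconds."""
    start = time.perf_counter()
    # gather schedules both coroutines at once and returns results in call order
    audio_url, eligibility = await asyncio.gather(
        synthesize_tts(answer_text),
        check_eligibility(query),
    )
    return audio_url, eligibility, time.perf_counter() - start


if __name__ == "__main__":
    url, elig, elapsed = asyncio.run(final_stage("answer", "question"))
    print(url, elig, f"{elapsed:.2f}s")
```

With two 0.2 s stand-ins, the elapsed time is close to 0.2 s rather than 0.4 s, which mirrors the `max` versus sum behavior described above.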
```mermaid
flowchart TD
    RAG[RAG retrieves top-5 chunks from Pinecone] --> CHECK{Retrieved chunks empty?}
    CHECK -- Yes --> REFUSE[Soft refusal → helpline 14434]
    CHECK -- No --> LLM[sarvam-m reasons with retrieved context]
    LLM --> GUARD{Hallucination guard:\nresponse names a scheme\nnot in corpus?}
    GUARD -- Yes --> REFUSE
    GUARD -- No --> JARGON[Jargon sanitizer:\nEnglish terms → conversational Hindi]
    JARGON --> TTS_CHECK{answer.refused?}
    TTS_CHECK -- Yes --> DROP[Drop eligibility results - noise]
    TTS_CHECK -- No --> OUT[Deliver response + TTS + eligibility]
```
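The routing above can be expressed as a small decision function. This is a sketch under assumptions: the `Answer` type, the function names, and the English refusal text are hypothetical, and how scheme names are extracted from the LLM response is left to the caller.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

HELPLINE = "14434"


@dataclass
class Answer:
    text: str
    refused: bool


def decide(chunks: list[str], draft: str,
           mentioned_schemes: Iterable[str],
           indexed_schemes: Iterable[str]) -> Answer:
    """Soft-refuse when retrieval is empty or an unindexed scheme is named."""
    refusal = Answer(text=f"Please call helpline {HELPLINE}.", refused=True)
    if not chunks:
        return refusal  # nothing retrieved, so nothing to ground an answer on
    if set(mentioned_schemes) - set(indexed_schemes):
        return refusal  # response names a scheme outside the indexed list
    return Answer(text=draft, refused=False)


def envelope(answer: Answer, eligibility: Optional[dict]) -> dict:
    """Drop eligibility results when the answer was refused."""
    return {
        "response_text": answer.text,
        "eligibility": None if answer.refused else eligibility,
    }
```

A refused answer still reaches the user as a spoken redirect to the helpline; only the eligibility payload is suppressed.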
| Component | Detail |
|---|---|
| Persona | "Sarkari Saathi" — a trusted didi/bhaiya at a local bank or scheme office; tone calibrated to an 8th-class-pass listener |
| Language | Devanagari Hindi throughout — Sarvam Bulbul TTS reads Latin-script words with an English accent, so even scheme names are written in Devanagari form (PMJDY → जन धन योजना) |
| Response template | 4-part structure enforced in every answer: (1) Acknowledge → (2) Mirror + key fact + concrete number → (3) Cite source domain → (4) One clear next step |
| Few-shot examples | 5 grounded worked examples: Jan Dhan, KCC, Ujjwala, fake-scheme refusal, sensitive-data refusal (Aadhaar/OTP) |
| Jargon substitution | 8-entry table enforced at prompt level and post-generation — belt-and-suspenders so LLM slippage never reaches TTS |
| Refusal criteria | Empty RAG retrieval OR LLM references a scheme not in the Postgres hallucination-guard list → redirect to helpline 14434 |
| Model | Sarvam-m (reasoning model); <think>…</think> chain-of-thought blocks stripped by regex before TTS |
| TTS truncation | Responses capped at 450 chars at last sentence boundary (।, ., \n) to stay within Sarvam Bulbul's ~500-char limit |
| Scheme explain cache | LLM explanation + TTS audio cached in Postgres by (scheme_id, length) — repeat requests skip LLM and Sarvam entirely |
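The three post-generation steps in the table (chain-of-thought stripping, jargon substitution, TTS truncation) can be sketched as below. The function names are hypothetical, and the jargon table holds only the two pairs this README names; the other six entries are not shown here.

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

# Only the two substitutions named in this README; the full table has 8 entries.
JARGON = {"process": "kaam", "verification": "jaanch"}
JARGON_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, JARGON)) + r")\b", re.IGNORECASE
)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks, including multi-line ones."""
    return THINK_RE.sub("", text).strip()


def sanitize_jargon(text: str) -> str:
    """Replace listed jargon terms with their conversational equivalents."""
    return JARGON_RE.sub(lambda m: JARGON[m.group(1).lower()], text)


def truncate_for_tts(text: str, limit: int = 450) -> str:
    """Cut at the last sentence boundary within `limit` characters."""
    if len(text) <= limit:
        return text
    window = text[:limit]
    cut = max(window.rfind(ch) for ch in ("।", ".", "\n"))
    # With no boundary in the window, fall back to a hard cut at the limit.
    return window[: cut + 1].rstrip() if cut != -1 else window
```

Stripping runs first so that reasoning text never counts toward the 450-character budget.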
| Method | Path | Purpose |
|---|---|---|
| `POST` | `/api/v1/conversation/{id}/voice` | Submit audio → transcript + spoken response + scheme cards + eligibility |
| `POST` | `/api/v1/conversation/{id}/chat` | Submit text → identical response envelope as /voice (no STT/TTS) |
| `GET` | `/api/v1/conversation/{id}/messages` | List all messages in a conversation |
| `GET` | `/api/v1/conversations` | List conversations grouped |
| `POST` | `/api/v1/conversation/{id}/feedback` | Record rating / thumbs vote / chip tags per message |
| `POST` | `/api/v1/conversation/{id}/action` | Track scheme action steps taken by user |
| `GET` | `/api/v1/schemes/{id}` | Full scheme metadata (benefits, eligibility rules, documents needed, helpline) |
| `GET` | `/api/v1/schemes/{id}/explain?length=short\|medium\|long` | LLM-generated Hindi explanation + Bulbul TTS audio (DB-cached by scheme + length) |
| `GET` | `/api/v1/schemes/{id}/apply-steps` | 5-step structured application guide (Devanagari) |
| `GET` | `/api/v1/messages/{id}/explanation` | Per-message "Samjhao" payload + community up/down vote stats aggregated by top scheme |
| `GET` | `/health` | Health check ({"status": "ok"}) |
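As a small client-side illustration, the explain endpoint's URL can be built and validated as below. The helper name is hypothetical, and the scheme ID `pmuy` in the usage note is an assumption based on the corpus file names; the actual ID format is not confirmed here.

```python
from urllib.parse import quote, urlencode

ALLOWED_LENGTHS = ("short", "medium", "long")


def explain_url(base_url: str, scheme_id: str, length: str = "short") -> str:
    """Build the URL for GET /api/v1/schemes/{id}/explain."""
    if length not in ALLOWED_LENGTHS:
        raise ValueError(f"length must be one of {ALLOWED_LENGTHS}")
    # Percent-encode the ID so unusual characters cannot alter the path
    path = f"/api/v1/schemes/{quote(scheme_id, safe='')}/explain"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'length': length})}"
```

For a local backend on port 8080, `explain_url("http://localhost:8080", "pmuy", "medium")` yields `http://localhost:8080/api/v1/schemes/pmuy/explain?length=medium`.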
| Model | Key Fields |
|---|---|
| `Conversation` | id (UUID PK), created_at |
| `Message` | id, conversation_id, role (user/assistant), modality (voice/text), content_text, content_audio_url, retrieved_schemes (JSONB), sources, confidence, eligibility_results |
| `UserAction` | id, conversation_id, scheme_id, action, step_number |
| `SchemeMeta` | scheme_id (PK), name, ministry, summary |
| `Feedback` | id, conversation_id, message_id (FK → Message), rating (int), comment |
| `Telemetry` | id, event_type, payload (JSONB: norm_ms, stt_ms, rag_ms, llm_ms, elig_ms, tts_ms, total_ms + outcome flags) |
| `SchemeExplainCache` | (scheme_id, length) composite PK, explanation_text_hi, explanation_audio_url |
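The `SchemeExplainCache` lookup pattern, keyed by the `(scheme_id, length)` composite, can be illustrated with an in-memory stand-in. This sketch assumes a plain dict in place of the Postgres table and a hypothetical `generate` callback in place of the LLM and TTS calls.

```python
from typing import Callable

CacheKey = tuple[str, str]    # (scheme_id, length)
CacheValue = tuple[str, str]  # (explanation_text_hi, explanation_audio_url)


class ExplainCache:
    """In-memory stand-in for a table keyed by (scheme_id, length)."""

    def __init__(self, generate: Callable[[str, str], CacheValue]) -> None:
        self._generate = generate
        self._rows: dict[CacheKey, CacheValue] = {}

    def get(self, scheme_id: str, length: str) -> CacheValue:
        key = (scheme_id, length)
        if key not in self._rows:
            # Cache miss: pay for generation once, then store the result
            self._rows[key] = self._generate(scheme_id, length)
        return self._rows[key]
```

Because the length is part of the key, a short and a long explanation of the same scheme are cached as separate rows.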
- Python 3.11+
- `ffmpeg` on PATH (`brew install ffmpeg` / `apt install ffmpeg`)
- API keys: `SARVAM_API_KEY`, `OPENAI_API_KEY`, `PINECONE_API_KEY`
- PostgreSQL connection string (Neon serverless works out of the box)
1. Clone

```bash
git clone https://github.com/yatinbhalla/Voice-Matters.git sarkari-saathi
cd sarkari-saathi
```

2. Backend setup

```bash
cd backend
python3.11 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

3. Configure environment

```bash
cp .env.example .env
# Fill in your keys (see the variables below)
```

Required .env variables

```env
SARVAM_API_KEY=your_sarvam_key
OPENAI_API_KEY=your_openai_key
PINECONE_API_KEY=your_pinecone_key
PINECONE_INDEX_NAME=voice-matters-schemes
DATABASE_URL=postgresql://user:pass@host/db
FRONTEND_ORIGIN=http://localhost:8000
ENVIRONMENT=development
EMBED_PROVIDER=openai # or "local" (needs: pip install sentence-transformers)
```

4. Run database migrations

```bash
alembic upgrade head
```

5. Start backend (different port from the frontend)

```bash
uvicorn main:app --reload --port 8080
# API: http://localhost:8080 | Docs: http://localhost:8080/docs
```

6. Start frontend

```bash
cd ../web
python3 -m http.server 8000
# App: http://localhost:8000
```

7. (Optional) Ingest scheme corpus into Pinecone

```bash
cd backend
python scripts/ingest_schemes.py
```

Push `render.yaml` to your repo, then in the Render dashboard: New + → Blueprint → connect repo → Apply. Render creates both services and prompts for the secret env vars.
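Before starting the backend locally or deploying, it can help to confirm that the variables listed under "Required .env variables" are set. The helper below is a hypothetical sketch, not part of the repository; it treats all eight listed names as required, which is an assumption since some may have defaults in the real app.

```python
import os
from collections.abc import Mapping

REQUIRED_VARS = (
    "SARVAM_API_KEY",
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
    "DATABASE_URL",
    "FRONTEND_ORIGIN",
    "ENVIRONMENT",
    "EMBED_PROVIDER",
)


def missing_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    print("All required environment variables are set.")
```

Note that this reads the process environment, so the `.env` file must already be loaded (for example by the shell or a dotenv loader) for the check to see its values.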
Current limitations
- Scheme corpus covers 7 Central Government schemes; state-level and sector-specific schemes are not yet indexed.
- Sarvam Bulbul has a ~500 char ceiling per TTS request — long responses are truncated at the nearest sentence boundary.
- Render free tier sleeps after 15 min idle; first post-sleep request incurs a ~10 s cold start.
- Scheme JSON is static — eligibility changes (e.g., revised income ceilings) require a manual re-ingest cycle.
- TTS audio files accumulate in `/static/audio/` with no TTL cleanup in production.
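One possible way to address the audio accumulation limitation is a periodic purge of old files. The sketch below is not implemented in the project; the function name is hypothetical, and the 24 h default is an assumption chosen to match the cache lifetime mentioned for TTS files.

```python
import time
from pathlib import Path
from typing import Optional


def purge_old_audio(audio_dir: Path, max_age_s: float = 24 * 3600,
                    now: Optional[float] = None) -> list[Path]:
    """Delete .mp3 files whose modification time is older than max_age_s."""
    now = time.time() if now is None else now
    removed = []
    for path in audio_dir.glob("*.mp3"):
        if now - path.stat().st_mtime > max_age_s:
            path.unlink()
            removed.append(path)
    return removed
```

A function like this could be run from a scheduled job or a startup hook; cached explanation audio that must outlive the TTL would need to be stored elsewhere or excluded.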
v2 ideas
- Expand corpus to 50+ schemes including state-level, MSME, and agriculture schemes.
- Stream TTS audio chunks so playback starts before synthesis completes.
- Build a GPS → nearest CSC (Common Service Centre) lookup into every next-step response.
- Add a post-session CSAT survey to close the product feedback loop.
- Extend to Hinglish and regional languages (Tamil, Telugu, Bengali) via Sarvam's multilingual models.
Sarkari Saathi is built prompt-by-prompt with every step recorded in docs/build-log.md. Whether you want to add a new scheme JSON, sharpen a jargon-substitution rule, or fix a frontend accessibility issue — contributions are very welcome.
- Issues: Open a GitHub Issue labeled `scheme-request`, `bug`, or `product-feedback`.
- PRs: Fork → feature branch → PR against `main`. Include the relevant scheme ID or pipeline stage in the PR title.
- Product feedback: Not sure it's a bug? Open a Discussion. I especially welcome input from NGO workers, CSC operators, or civic tech practitioners who've tried it in the field.