Gemini Live Agent Challenge 2026 Submission
A real-time spatial awareness system that turns any phone camera into an intelligent navigation companion for blind users — powered by Gemini Live API.
Live Demo: https://drishti-whn43ovjpq-uc.a.run.app
2.2 billion people worldwide have vision impairment. Existing AI solutions describe scenes after the fact — "I see a chair." But a blind person walking needs real-time spatial intelligence: "Stop! Chair 1 step ahead — step 2 steps right to go around it."
The gap isn't seeing. It's understanding space, time, and movement together.
Drishti is a voice companion that gives turn-by-turn walking directions to blind users through their phone camera and speaker. It doesn't just describe — it thinks, tracks, predicts, and guides.
Key behaviors:
- "Turn left in 2 steps, stairs going down"
- "Continue straight for 10 more steps"
- "Stop! Chair directly ahead, 1 step"
- Silence when the path is safe (silence = safety)
- Responds to questions in the user's language (Hindi, English, etc.)
Drishti uses a three-channel perception architecture where Python decides WHEN Gemini speaks, not WHAT.
View Interactive Architecture Diagram → (open locally or on GitHub Pages)
Phone Camera (1 FPS) + Microphone + Sensors
│
│ WebSocket (JSON)
▼
┌─────────────────────────────────────────────────────────┐
│ Python Backend (FastAPI on Cloud Run) │
│ │
│ Channel 1: Cloud Vision API ──► Object Detection │
│ (every 3rd frame, 200ms) Bounding boxes, labels │
│ │
│ Channel 2: Gemini generateContent ──► Scene Analysis │
│ (event-driven, 2-3s) Rich JSON: env, │
│ path, navigation │
│ │
│ Channel 3: Gemini Live API ──► Voice I/O │
│ (continuous) Speaks to user, │
│ hears questions │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ World Model (the BRAIN) │ │
│ │ - Absorbs: sensors, CV, cognitive, speech │ │
│ │ - Temporal Validator: is this still relevant? │ │
│ │ - 9-priority nudge system (P1→P9) │ │
│ │ - 4 dimensions: vigilance, urgency, │ │
│ │ spatial_confidence, verbosity │ │
│ │ - Goal system (explicit + implicit) │ │
│ │ - Decides: speak, stay silent, or inject context │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Output: nudge ──► Gemini Live speaks to user │
│ OR silence (deliberate decision) │
└─────────────────────────────────────────────────────────┘
| Channel | Purpose | Latency | Cost |
|---|---|---|---|
| Cloud Vision | Fast object detection, emergency tripwire (vehicles, large obstacles) | 200ms | Low |
| Gemini generateContent | Deep scene understanding, navigation actions, environment classification | 2-3s | Medium |
| Gemini Live | Voice conversation, natural speech output, user questions | Real-time | Continuous |
The core insight: perception results are about the PAST. By the time Cloud Vision detects a chair (200ms) or Gemini analyzes a scene (3s), the user has moved. They may have already walked past the obstacle, or turned away from it.
Every frame is stamped with a sensor snapshot (heading, step count, speed). Before any alert fires, the Temporal Validator checks:
- Has the user turned away (heading changed >45 degrees)? → Drop it (stale)
- Has the user walked past it (distance moved > obstacle distance)? → Drop it (stale)
- Is the obstacle less than 1.5m ahead? → URGENT alert (imminent)
- Otherwise → Normal alert (valid)
This eliminates the false-positive problem that plagues all camera-based navigation systems.
| Feature | How Drishti Uses It |
|---|---|
| Native audio I/O | Natural voice conversation — user asks questions, Drishti responds |
| Video frame input | 1 FPS camera stream for continuous spatial awareness |
| Affective dialog | Adapts tone — calm for safe paths, urgent for dangers |
| Proactive audio | System-initiated speech via send_client_content(turn_complete=True) |
| Input/output transcription | Speech-to-text for goal extraction and UI display |
| Context window compression | Long sessions without losing spatial state |
| generateContent | Separate cognitive perception channel for deep scene analysis |
| Multi-language | Responds in user's language (tested with Hindi + English) |
- Gemini Live API —
gemini-2.5-flash-native-audio-preview-12-2025for voice - Gemini generateContent —
gemini-2.5-flashfor cognitive scene analysis - Google Cloud Vision API — Fast object detection and OCR
- Python / FastAPI — Backend server with WebSocket
- Google Cloud Run — Serverless deployment
- HTML/JS PWA — Frontend (works on any phone browser, no app install)
drishti/
├── backend/
│ ├── main.py # FastAPI WebSocket server, v4.2 pipeline
│ ├── gemini_session.py # Gemini Live API session manager
│ ├── cognitive_perception.py # Gemini generateContent scene analysis
│ ├── cognitive_trigger.py # Event-driven cognitive call decisions
│ ├── cloud_vision.py # Google Cloud Vision API client
│ ├── world_model.py # The BRAIN — 9-priority nudge system
│ ├── temporal_validator.py # v4.2 — validates perception vs movement
│ ├── sensor_processor.py # Phone sensor processing + snapshots
│ ├── goal_system.py # Explicit + implicit goal management
│ ├── implicit_goals.py # Context-derived navigation goals
│ ├── speech_processing.py # Speech event detection + interpretation
│ ├── plugins.py # Plugin configs (blind_nav, baby, elderly, security)
│ ├── system_prompt.py # Gemini Live voice prompt
│ ├── frame_quality.py # Frame quality gate
│ └── logger.py # Structured JSON logging + session stats
├── frontend/
│ ├── index.html # Demo-ready UI with brain dashboard
│ ├── app.js # Camera, mic, WebSocket, brain panel rendering
│ ├── sensors.js # Phone sensor capture + step detection
│ ├── style.css
│ └── manifest.json # PWA manifest
├── docs/
│ ├── ARCHITECTURE.md # Detailed architecture documentation
│ └── ...
├── Dockerfile # Cloud Run container
├── deploy.sh # One-command Cloud Run deployment
├── requirements.txt
└── .env.example
- Open on your phone: Visit https://drishti-whn43ovjpq-uc.a.run.app on your phone's browser (Chrome on Android or Safari on iOS)
- Grant permissions: Tap "Allow" for camera, microphone, and motion sensors when prompted
- Select mode: Choose "Blind Navigation" from the mode selector (default)
- Tap Start: The camera feed appears full-screen with a brain dashboard below
- Walk around: Point your phone camera forward as if you're walking. Hold it at chest height, rear camera facing ahead
- Listen: Drishti will speak turn-by-turn directions through your phone speaker:
- "Continue straight" when path is clear
- "Turn left in 2 steps" at corridors
- "Stop! Chair ahead" for obstacles
- Silence means the path is safe — this is intentional
- Ask questions: Speak naturally — "What's in front of me?" or "Where are the stairs?" — Drishti responds conversationally
- Watch the brain panel: Scroll down to see real-time system intelligence (environment type, urgency bars, path status, temporal validation)
Best test scenarios:
- Walk through a hallway with turns
- Approach furniture or obstacles
- Walk up/down stairs
- Walk toward a person
Note: Use headphones/earbuds for the best experience — the phone speaker works but earbuds prevent echo from the microphone picking up Drishti's voice.
# 1. Clone
git clone https://github.com/vikasjoel/drishti.git
cd drishti
# 2. Install dependencies
pip install -r requirements.txt
# 3. Configure API keys
cp .env.example .env
# Edit .env and add:
# GOOGLE_API_KEY=your_gemini_api_key (paid tier recommended)
# GOOGLE_CLOUD_PROJECT=your_gcp_project_id
# 4. Set up Cloud Vision credentials (for object detection)
# Download a GCP service account JSON with Vision API enabled
# Set: GOOGLE_APPLICATION_CREDENTIALS=path/to/service-account.json
# 5. Run the server
uvicorn backend.main:app --host 0.0.0.0 --port 8080
# 6. Open on your phone (must be on the same WiFi network)
# Find your computer's local IP: ifconfig | grep inet
# On phone browser, go to: http://<your-computer-ip>:8080
# IMPORTANT: Camera/mic require HTTPS in production,
# but localhost/local-IP works over HTTP for testing# Set environment variables in deploy.sh, then:
chmod +x deploy.sh
./deploy.sh| Time | What Happens |
|---|---|
| 0-3s | Connection establishes, camera feed appears |
| 3-5s | First cognitive analysis runs, Drishti greets you |
| 5-10s | Environment detected (room, corridor, stairs), first navigation direction |
| Ongoing | Directions every 3-6s when scene changes; silence when path is safe |
| On obstacle | "Stop! [object] ahead" with avoidance directions |
| On question | Natural voice response about the scene |
| Variable | Required | Description |
|---|---|---|
GOOGLE_API_KEY |
Yes | Gemini API key (paid tier recommended for rate limits) |
GOOGLE_CLOUD_PROJECT |
Yes | GCP project ID (for Cloud Vision) |
GOOGLE_APPLICATION_CREDENTIALS |
For Cloud Vision | Path to service account JSON |
DEFAULT_PLUGIN |
No | Plugin to use (default: blind_navigation) |
COGNITIVE_MODEL |
No | Model for generateContent (default: gemini-2.5-flash) |
LOG_LEVEL |
No | Logging level (default: INFO) |
Drishti is a universal spatial intelligence framework. Swap a plugin config to change the entire behavior:
| Plugin | Use Case | Behavior |
|---|---|---|
blind_navigation |
Visually impaired user walking | Turn-by-turn directions, obstacle alerts, silence = safe |
baby_monitor |
Stationary camera watching baby | Alert on unusual movement, boundary approach |
elderly_care |
Stationary camera for elderly | Fall detection, prolonged stillness alerts |
security |
Stationary security camera | Person detection, zone intrusion alerts |
The 9-priority nudge system ensures the most important information gets through:
| Priority | Trigger | Example |
|---|---|---|
| P1 (1.0) | Vehicle emergency | "Stop! Car ahead!" |
| P2 (0.9) | Safety alert / imminent obstacle | "Stop! Chair 1 step ahead!" |
| P3 (0.8) | Fast-approaching object | "Person approaching, 10 o'clock" |
| P4 (0.7) | Environment transition | "You've reached the stairs. Surface: stairs_down." |
| P5 (0.65) | Goal match | "Found the elevator — ahead on your right." |
| P6 (0.6) | New obstacle | "Table about 3 steps ahead." |
| P7 (0.5) | Navigation action / orientation | "Turn left in 2 steps." |
| P8 (0.4) | Proactive info | "Path is clear. Hallway ahead." |
| P9 (0.3) | Memory augment (silent) | Context refresh for Gemini Live |
The frontend shows a live brain dashboard that demonstrates the system's intelligence:
- Camera feed with speech overlay ("Drishti says: Turn left in 2 steps")
- Environment badge — changes from ROOM → STAIRS → CORRIDOR as user walks
- 4 dimension bars — Alertness, Urgency, Confidence, Voice (animated in real-time)
- Path status — Clear / Blocked with obstacle name
- Temporal validation — Shows "IMMINENT 0.8m" (pulsing red) or "Dropped: user turned 62 degrees" (gray)
- "SAFE — PATH CLEAR" indicator appears when the system is intentionally silent
| Metric | Value |
|---|---|
| Cognitive latency (generateContent) | 1.8–4.0s avg |
| Cloud Vision latency | 130–300ms |
| Gemini Live connection | ~100ms |
| Frames processed | 1 FPS |
| Nudges per 2-min session | ~14 (10 proactive, 4 silent) |
| Temporal stale drops | ~30% of perceptions correctly filtered |
| End-to-end (see obstacle → speak) | 2-4 seconds |
Built by Vikas Joel for the Gemini Live Agent Challenge 2026.
MIT