An AI-powered assistive navigation system for visually impaired individuals.
This project analyzes visual scenes using computer vision and Vision-Language Models to provide real-time audio guidance. The system detects objects in the environment, generates descriptive captions of the scene, and converts this information into speech so that visually impaired users can understand their surroundings.
The system integrates object detection (YOLOv8), image captioning (BLIP), and text-to-speech (gTTS) in a multi-step pipeline. Users can upload images, process videos, or use a live webcam to receive spoken navigation guidance.
Object Detection (YOLOv8): Detects important objects such as people, vehicles, chairs, buses, and obstacles in real time.
Image Captioning (BLIP): Generates natural language descriptions of the scene using a Vision-Language Model.
Navigation Alerts: Combines detected objects and scene captions to produce meaningful navigation guidance.
The system supports multiple input sources:
Image Upload: Analyze static images to generate scene descriptions.
Video Processing: Extract frames from videos and analyze each scene.
Live Webcam Navigation: Provide real-time environmental descriptions.
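For video input, analyzing every frame is usually wasteful; a common approach is to sample one frame per fixed time interval and run only those through the pipeline. A minimal sketch of the sampling step (the function name and parameters are illustrative, not taken from the project's `video_processing.py`):

```python
def frames_to_sample(total_frames: int, fps: float, interval_s: float = 1.0) -> list[int]:
    """Return the frame indices to analyze, one per `interval_s` seconds.

    A stride of at least 1 frame is enforced so very short clips still
    yield at least one sampled frame.
    """
    stride = max(1, round(fps * interval_s))
    return list(range(0, total_frames, stride))
```

For a 10-second clip at 30 fps this selects frames 0, 30, 60, ..., 270; each selected frame is then treated like a static image by the detection and captioning stages.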
Text-to-Speech Conversion (gTTS): Convert generated scene descriptions into spoken guidance.
Accessibility Support: Designed to help visually impaired individuals understand their surroundings through audio feedback.
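The Navigation Alerts step can be understood as text composition: merge the detector's object labels with the BLIP caption into a single sentence to be spoken aloud. A pure-Python sketch (the function name and phrasing are illustrative assumptions, not the project's exact logic):

```python
from collections import Counter

def compose_guidance(objects: list[str], caption: str) -> str:
    """Combine detected object labels and a scene caption into one alert string.

    Duplicate labels are counted (e.g. "2 person") so the spoken
    message stays short even in crowded scenes.
    """
    if not objects:
        return f"Scene: {caption}."
    counts = Counter(objects)
    parts = [f"{n} {label}" if n > 1 else label for label, n in counts.items()]
    return f"Ahead: {', '.join(parts)}. Scene: {caption}."
```

For example, `compose_guidance(["person", "person", "car"], "a busy street")` yields `"Ahead: 2 person, car. Scene: a busy street."`, which is then handed to the text-to-speech stage.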
Gradio Interface: Provides an easy-to-use web application where users can:
- Upload images
- Upload videos
- Use a live webcam
The interface displays:
- Detected objects
- Scene descriptions
- Audio guidance
```shell
git clone https://github.com/BhaveshBhakta/Smart-Navigation-Stick-Using-VLM.git
cd Smart-Navigation-Stick-Using-VLM
pip install -r requirements.txt
```
YOLOv8 will automatically download the model weights when running for the first time.
```shell
python app.py
```
After running the application, open the local Gradio interface:
`http://127.0.0.1:7860`
- Capture visual input from images, videos, or webcam.
- Detect objects in the environment using YOLOv8.
- Generate scene descriptions using the BLIP Vision-Language Model.
- Combine detected objects and captions to create navigation guidance.
- Convert navigation text into audio using Google Text-to-Speech.
- Deliver spoken guidance to the user.
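The steps above can be glued together as one small pipeline function. In the sketch below the heavy models are passed in as plain callables, so the control flow can be shown (and tested) without loading YOLOv8, BLIP, or gTTS; all names and signatures are illustrative:

```python
from typing import Callable

def run_pipeline(
    frame,                                  # raw image, e.g. a numpy array
    detect: Callable[[object], list[str]],  # YOLOv8 wrapper -> object labels
    caption: Callable[[object], str],       # BLIP wrapper   -> scene caption
    speak: Callable[[str], bytes],          # gTTS wrapper   -> audio bytes
) -> tuple[str, bytes]:
    """Run one frame through detection, captioning, and speech synthesis."""
    objects = detect(frame)
    description = caption(frame)
    guidance = f"Detected: {', '.join(objects) or 'nothing'}. {description}"
    return guidance, speak(guidance)
```

In the real application, `detect` would wrap an `ultralytics` YOLO model, `caption` a BLIP model, and `speak` gTTS; any stand-ins with the same signatures work for experimentation.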
```
Smart-Navigation-System
│
├── app.py
├── modules
│   ├── caption.py
│   ├── detection.py
│   ├── navigation.py
│   ├── tts.py
│   ├── video_processing.py
│   └── webcam_processing.py
│
├── training
│   ├── train_blip.py
│   ├── evaluate_model.py
│   └── dataset_loader.py
│
├── dataset
│
├── requirements.txt
└── README.md
```
Distance Estimation
Integrate depth estimation to determine how far objects are from the user.
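Once a per-object distance estimate is available (for instance from a monocular depth model such as MiDaS), distances could be bucketed into coarse spoken zones rather than read out as raw numbers. A sketch of the bucketing step only; the thresholds and zone names are illustrative assumptions:

```python
def distance_zone(distance_m: float) -> str:
    """Map an estimated distance in metres to a coarse spoken warning zone."""
    if distance_m < 1.0:
        return "very close"
    if distance_m < 3.0:
        return "near"
    if distance_m < 8.0:
        return "ahead"
    return "far"
```

A chair estimated at 0.8 m would then be announced as "very close", which is more useful to a listener than a numeric readout.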
Mobile Deployment
Convert the system into a mobile application for real-world use.
Improved Scene Understanding
Use advanced Vision-Language Models such as InstructBLIP or LLaVA for richer descriptions.
Edge Deployment
Optimize models for deployment on edge devices such as Raspberry Pi or smart glasses.