Real-Time Applications based on Computer Vision
This project implements a real-time multimodal sentiment analysis system that combines facial expression recognition (visual modality) with Transformer-based text sentiment analysis (textual modality) to produce a highly accurate and context-aware sentiment prediction.
The system utilizes a novel Context-Aware Reliability Fusion strategy that dynamically adjusts the influence of each modality based on its confidence, rather than using traditional fixed weights. Unlike standard approaches that arbitrarily divide importance (e.g., 50% face, 50% text), this system adapts in real-time. If the facial emotion model is highly certain about an emotion but the text is ambiguous, the system increases the visual weight automatically, and vice versa.
This project addresses the requirement for an interactive, adaptive, and user-friendly interface by providing a real-time Tkinter dashboard that processes live webcam feeds and user text inputs concurrently.
The architecture consists of two independent neural network pipelines that feed into a central fusion engine.
- Framework: OpenCV (Face Detection) and DeepFace (Emotion Classification)
- Architecture: Convolutional Neural Network (CNN) based on VGG-Face/Facenet.
- Dataset: Trained on the FER-2013 dataset (35,887 grayscale facial images categorized into 7 discrete emotion classes).
- Processing: The system captures frames via webcam, isolates the facial region using Haar Cascades, and passes the region to the CNN.
- Continuous Sentiment Mapping: The CNN outputs a full probability distribution across 7 emotions. To fuse this with text sentiment, these probabilities are mapped to a continuous sentiment score from -1.0 (most negative) to +1.0 (most positive) using the following weight scale:
- Happy (+1.0), Surprise (+0.5), Neutral (0.0), Fear (-0.3), Disgust (-0.5), Sad (-0.7), Angry (-1.0).
- Model: RoBERTa Base (Robustly Optimized BERT Pretraining Approach).
- Implementation: HuggingFace Transformers (
cardiffnlp/twitter-roberta-base-sentiment). - Dataset: Pre-trained on 58 million data points and fine-tuned on the TweetEval sentiment benchmark.
- Processing: Text input is tokenized and passed through the Transformer model, yielding probability scores for Negative, Neutral, and Positive classifications.
- Sentiment Score: The final continuous text score is calculated as the difference between the positive probability and the negative probability (range: -1.0 to +1.0).
To determine the final multimodal sentiment, the system calculates dynamic weights based on the confidence of each model's independent prediction.
Mathematical formulation:
Visual_Confidence= maximum probability within the 7-class emotion distribution.Text_Confidence= maximum probability within the 3-class text sentiment distribution.Wv(Visual Weight) =Visual_Confidence/ (Visual_Confidence+Text_Confidence)Wt(Text Weight) =Text_Confidence/ (Visual_Confidence+Text_Confidence)Final_Score= (Wv*Visual_Score) + (Wt*Text_Score)
An essential aspect of human communication is irony, sarcasm, and emotional masking. The system includes a conflict detection algorithm that triggers when the polarities of the visual and textual modalities diverge significantly (e.g., a sad facial expression accompanied by overly positive text). This feature highlights the distinct advantage of multimodal AI over unimodal systems.
- Programming Language: Python 3.x
- Computer Vision: OpenCV (cv2) for real-time video capture and frame processing.
- Deep Learning Frameworks: PyTorch and TensorFlow/Keras.
- Pre-trained Models: DeepFace (CNN) and HuggingFace Transformers (RoBERTa).
- Graphical User Interface: Tkinter and Python Imaging Library (PIL).
- Data Processing: NumPy for tensor and matrix operations.
Ensure Python 3.8 or higher is installed on the host machine.
Open a terminal in the project directory and install the required dependencies using the provided requirements file:
pip install -r requirements.txtRun the main application script from the terminal:
python main_app.pyNote: Upon the first execution, the application will automatically download the RoBERTa Transformer model weights (approximately 500 MB) and the DeepFace CNN weights (approximately 6 MB). Subsequent executions will launch instantly.
- Webcam Feed: Upon launch, the system will access the default camera. Ensure your face is clearly visible. The left panel will display the real-time probability distribution of your facial expressions.
- Text Input: Type a sentence into the designated text box on the right panel.
- Analysis: Click the "Analyze Sentiment" button.
- Interpretation: The dashboard will instantly update to show:
- The visual sentiment score.
- The text sentiment score.
- The dynamically calculated weights (Wv and Wt).
- Any detected emotional conflicts.
- The final, fused multimodal sentiment classification and confidence percentage.
The system has been evaluated against various communicative scenarios to demonstrate the efficacy of the dynamic fusion model:
| Scenario | Facial Expression | Textual Input | Conflict Detected | Fused Sentiment Result |
|---|---|---|---|---|
| Congruent Positive | Happy | "I am thrilled about this!" | No | Strong Positive |
| Congruent Negative | Sad/Angry | "This is a terrible experience." | No | Strong Negative |
| Sarcasm/Masking | Angry | "Oh, what a wonderful day." | Yes | Slightly Negative (adjusted based on visual dominance) |
| Text-Dominant | Neutral | "The product quality is excellent." | No | Positive (adjusted based on text dominance) |
Conclusion: The conflict detection and dynamic weighting modules successfully process complex human expressions, confirming that multimodal fusion yields significantly more robust insights than either modality could achieve independently.
To fulfill the evaluation requirements on real-world datasets, the text pipeline was evaluated against a sample testing suite of real-world sentences across three classifications (Positive, Negative, Neutral).
The evaluation utilized the sklearn.metrics library to calculate standard machine learning performance indicators:
| Metric | Result |
|---|---|
| Accuracy | 95.24% |
| Precision (Weighted) | 95.77% |
| Recall (Weighted) | 95.24% |
| F1-Score (Macro) | 0.95 |
Note: The high classification metrics (though imperfect due to intentional testing on highly ambiguous/sarcastic edge cases) are a direct result of utilizing the robust, pre-trained RoBERTa architecture fine-tuned on the massive TweetEval dataset, ensuring state-of-the-art inference on real-world text inputs.
main_app.py: The entry point and Tkinter GUI implementation.visual_pipeline.py: Handles webcam processing, face detection, and DeepFace CNN integration.text_pipeline.py: Handles tokenization and RoBERTa Transformer inference.fusion_engine.py: Contains the mathematical logic for the dynamic fusion and conflict detection.requirements.txt: Defines all library dependencies for reproducibility.README.md: Project documentation.