Real-Time Multimodal Sentiment Analysis with Context-Aware Dynamic Fusion

Real-Time Applications based on Computer Vision

1. Introduction

This project implements a real-time multimodal sentiment analysis system that combines facial expression recognition (visual modality) with Transformer-based text sentiment analysis (textual modality) to produce a highly accurate and context-aware sentiment prediction.

The system utilizes a novel Context-Aware Reliability Fusion strategy that dynamically adjusts the influence of each modality based on its confidence, rather than using traditional fixed weights. Unlike standard approaches that arbitrarily divide importance (e.g., 50% face, 50% text), this system adapts in real-time. If the facial emotion model is highly certain about an emotion but the text is ambiguous, the system increases the visual weight automatically, and vice versa.

This project addresses the requirement for an interactive, adaptive, and user-friendly interface by providing a real-time Tkinter dashboard that processes live webcam feeds and user text inputs concurrently.

2. Methodology and Architecture

The architecture consists of two independent neural network pipelines that feed into a central fusion engine.

2.1 Visual Modality: Facial Emotion Recognition (CNN)

Framework: OpenCV (Face Detection) and DeepFace (Emotion Classification)
Architecture: Convolutional Neural Network (CNN) based on VGG-Face/Facenet.
Dataset: Trained on the FER-2013 dataset (35,887 grayscale facial images categorized into 7 discrete emotion classes).
Processing: The system captures frames via webcam, isolates the facial region using Haar Cascades, and passes the region to the CNN.
Continuous Sentiment Mapping: The CNN outputs a full probability distribution across 7 emotions. To fuse this with text sentiment, these probabilities are mapped to a continuous sentiment score from -1.0 (most negative) to +1.0 (most positive) using the following weight scale:
- Happy (+1.0), Surprise (+0.5), Neutral (0.0), Fear (-0.3), Disgust (-0.5), Sad (-0.7), Angry (-1.0).

2.2 Textual Modality: Transformer-Based Sentiment Analysis

Model: RoBERTa Base (Robustly Optimized BERT Pretraining Approach).
Implementation: HuggingFace Transformers (cardiffnlp/twitter-roberta-base-sentiment).
Dataset: Pre-trained on 58 million data points and fine-tuned on the TweetEval sentiment benchmark.
Processing: Text input is tokenized and passed through the Transformer model, yielding probability scores for Negative, Neutral, and Positive classifications.
Sentiment Score: The final continuous text score is calculated as the difference between the positive probability and the negative probability (range: -1.0 to +1.0).

2.3 Context-Aware Reliability Fusion (Novel Contribution)

To determine the final multimodal sentiment, the system calculates dynamic weights based on the confidence of each model's independent prediction.

Mathematical formulation:

Visual_Confidence = maximum probability within the 7-class emotion distribution.
Text_Confidence = maximum probability within the 3-class text sentiment distribution.
Wv (Visual Weight) = Visual_Confidence / (Visual_Confidence + Text_Confidence)
Wt (Text Weight) = Text_Confidence / (Visual_Confidence + Text_Confidence)
Final_Score = (Wv * Visual_Score) + (Wt * Text_Score)

2.4 Multimodal Conflict Detection

An essential aspect of human communication is irony, sarcasm, and emotional masking. The system includes a conflict detection algorithm that triggers when the polarities of the visual and textual modalities diverge significantly (e.g., a sad facial expression accompanied by overly positive text). This feature highlights the distinct advantage of multimodal AI over unimodal systems.

3. Technology Stack

Programming Language: Python 3.x
Computer Vision: OpenCV (cv2) for real-time video capture and frame processing.
Deep Learning Frameworks: PyTorch and TensorFlow/Keras.
Pre-trained Models: DeepFace (CNN) and HuggingFace Transformers (RoBERTa).
Graphical User Interface: Tkinter and Python Imaging Library (PIL).
Data Processing: NumPy for tensor and matrix operations.

4. Setup and Installation

Prerequisites

Ensure Python 3.8 or higher is installed on the host machine.

Installation

Open a terminal in the project directory and install the required dependencies using the provided requirements file:

pip install -r requirements.txt

Execution

Run the main application script from the terminal:

python main_app.py

Note: Upon the first execution, the application will automatically download the RoBERTa Transformer model weights (approximately 500 MB) and the DeepFace CNN weights (approximately 6 MB). Subsequent executions will launch instantly.

5. Usage Instructions

Webcam Feed: Upon launch, the system will access the default camera. Ensure your face is clearly visible. The left panel will display the real-time probability distribution of your facial expressions.
Text Input: Type a sentence into the designated text box on the right panel.
Analysis: Click the "Analyze Sentiment" button.
Interpretation: The dashboard will instantly update to show:
- The visual sentiment score.
- The text sentiment score.
- The dynamically calculated weights (Wv and Wt).
- Any detected emotional conflicts.
- The final, fused multimodal sentiment classification and confidence percentage.

6. Experimental Results and Scenarios

The system has been evaluated against various communicative scenarios to demonstrate the efficacy of the dynamic fusion model:

Scenario	Facial Expression	Textual Input	Conflict Detected	Fused Sentiment Result
Congruent Positive	Happy	"I am thrilled about this!"	No	Strong Positive
Congruent Negative	Sad/Angry	"This is a terrible experience."	No	Strong Negative
Sarcasm/Masking	Angry	"Oh, what a wonderful day."	Yes	Slightly Negative (adjusted based on visual dominance)
Text-Dominant	Neutral	"The product quality is excellent."	No	Positive (adjusted based on text dominance)

Conclusion: The conflict detection and dynamic weighting modules successfully process complex human expressions, confirming that multimodal fusion yields significantly more robust insights than either modality could achieve independently.

7. Performance Evaluation

To fulfill the evaluation requirements on real-world datasets, the text pipeline was evaluated against a sample testing suite of real-world sentences across three classifications (Positive, Negative, Neutral).

The evaluation utilized the sklearn.metrics library to calculate standard machine learning performance indicators:

Metric	Result
Accuracy	95.24%
Precision (Weighted)	95.77%
Recall (Weighted)	95.24%
F1-Score (Macro)	0.95

Note: The high classification metrics (though imperfect due to intentional testing on highly ambiguous/sarcastic edge cases) are a direct result of utilizing the robust, pre-trained RoBERTa architecture fine-tuned on the massive TweetEval dataset, ensuring state-of-the-art inference on real-world text inputs.

8. Project Structure

main_app.py: The entry point and Tkinter GUI implementation.
visual_pipeline.py: Handles webcam processing, face detection, and DeepFace CNN integration.
text_pipeline.py: Handles tokenization and RoBERTa Transformer inference.
fusion_engine.py: Contains the mathematical logic for the dynamic fusion and conflict detection.
requirements.txt: Defines all library dependencies for reproducibility.
README.md: Project documentation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Real-Time Multimodal Sentiment Analysis with Context-Aware Dynamic Fusion

1. Introduction

2. Methodology and Architecture

2.1 Visual Modality: Facial Emotion Recognition (CNN)

2.2 Textual Modality: Transformer-Based Sentiment Analysis

2.3 Context-Aware Reliability Fusion (Novel Contribution)

2.4 Multimodal Conflict Detection

3. Technology Stack

4. Setup and Installation

Prerequisites

Installation

Execution

5. Usage Instructions

6. Experimental Results and Scenarios

7. Performance Evaluation

8. Project Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
README.md		README.md
evaluate_performance.py		evaluate_performance.py
fusion_engine.py		fusion_engine.py
main_app.py		main_app.py
requirements.txt		requirements.txt
text_pipeline.py		text_pipeline.py
visual_pipeline.py		visual_pipeline.py

Folders and files

Latest commit

History

Repository files navigation

Real-Time Multimodal Sentiment Analysis with Context-Aware Dynamic Fusion

1. Introduction

2. Methodology and Architecture

2.1 Visual Modality: Facial Emotion Recognition (CNN)

2.2 Textual Modality: Transformer-Based Sentiment Analysis

2.3 Context-Aware Reliability Fusion (Novel Contribution)

2.4 Multimodal Conflict Detection

3. Technology Stack

4. Setup and Installation

Prerequisites

Installation

Execution

5. Usage Instructions

6. Experimental Results and Scenarios

7. Performance Evaluation

8. Project Structure

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages