🚀 Real-Time Social Media Sentiment Analysis (Big Data & NLP)

📌 Project Overview

This project implements a fully containerized, end-to-end Big Data pipeline for real-time sentiment analysis of social media data (Reddit).

The system covers:

Real-time data ingestion
Stream processing
Persistent storage
NLP-based sentiment classification
Topic extraction and insights

All components are orchestrated using Docker Compose, ensuring reproducibility and portability across machines.

🧠 Global Architecture

Reddit (public JSON endpoints)
↓
Collector (Python)
↓
Kafka topics (reddit_posts / reddit_comments)
↓
Spark Structured Streaming
↓
MongoDB (posts, comments, analytics)
↓
Sentiment & topic insights

🧩 Team Contributions

👤 Mohamed Amine Azirgui — Data Ingestion & Streaming Backbone

Responsibilities

Scraped Reddit using public JSON endpoints (no API keys)
Produced structured JSON messages into Kafka topics:
- reddit_posts
- reddit_comments
Designed a Docker-first ingestion stack
Implemented Kafka health checks to guarantee safe startup order
Persisted ingestion state to avoid duplicate data on restarts
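
The ingestion steps above could be sketched roughly as follows. This is a minimal illustration, not the project's collector: the selected fields, the state-file layout, and all function names are assumptions, and the HTTP fetch and Kafka producer calls are left out so the snippet runs without network access. It assumes the listing has the shape returned by Reddit's public `.json` endpoints (`data.children[].data`).

```python
import json
from pathlib import Path


def extract_new_posts(listing, seen_ids):
    """Turn a Reddit listing payload into Kafka-ready messages, skipping known ids."""
    messages = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        post_id = post.get("id")
        if post_id is None or post_id in seen_ids:
            continue  # malformed entry, or already ingested before a restart
        seen_ids.add(post_id)
        messages.append({
            "id": post_id,
            "subreddit": post.get("subreddit"),
            "title": post.get("title", ""),
            "text": post.get("selftext", ""),
            "created_utc": post.get("created_utc"),
        })
    return messages


def load_state(path):
    """Load the set of already-ingested ids (empty on first run)."""
    p = Path(path)
    return set(json.loads(p.read_text())) if p.exists() else set()


def save_state(path, seen_ids):
    """Persist ingested ids so a restart does not re-publish them."""
    Path(path).write_text(json.dumps(sorted(seen_ids)))


def to_kafka_value(message):
    """Serialize one message as UTF-8 JSON bytes, a common Kafka value format."""
    return json.dumps(message).encode("utf-8")
```

In a real collector, each value returned by `to_kafka_value` would be sent to `reddit_posts` or `reddit_comments` by a Kafka producer, and `save_state` would run after each successful batch.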

Technologies

Python
Apache Kafka
Docker & Docker Compose

👤 Youssef Bouzit — Streaming ETL & Storage (Spark + MongoDB)

Responsibilities

Implemented Spark Structured Streaming jobs consuming Kafka topics
Cleaned, normalized, and enriched text streams
Persisted raw and processed data into MongoDB
Built time-based aggregations
Solved Windows/Hadoop compatibility issues by running Spark in Linux containers
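
To make the cleaning and time-based aggregation steps concrete, here is a plain-Python sketch of the same logic. It is not the project's Spark job: the normalization rules, the 5-minute window size, and the `subreddit` / `created_utc` field names are assumptions. In Spark Structured Streaming, the bucketing would normally be expressed with a windowed `groupBy` rather than hand-written code.

```python
import re
from collections import Counter
from datetime import datetime, timezone

URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_text(text):
    """Drop URLs, collapse whitespace, and lowercase the text."""
    text = URL_RE.sub(" ", text or "")
    return WHITESPACE_RE.sub(" ", text).strip().lower()


def window_start(created_utc, window_seconds=300):
    """Return the start of the tumbling window that contains the timestamp."""
    bucket = int(created_utc) - int(created_utc) % window_seconds
    return datetime.fromtimestamp(bucket, tz=timezone.utc)


def count_per_window(records, window_seconds=300):
    """Count records per (subreddit, window start), similar to a windowed groupBy."""
    counts = Counter()
    for rec in records:
        key = (rec["subreddit"], window_start(rec["created_utc"], window_seconds))
        counts[key] += 1
    return counts
```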

Technologies

Apache Spark (Structured Streaming)
Apache Kafka
MongoDB
Docker

👤 Mouad Souhal — Sentiment Analysis & NLP Modeling

Responsibilities

Built an automated NLP pipeline for posts and comments
Classified sentiment into positive / neutral / negative
Stored sentiment labels, confidence scores, timestamps, and model versions in MongoDB
Applied text preprocessing and normalization
Evaluated model performance and ensured reproducibility
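
As an illustration of what a stored sentiment result might look like, the sketch below turns per-class probabilities into a record with a label, confidence score, timestamp, and model version. The field names, the label order, and the `build_sentiment_record` helper are hypothetical. The probabilities are assumed to come from a classifier that exposes them (for scikit-learn models, the column order follows the estimator's `classes_` attribute).

```python
from datetime import datetime, timezone

# Assumed label order; it must match the order of the classifier's classes.
LABELS = ("negative", "neutral", "positive")


def build_sentiment_record(doc_id, probabilities, model_version):
    """Pick the most probable label and package it with metadata for storage."""
    if len(probabilities) != len(LABELS):
        raise ValueError("expected one probability per label")
    best = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return {
        "doc_id": doc_id,
        "sentiment": LABELS[best],
        "confidence": float(probabilities[best]),
        "model_version": model_version,
        "scored_at": datetime.now(timezone.utc).isoformat(),
    }
```

Keeping `model_version` on every record makes it possible to tell which model produced a label after retraining.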

Technologies

Python
scikit-learn
NLP (TF-IDF, Bag-of-Words)
MongoDB

👤 Abdoul Amine Kabirou Amusa — Topic Modeling, Insights & Reporting

Responsibilities

Implemented topic extraction to explain sentiment context
Applied TF-IDF + NMF for unsupervised topic modeling
Identified dominant discussion themes per subreddit
Analyzed sentiment trends and topic frequency over time
Produced interpretable insights and summaries for reporting and presentation
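
The TF-IDF + NMF approach can be illustrated with a small dependency-free sketch. This is not the project's code, which lists scikit-learn as its tooling; the function names are hypothetical. The smoothed IDF formula mirrors scikit-learn's default, row normalization is omitted for brevity, and the factorization uses the classic multiplicative update rules for the Frobenius objective.

```python
import math
import random


def tfidf_matrix(docs):
    """Build a documents x terms TF-IDF matrix with smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    matrix = [[toks.count(t) * idf[t] for t in vocab] for toks in tokenized]
    return matrix, vocab


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H, used to judge reconstruction quality."""
    wh = matmul(w, h)
    return math.sqrt(sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V ~ W H with non-negative W, H using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(n_topics)]
    for _ in range(n_iter):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(n_topics)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_topics)]
             for i in range(n)]
    return w, h


def top_terms(h, vocab, k=3):
    """Return the k highest-weighted terms for each topic row of H."""
    return [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:k]]
            for row in h]
```

Each row of `H` is a topic expressed as term weights, and each row of `W` says how strongly a document loads on each topic, which is what makes the dominant themes per subreddit readable.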

Technologies

Python
scikit-learn
NLP (TF-IDF, NMF)
Data Analysis & Visualization

🧪 Validation & Debugging

The pipeline was validated by inspecting the data at each stage, not just by confirming that the containers were running:

Kafka topics manually listed and consumed from earliest offsets
JSON message schemas verified
MongoDB queried from inside the container using mongosh
Document counts and collections validated
Spark stabilized using Docker-based Linux execution
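
A schema check on consumed messages, like the one mentioned above, might look like the sketch below. The required fields and their types in `POST_SCHEMA` are hypothetical, as is the `validate_message` helper; the project's actual schemas may differ.

```python
import json

# Hypothetical required fields and types; the real schemas live in the project.
POST_SCHEMA = {"id": str, "subreddit": str, "title": str, "created_utc": (int, float)}


def validate_message(raw_value, schema=POST_SCHEMA):
    """Return a list of problems found in one Kafka message value (empty if valid)."""
    try:
        message = json.loads(raw_value)
    except (TypeError, ValueError) as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(message, dict):
        return ["top-level JSON value is not an object"]
    problems = []
    for field, expected in schema.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif not isinstance(message[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems
```

Running this over messages consumed from the earliest offset gives a quick count of malformed records before they reach Spark.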

▶️ How to Run the Project

Prerequisites

Docker
Docker Compose

Run the full pipeline

docker compose up -d --build

