Reddit Sentiment Analysis Pipeline

A production-ready, real-time sentiment analysis system for Reddit data with streaming ingestion, transformer-based ML models, automated retraining, and comprehensive MLOps integration. 100% FREE - no credit card or paid API access required!

📚 Documentation

Architecture Overview - System design, components, and data flow
Deployment Guide - Production deployment strategies and infrastructure
Reddit API Setup - Quick guide to get free Reddit API credentials

✨ Features

Core Capabilities

🔄 Real-Time Data Ingestion: Stream posts and comments using Reddit API (PRAW) - completely free!
🧹 Advanced Text Preprocessing: NLP preprocessing with emoji handling, URL removal, and tokenization (NLTK, spaCy)
🤖 Transformer Models: Fine-tune BERT/DistilBERT models with HuggingFace Transformers
📊 MLOps Integration: Full experiment tracking with MLflow and Weights & Biases
🔁 Automated Retraining: Intelligent retraining based on data drift and performance degradation
🚀 Production-Ready API: FastAPI with async support, automatic OpenAPI docs, and Prometheus metrics
📦 Containerized: Docker and Docker Compose for consistent deployments

Technical Highlights

Modern Python 3.12: Type hints with PEP 604 syntax, async/await patterns
Horizontal Scaling: Stateless API design with Redis caching and Celery workers
Comprehensive Monitoring: Prometheus metrics, structured logging, health checks
Database Migrations: Alembic for version-controlled schema changes
Security Best Practices: Input validation, secrets management, TLS/SSL ready

Architecture

├── src/
│   ├── data_ingestion/       # Reddit streaming and data collection
│   ├── preprocessing/         # Text preprocessing and feature engineering
│   ├── model_training/        # Model training, versioning, and retraining
│   ├── api/                   # FastAPI application
│   ├── tasks/                 # Celery background tasks
│   └── database/              # Database models and connections
├── tests/                     # Unit and integration tests
├── scripts/                   # Utility scripts
├── .github/workflows/         # CI/CD pipelines
└── docker-compose.yml         # Docker orchestration

📋 Prerequisites

Required

Python 3.12+ (uses modern type hints and features)
Docker & Docker Compose (recommended for easy setup)
Reddit API Access (100% FREE):
- Reddit account (free)
- Create app at reddit.com/prefs/apps
- Get Client ID and Client Secret (takes 2 minutes)
- No credit card required!
- Rate limit: 60 requests/minute (very generous)

Optional (for manual installation)

PostgreSQL 15+
Redis 7+
Weights & Biases account (enhanced experiment tracking)

🚀 Quick Start

Option 1: Docker Compose (Recommended)

# 1. Clone and configure
git clone https://github.com/yourusername/reddit-sentiment-analysis.git
cd reddit-sentiment-analysis
cp .env.example .env
# Edit .env with your Reddit API credentials (free - see setup below)

# 2. Start all services
docker compose up -d

# 3. Access the application
open http://localhost:8000/docs  # macOS; on Linux use xdg-open, or paste the URL into a browser

That's it! All services (API, PostgreSQL, Redis, MLflow, Celery workers, Prometheus, Grafana) are now running.

Option 2: Manual Installation

1. Clone and Setup Environment

git clone <repository-url>
cd reddit-sentiment-analysis

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download NLTK data
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"

2. Configure Environment Variables

cp .env.example .env
# Edit .env with your credentials

Environment variables (all required unless marked optional):

REDDIT_CLIENT_ID - Reddit app client ID (free from reddit.com/prefs/apps)
REDDIT_CLIENT_SECRET - Reddit app client secret (free from reddit.com/prefs/apps)
REDDIT_USER_AGENT - Your app identifier (e.g., "sentiment-bot/1.0")
DATABASE_URL - PostgreSQL connection string
REDIS_URL - Redis connection string
MLFLOW_TRACKING_URI - MLflow server URL
WANDB_API_KEY - Optional, for W&B tracking
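
As a rough illustration of how the variables listed above could be checked before starting the services, the sketch below reports which required names are unset or blank. The helper `missing_required_vars` is hypothetical and not part of this project; the project's own config.py may validate settings differently.

```python
import os
from collections.abc import Mapping

# Names taken from the list above; WANDB_API_KEY is optional and left out.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USER_AGENT",
    "DATABASE_URL",
    "REDIS_URL",
    "MLFLOW_TRACKING_URI",
)


def missing_required_vars(env: Mapping[str, str]) -> list[str]:
    """Return the required variable names that are unset or blank in env.

    Hypothetical helper for illustration only.
    """
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Check the current process environment and report the result.
missing = missing_required_vars(os.environ)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running a check like this before `docker compose up -d` surfaces a misconfigured .env earlier than a failed database or Reddit connection would.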

3. Initialize Database

python scripts/init_db.py

4. Start Services with Docker Compose

docker compose up -d

This starts:

FastAPI application (port 8000)
PostgreSQL database (port 5432)
Redis (port 6379)
MLflow server (port 5000)
Celery workers for background tasks
Prometheus (port 9090)
Grafana (port 3000)

5. Access the Application

API Documentation: http://localhost:8000/docs (Interactive Swagger UI)
API Health Check: http://localhost:8000/health
MLflow UI: http://localhost:5000 (Experiment tracking)
Prometheus: http://localhost:9090 (Metrics)
Grafana: http://localhost:3000 (Dashboards - admin/admin)

🎯 Demo Credentials

For quick testing without Reddit API access:

# Send sample text to the prediction endpoint
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "I love this amazing product! Best purchase ever!"}'

# Expected response:
{
  "sentiment": "positive",
  "confidence": 0.94,
  "scores": {
    "positive": 0.94,
    "neutral": 0.04,
    "negative": 0.02
  }
}

See the API Documentation for all available endpoints.

Usage

Stream Reddit Data

from src.data_ingestion.reddit_streamer import RedditStreamManager

manager = RedditStreamManager()
subreddits = ['technology', 'MachineLearning', 'artificial']
manager.stream_subreddit_submissions(subreddits, save_to_db=True)

Train a Model

from src.model_training.trainer import SentimentModelTrainer

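# texts and labels are parallel lists that you supply:
# raw strings and their corresponding sentiment labels
trainer = SentimentModelTrainer()
train_dataset, val_dataset, test_dataset = trainer.prepare_data(texts, labels)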
trainer.train(train_dataset, val_dataset)
model_path = trainer.save_model('./trained_models', version='v1.0.0')

Make Predictions via API

# Single prediction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "I love this amazing product!"}'

# Batch prediction
curl -X POST "http://localhost:8000/predict/batch" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Great service!", "Terrible experience", "It was okay"]}'
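
The same requests can be issued from Python. The sketch below uses only the standard library and assumes the API is reachable at http://localhost:8000 and returns JSON shaped like the sample response under Demo Credentials. The helpers `build_request`, `post_json`, and `top_label` are illustrative names, not part of this project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the default port used in this README


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the given API path."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def top_label(scores: dict[str, float]) -> str:
    """Return the label with the highest score from a 'scores' mapping."""
    return max(scores, key=scores.get)
```

With the services running, `post_json("/predict", {"text": "I love this amazing product!"})` sends the single-prediction request and `post_json("/predict/batch", {"texts": ["Great service!", "It was okay"]})` sends the batch one. The shape of the batch response is not documented here, so inspect it before relying on specific fields.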

Trigger Retraining

curl -X POST "http://localhost:8000/retrain" \
  -H "Content-Type: application/json" \
  -d '{"force": false}'

API Endpoints

Endpoint	Method	Description
`/`	GET	API information
`/health`	GET	Health check and model status
`/predict`	POST	Single text sentiment prediction
`/predict/batch`	POST	Batch sentiment predictions
`/retrain`	POST	Trigger model retraining
`/model/info`	GET	Current and recent model information
`/data/quality`	GET	Data quality and drift monitoring
`/metrics`	GET	Prometheus metrics

Model Training & Versioning

The system uses MLflow and Weights & Biases for experiment tracking:

Automatic Logging: All training runs are logged with hyperparameters and metrics
Model Registry: Models are versioned and stored with metadata
Experiment Comparison: Compare different model versions in MLflow/W&B UI
Artifact Storage: Model checkpoints and artifacts are stored and versioned

Automated Retraining

The retraining pipeline automatically triggers when:

Data Threshold: Accumulation of N new unlabeled samples (configurable)
Time-Based: After X hours since last training (configurable)
Performance Degradation: When model confidence drops below threshold
Data Drift: When significant distribution shift is detected

Configure thresholds in .env:

MIN_SAMPLES_FOR_RETRAIN=1000
RETRAIN_INTERVAL_HOURS=24
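
The four triggers above can be pictured as independent checks, any one of which is enough to start retraining. The sketch below is an assumption about how such a decision could be structured, not the project's actual retraining code. The confidence and drift thresholds are hypothetical values, since this README only documents MIN_SAMPLES_FOR_RETRAIN and RETRAIN_INTERVAL_HOURS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    min_samples: int = 1000        # MIN_SAMPLES_FOR_RETRAIN from .env
    interval_hours: float = 24.0   # RETRAIN_INTERVAL_HOURS from .env
    min_confidence: float = 0.6    # hypothetical threshold, not documented in this README
    max_drift: float = 0.2         # hypothetical threshold, not documented in this README


def retrain_reasons(
    policy: RetrainPolicy,
    new_samples: int,
    hours_since_training: float,
    mean_confidence: float,
    drift_score: float,
    force: bool = False,
) -> list[str]:
    """Return the name of every trigger that currently fires."""
    reasons = []
    if force:
        reasons.append("forced")
    if new_samples >= policy.min_samples:
        reasons.append("data_threshold")
    if hours_since_training >= policy.interval_hours:
        reasons.append("time_based")
    if mean_confidence < policy.min_confidence:
        reasons.append("performance_degradation")
    if drift_score > policy.max_drift:
        reasons.append("data_drift")
    return reasons
```

Retraining would proceed whenever the returned list is non-empty. Returning the reasons rather than a bare boolean makes it easy to log why a run was started. The `force` argument mirrors the "force" field of the /retrain request body, whose exact semantics are assumed here.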

Monitoring & Observability

Prometheus Metrics

sentiment_predictions_total: Total number of predictions
sentiment_prediction_duration_seconds: Prediction latency
sentiment_prediction_confidence: Model confidence distribution
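
These metrics are served at /metrics in the Prometheus text exposition format, so they can also be read without a Prometheus server. The sketch below sums one metric across all of its label combinations. The `sentiment` label and the sample values are made up for illustration, and the labels this API actually attaches may differ.

```python
def sum_metric(exposition: str, metric_name: str) -> float:
    """Sum every sample of metric_name across all label combinations."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: name{label="value",...} value [timestamp]
            name, _, rest = line.partition("{")
            _, _, tail = rest.rpartition("}")
        else:
            # Unlabelled sample: name value [timestamp]
            name, _, tail = line.partition(" ")
        if name.strip() != metric_name:
            continue
        fields = tail.split()
        if fields:
            total += float(fields[0])  # first field is the sample value
    return total


# Hypothetical scrape output; real label names and values will differ.
SAMPLE = """\
# HELP sentiment_predictions_total Total number of predictions
# TYPE sentiment_predictions_total counter
sentiment_predictions_total{sentiment="positive"} 12
sentiment_predictions_total{sentiment="negative"} 5
sentiment_prediction_duration_seconds_sum 3.5
"""

print(sum_metric(SAMPLE, "sentiment_predictions_total"))  # 17.0
```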

MLflow Tracking

Training metrics (loss, accuracy, F1-score)
Model parameters and hyperparameters
Model artifacts and checkpoints

Data Quality Monitoring

Sentiment distribution tracking
Prediction confidence monitoring
Data drift detection
Model performance degradation alerts
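
One simple way to track shifts in the sentiment distribution is to compare the share of each predicted label in a reference window against a recent window. The sketch below uses total variation distance for that comparison. This README does not say which drift test the project uses, so treat this as a stand-in technique with hypothetical function names.

```python
from collections import Counter
from collections.abc import Iterable

LABELS = ("positive", "neutral", "negative")


def label_shares(labels: Iterable[str]) -> dict[str, float]:
    """Return the fraction of predictions carrying each sentiment label."""
    counts = Counter(labels)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        raise ValueError("no labelled predictions to summarise")
    return {label: counts[label] / total for label in LABELS}


def total_variation(reference: Iterable[str], recent: Iterable[str]) -> float:
    """Total variation distance between two label distributions.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    ref = label_shares(reference)
    new = label_shares(recent)
    return 0.5 * sum(abs(ref[label] - new[label]) for label in LABELS)
```

A score from `total_variation` could then be compared against a configured threshold to raise a drift alert. The right threshold depends on traffic volume and how stable the subreddit mix is, so it needs tuning on real data.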

CI/CD Pipeline

GitHub Actions workflows:

CI Pipeline (.github/workflows/ci.yml):
- Code linting (flake8, black)
- Type checking (mypy)
- Unit tests with coverage
- Security scanning (Trivy)
- Docker image build
CD Pipeline (.github/workflows/cd.yml):
- Docker image build and push
- Automated deployment
- Model retraining trigger

Required GitHub Secrets

REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, REDDIT_USER_AGENT
DOCKER_USERNAME, DOCKER_PASSWORD
DEPLOY_KEY, DEPLOY_HOST
MLFLOW_TRACKING_URI, WANDB_API_KEY
DATABASE_URL

Testing

# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=src --cov-report=html

# Run specific test file
pytest tests/test_preprocessing.py

Development

Code Formatting

# Format code
black src/

# Check formatting
black --check src/

# Lint code
flake8 src/

Type Checking

mypy src/ --ignore-missing-imports

Deployment

Production Deployment

Update environment variables for production

Build Docker image:

docker build -t sentiment-analysis:latest .

Deploy using Docker Compose or an orchestration tool (Kubernetes, ECS, etc.)
Set up monitoring and alerting
Configure backup and disaster recovery

Scaling Considerations

Use Redis for caching and rate limiting
Deploy multiple API instances behind a load balancer
Use Celery for distributed task processing
Consider using GPU instances for model training
Implement horizontal scaling for API workers

Configuration

Key configuration options in config.py:

MODEL_NAME: Base transformer model (default: distilbert-base-uncased)
MAX_LENGTH: Maximum sequence length for tokenization
BATCH_SIZE: Training and inference batch size
LEARNING_RATE: Model training learning rate
NUM_EPOCHS: Number of training epochs
MIN_SAMPLES_FOR_RETRAIN: Minimum samples to trigger retraining
RETRAIN_INTERVAL_HOURS: Hours between automatic retraining

Troubleshooting

Common Issues

Reddit API Rate Limits: PRAW handles rate limiting automatically (60 req/min)
Memory Issues: Reduce batch size or use gradient accumulation
Database Connection Errors: Check PostgreSQL is running and credentials are correct
Model Loading Errors: Ensure model files exist and are not corrupted

Logs

# API logs
docker compose logs api

# Celery worker logs
docker compose logs celery_worker

# MLflow logs
docker compose logs mlflow

Support

For issues and questions:

Create an issue on GitHub
Check existing documentation
Review MLflow and W&B dashboards for model insights

🗺️ Roadmap

Short-term

Model A/B testing framework
GraphQL API option
Enhanced data drift detection
Multi-language support

Medium-term

Active learning pipeline
Model interpretability (SHAP, LIME)
Real-time prediction dashboard
Advanced data augmentation

Long-term

Multi-model ensemble predictions
Edge deployment for inference
Federated learning support
Custom sentiment categories per client

🤝 Contributing

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Make your changes with tests
Ensure all tests pass and code is formatted (black src/)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

HuggingFace Transformers for transformer models
FastAPI for the excellent web framework
MLflow for experiment tracking
PRAW for the Reddit API wrapper
Reddit for free API access
