A production-ready, real-time sentiment analysis system for Reddit data with streaming ingestion, transformer-based ML models, automated retraining, and comprehensive MLOps integration. 100% FREE - no credit card or paid API access required!
- Architecture Overview - System design, components, and data flow
- Deployment Guide - Production deployment strategies and infrastructure
- Reddit API Setup - Quick guide to get free Reddit API credentials
- Real-Time Data Ingestion: Stream posts and comments using the Reddit API (PRAW) - completely free!
- Advanced Text Preprocessing: NLP preprocessing with emoji handling, URL removal, and tokenization (NLTK, spaCy)
- Transformer Models: Fine-tune BERT/DistilBERT models with HuggingFace Transformers
- MLOps Integration: Full experiment tracking with MLflow and Weights & Biases
- Automated Retraining: Intelligent retraining based on data drift and performance degradation
- Production-Ready API: FastAPI with async support, automatic OpenAPI docs, and Prometheus metrics
- Containerized: Docker and Docker Compose for consistent deployments
- Modern Python 3.12: Type hints with PEP 604 syntax, async/await patterns
- Horizontal Scaling: Stateless API design with Redis caching and Celery workers
- Comprehensive Monitoring: Prometheus metrics, structured logging, health checks
- Database Migrations: Alembic for version-controlled schema changes
- Security Best Practices: Input validation, secrets management, TLS/SSL ready
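As a rough illustration of the preprocessing step (emoji handling, URL removal), here is a minimal regex-only sketch; the actual pipeline in `src/preprocessing/` uses NLTK and spaCy, so treat this as the general idea rather than the real implementation:

```python
import re

def basic_clean(text: str) -> str:
    """Illustrative cleanup only; the project's pipeline uses NLTK/spaCy."""
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"[^\w\s'!?.,]", " ", text)  # drop emoji and stray symbols
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text.lower()

print(basic_clean("LOVE this!! https://example.com 🚀"))  # love this!!
```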
```
├── src/
│   ├── data_ingestion/    # Reddit streaming and data collection
│   ├── preprocessing/     # Text preprocessing and feature engineering
│   ├── model_training/    # Model training, versioning, and retraining
│   ├── api/               # FastAPI application
│   ├── tasks/             # Celery background tasks
│   └── database/          # Database models and connections
├── tests/                 # Unit and integration tests
├── scripts/               # Utility scripts
├── .github/workflows/     # CI/CD pipelines
└── docker-compose.yml     # Docker orchestration
```
- Python 3.12+ (uses modern type hints and features)
- Docker & Docker Compose (recommended for easy setup)
- Reddit API Access (100% FREE):
- Reddit account (free)
- Create app at reddit.com/prefs/apps
- Get Client ID and Client Secret (takes 2 minutes)
- No credit card required!
- Rate limit: 60 requests/minute (very generous)
- PostgreSQL 15+
- Redis 7+
- Weights & Biases account (optional, for enhanced experiment tracking)
```
# 1. Clone and configure
git clone https://github.com/yourusername/reddit-sentiment-analysis.git
cd reddit-sentiment-analysis
cp .env.example .env
# Edit .env with your Reddit API credentials (free - see setup below)

# 2. Start all services
docker compose up -d

# 3. Access the application
open http://localhost:8000/docs
```

That's it! All services (API, Database, Redis, MLflow) are now running.
```
git clone <repository-url>
cd reddit-sentiment-analysis

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download NLTK data
python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')"

# Configure environment
cp .env.example .env
# Edit .env with your credentials
```

Required environment variables:
- `REDDIT_CLIENT_ID` - REQUIRED (free from reddit.com/prefs/apps)
- `REDDIT_CLIENT_SECRET` - REQUIRED (free from reddit.com/prefs/apps)
- `REDDIT_USER_AGENT` - Your app identifier (e.g., "sentiment-bot/1.0")
- `DATABASE_URL` - PostgreSQL connection string
- `REDIS_URL` - Redis connection string
- `MLFLOW_TRACKING_URI` - MLflow server URL
- `WANDB_API_KEY` - Optional, for W&B tracking
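A small startup helper can make missing credentials fail fast instead of surfacing as confusing errors later; this is a hypothetical sketch (`require_env` is not part of the project's config module):

```python
import os

def require_env(name: str) -> str:
    """Return the variable's value, or fail fast with a clear error.
    Illustrative helper only, not the project's actual config code."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# e.g. at application startup:
# client_id = require_env("REDDIT_CLIENT_ID")
```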
```
python scripts/init_db.py
docker compose up -d
```

This starts:
- FastAPI application (port 8000)
- PostgreSQL database (port 5432)
- Redis (port 6379)
- MLflow server (port 5000)
- Celery workers for background tasks
- Prometheus (port 9090)
- Grafana (port 3000)
- API Documentation: http://localhost:8000/docs (Interactive Swagger UI)
- API Health Check: http://localhost:8000/health
- MLflow UI: http://localhost:5000 (Experiment tracking)
- Prometheus: http://localhost:9090 (Metrics)
- Grafana: http://localhost:3000 (Dashboards - admin/admin)
For quick testing without Reddit API access:
```
# Use the demo endpoint with sample data
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "I love this amazing product! Best purchase ever!"}'
```

Expected response:

```
{
  "sentiment": "positive",
  "confidence": 0.94,
  "scores": {
    "positive": 0.94,
    "neutral": 0.04,
    "negative": 0.02
  }
}
```

See the API Documentation for all available endpoints.
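The response shape shown above is straightforward to consume from Python; note that the `sentiment` field is simply the highest-scoring class, and the scores sum to 1:

```python
import json

# Sample payload matching the response schema shown above
raw = '''{"sentiment": "positive", "confidence": 0.94,
          "scores": {"positive": 0.94, "neutral": 0.04, "negative": 0.02}}'''
payload = json.loads(raw)

# The top-scoring class is the predicted sentiment
top_label = max(payload["scores"], key=payload["scores"].get)
print(top_label)  # positive
```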
```
from src.data_ingestion.reddit_streamer import RedditStreamManager

manager = RedditStreamManager()
subreddits = ['technology', 'MachineLearning', 'artificial']
manager.stream_subreddit_submissions(subreddits, save_to_db=True)
```

```
from src.model_training.trainer import SentimentModelTrainer

trainer = SentimentModelTrainer()
train_dataset, val_dataset, test_dataset = trainer.prepare_data(texts, labels)
trainer.train(train_dataset, val_dataset)
model_path = trainer.save_model('./trained_models', version='v1.0.0')
```

```
# Single prediction
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text": "I love this amazing product!"}'

# Batch prediction
curl -X POST "http://localhost:8000/predict/batch" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Great service!", "Terrible experience", "It was okay"]}'

# Trigger retraining
curl -X POST "http://localhost:8000/retrain" \
  -H "Content-Type: application/json" \
  -d '{"force": false}'
```

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | API information |
| `/health` | GET | Health check and model status |
| `/predict` | POST | Single text sentiment prediction |
| `/predict/batch` | POST | Batch sentiment predictions |
| `/retrain` | POST | Trigger model retraining |
| `/model/info` | GET | Current and recent model information |
| `/data/quality` | GET | Data quality and drift monitoring |
| `/metrics` | GET | Prometheus metrics |
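The same endpoints can be called from Python with the standard library alone. This sketch only builds the request object (assuming the default local deployment on port 8000), so it runs without a live server:

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # assumes the default docker-compose setup

def build_predict_request(text: str) -> urllib.request.Request:
    """Build a POST request matching the /predict contract (not sent here)."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the stack running, send it via:
# urllib.request.urlopen(build_predict_request("Great service!"))
```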
The system uses MLflow and Weights & Biases for experiment tracking:
- Automatic Logging: All training runs are logged with hyperparameters and metrics
- Model Registry: Models are versioned and stored with metadata
- Experiment Comparison: Compare different model versions in MLflow/W&B UI
- Artifact Storage: Model checkpoints and artifacts are stored and versioned
The retraining pipeline automatically triggers when:
- Data Threshold: Accumulation of N new unlabeled samples (configurable)
- Time-Based: After X hours since last training (configurable)
- Performance Degradation: When model confidence drops below threshold
- Data Drift: When significant distribution shift is detected
Configure thresholds in `.env`:

```
MIN_SAMPLES_FOR_RETRAIN=1000
RETRAIN_INTERVAL_HOURS=24
```
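The trigger logic amounts to an OR over the conditions above. A hypothetical sketch (the function name and the 0.7 confidence floor are illustrative, not the project's actual values):

```python
from datetime import datetime, timedelta, timezone

MIN_SAMPLES_FOR_RETRAIN = 1000
RETRAIN_INTERVAL_HOURS = 24

def should_retrain(new_samples: int, last_trained: datetime,
                   mean_confidence: float, confidence_floor: float = 0.7) -> bool:
    """Illustrative version of the retraining triggers described above."""
    if new_samples >= MIN_SAMPLES_FOR_RETRAIN:          # data threshold
        return True
    age = datetime.now(timezone.utc) - last_trained
    if age >= timedelta(hours=RETRAIN_INTERVAL_HOURS):  # time-based
        return True
    return mean_confidence < confidence_floor           # degradation
```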
- `sentiment_predictions_total`: Total number of predictions
- `sentiment_prediction_duration_seconds`: Prediction latency
- `sentiment_prediction_confidence`: Model confidence distribution
- Training metrics (loss, accuracy, F1-score)
- Model parameters and hyperparameters
- Model artifacts and checkpoints
- Sentiment distribution tracking
- Prediction confidence monitoring
- Data drift detection
- Model performance degradation alerts
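One cheap drift signal is the distance between the reference and current sentiment label distributions. This stdlib-only sketch uses total variation distance; the project's actual drift detector may use a different statistic:

```python
from collections import Counter

def label_distribution(labels: list[str]) -> dict[str, float]:
    """Normalized label frequencies."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def drift_score(ref: dict[str, float], cur: dict[str, float]) -> float:
    """Total variation distance: 0 = identical distributions, 1 = disjoint."""
    keys = set(ref) | set(cur)
    return 0.5 * sum(abs(ref.get(k, 0.0) - cur.get(k, 0.0)) for k in keys)

ref = label_distribution(["positive"] * 8 + ["negative"] * 2)
cur = label_distribution(["positive"] * 2 + ["negative"] * 8)
print(drift_score(ref, cur))
```

A retraining trigger could then compare this score against a configurable threshold.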
GitHub Actions workflows:
- CI Pipeline (`.github/workflows/ci.yml`):
  - Code linting (flake8, black)
  - Type checking (mypy)
  - Unit tests with coverage
  - Security scanning (Trivy)
  - Docker image build
- CD Pipeline (`.github/workflows/cd.yml`):
  - Docker image build and push
  - Automated deployment
  - Model retraining trigger
Required repository secrets:

- `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`
- `DOCKER_USERNAME`, `DOCKER_PASSWORD`
- `DEPLOY_KEY`, `DEPLOY_HOST`
- `MLFLOW_TRACKING_URI`, `WANDB_API_KEY`
- `DATABASE_URL`
```
# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=src --cov-report=html

# Run specific test file
pytest tests/test_preprocessing.py
```

```
# Format code
black src/

# Check formatting
black --check src/

# Lint code
flake8 src/

# Type check
mypy src/ --ignore-missing-imports
```

- Update environment variables for production
- Build Docker image: `docker build -t sentiment-analysis:latest .`
- Deploy using docker-compose or orchestration tool (Kubernetes, ECS, etc.)
- Set up monitoring and alerting
- Configure backup and disaster recovery
- Use Redis for caching and rate limiting
- Deploy multiple API instances behind a load balancer
- Use Celery for distributed task processing
- Consider using GPU instances for model training
- Implement horizontal scaling for API workers
Key configuration options in `config.py`:

- `MODEL_NAME`: Base transformer model (default: `distilbert-base-uncased`)
- `MAX_LENGTH`: Maximum sequence length for tokenization
- `BATCH_SIZE`: Training and inference batch size
- `LEARNING_RATE`: Model training learning rate
- `NUM_EPOCHS`: Number of training epochs
- `MIN_SAMPLES_FOR_RETRAIN`: Minimum samples to trigger retraining
- `RETRAIN_INTERVAL_HOURS`: Hours between automatic retraining
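These options could be grouped in a typed settings object. In this sketch, only `distilbert-base-uncased`, `MIN_SAMPLES_FOR_RETRAIN=1000`, and `RETRAIN_INTERVAL_HOURS=24` come from this README; the other numeric defaults are placeholders, and the class name is hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainingConfig:
    """Illustrative mirror of the options above; numeric defaults are placeholders."""
    model_name: str = "distilbert-base-uncased"
    max_length: int = 256        # placeholder
    batch_size: int = 32         # placeholder
    learning_rate: float = 2e-5  # placeholder
    num_epochs: int = 3          # placeholder
    min_samples_for_retrain: int = 1000
    retrain_interval_hours: int = 24

config = TrainingConfig()
print(config.model_name)  # distilbert-base-uncased
```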
- Reddit API Rate Limits: PRAW handles rate limiting automatically (60 req/min)
- Memory Issues: Reduce batch size or use gradient accumulation
- Database Connection Errors: Check PostgreSQL is running and credentials are correct
- Model Loading Errors: Ensure model files exist and are not corrupted
```
# API logs
docker-compose logs api

# Celery worker logs
docker-compose logs celery_worker

# MLflow logs
docker-compose logs mlflow
```

For issues and questions:
- Create an issue on GitHub
- Check existing documentation
- Review MLflow and W&B dashboards for model insights
- Model A/B testing framework
- GraphQL API option
- Enhanced data drift detection
- Multi-language support
- Active learning pipeline
- Model interpretability (SHAP, LIME)
- Real-time prediction dashboard
- Advanced data augmentation
- Multi-model ensemble predictions
- Edge deployment for inference
- Federated learning support
- Custom sentiment categories per client
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes with tests
- Ensure all tests pass and code is formatted (`black src/`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- HuggingFace Transformers for transformer models
- FastAPI for the excellent web framework
- MLflow for experiment tracking
- PRAW for the Reddit API wrapper
- Reddit for free API access