A topic modeling pipeline for analyzing social media content from digital trace data across multiple platforms including BlueSky, TikTok, Instagram, Twitter, YouTube, and Telegram.
This tool provides automated topic modeling capabilities using advanced machine learning techniques to analyze social media posts and extract meaningful topics and themes. It supports two main modeling approaches:
- BERTopic: State-of-the-art neural topic modeling using transformer embeddings
- Toponymy: Geographic location-aware topic modeling

Features:
- Multi-platform data collection from 6 social media platforms
- Configurable topic modeling with BERTopic and Toponymy models
- Automatic cluster size optimization based on dataset size
- Multilingual support with sentence transformers
- API-based data access with authentication
- Flexible configuration system
- Export capabilities for analysis results

Requirements:
- Python 3.11+ (required for some dependencies)
- Access to MEO Insights Hub API (credentials required)
- Virtual environment (recommended)
- GPU (recommended): NVIDIA GPU with CUDA support for significant performance improvements
- RAM: Minimum 8GB, 16GB+ recommended for large datasets

Installation:
- Clone the repository:

      git clone <repository-url>
      cd digitaltrace-topicmodeling

- Create and activate a virtual environment:

      python -m venv venv
      source venv/bin/activate  # On Windows: venv\Scripts\activate

- Install dependencies:

      pip install -r requirements.txt
      # Install toponymy from GitHub (required for the toponymy model)
      pip install git+https://github.com/TutteInstitute/toponymy.git

For significantly faster processing with NVIDIA GPUs, install additional GPU-accelerated libraries:
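
    # For GPU-accelerated UMAP and HDBSCAN (use cuml-cu12 instead on CUDA 12.x)
    # RAPIDS wheels of this version are served from NVIDIA's package index; if pip
    # cannot find the package, add --extra-index-url=https://pypi.nvidia.com
    pip install cuml-cu11==23.12.0
    # For GPU-accelerated PaCMAP (if using pacmap for dimensionality reduction)
    pip install git+https://github.com/hyhuang00/ParamRepulsor.git

Performance Benefits with GPU: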
- Embedding generation: 3-5x faster with multi-GPU support
- UMAP dimensionality reduction: 10-20x faster with cuml
- HDBSCAN clustering: 5-10x faster with cuml
- Overall pipeline: 3-8x faster end-to-end processing
GPU Requirements:
- NVIDIA GPU with CUDA Compute Capability 7.0+ (Volta or newer, for example a V100 or the RTX 20 series and later)
- CUDA 11.8 or 12.x
- 8GB+ GPU memory recommended for large datasets (>50K documents)
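
Because the GPU libraries are optional, a pipeline needs some way to choose between GPU and CPU implementations. The following is a minimal sketch of one way to do that, not the project's actual code: `pick_backend` is a hypothetical helper that only checks whether the `cuml` package is importable and returns the names of the classes to use. It does not verify that a working GPU or driver is present.

```python
import importlib.util


def pick_backend(module_available=None):
    """Return which UMAP/HDBSCAN implementations to use.

    module_available is injectable for testing; by default it checks
    whether a top-level module is installed in this environment.
    """
    if module_available is None:
        def module_available(name):
            return importlib.util.find_spec(name) is not None

    if module_available("cuml"):
        # cuML provides GPU versions of both algorithms.
        return {
            "device": "gpu",
            "umap": "cuml.manifold.UMAP",
            "hdbscan": "cuml.cluster.HDBSCAN",
        }
    # CPU fallback: the umap-learn and hdbscan packages.
    return {
        "device": "cpu",
        "umap": "umap.UMAP",
        "hdbscan": "hdbscan.HDBSCAN",
    }


backend = pick_backend()
print(f"Using {backend['device']} implementations: {backend['umap']}, {backend['hdbscan']}")
```

Checking for the package up front keeps the decision in one place; a real pipeline would likely also catch runtime CUDA errors before falling back.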

Edit `config.cfg` to set up your API credentials:

    [fastapi]
    api = https://api.meoinsightshub.net
    username = your_username
    password = your_password
    timeout = 30

    [model]
    default_type = bertopic
    n_neighbors = 15
    n_components = 2
    embedding_model = paraphrase-multilingual-MiniLM-L12-v2
    dim_reduction = umap

    [llm]
    openai_api_key = your_openai_api_key

Important: Replace the default credentials with your actual API access credentials and OpenAI API key.
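
The configuration file uses INI syntax, so it can be read with Python's standard `configparser`. The sketch below is illustrative only: `load_settings` and the `has_placeholders` check are hypothetical names, and the pipeline's own loader may differ. It parses the sample values shown above from a string and flags credentials that are still placeholders.

```python
import configparser

SAMPLE = """
[fastapi]
api = https://api.meoinsightshub.net
username = your_username
password = your_password
timeout = 30

[model]
default_type = bertopic
n_neighbors = 15
n_components = 2
embedding_model = paraphrase-multilingual-MiniLM-L12-v2
dim_reduction = umap

[llm]
openai_api_key = your_openai_api_key
"""


def load_settings(text):
    """Parse INI text and return typed settings (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    placeholders = (
        parser.get("fastapi", "username") == "your_username"
        or parser.get("llm", "openai_api_key") == "your_openai_api_key"
    )
    return {
        "api": parser.get("fastapi", "api"),
        "timeout": parser.getint("fastapi", "timeout"),
        "model_type": parser.get("model", "default_type"),
        "n_neighbors": parser.getint("model", "n_neighbors"),
        "n_components": parser.getint("model", "n_components"),
        "embedding_model": parser.get("model", "embedding_model"),
        "dim_reduction": parser.get("model", "dim_reduction"),
        # True while the file still contains the sample credentials.
        "has_placeholders": placeholders,
    }


settings = load_settings(SAMPLE)
if settings["has_placeholders"]:
    print("config still contains placeholder credentials")
```

To read the real file instead of a string, `parser.read("config.cfg")` could replace `parser.read_string(text)`.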

Run the topic modeling pipeline with the following command:

    python main.py \
      --config config.cfg \
      --platforms "twitter,instagram,youtube" \
      --date-start "2024-01-01" \
      --date-end "2024-01-31" \
      --query "climate change" \
      --model bertopic \
      --output-path "./output" \
      --name "climate_analysis" \
      --save-embeds False

| Parameter | Required | Description | Examples |
|---|---|---|---|
| `--config` | Yes | Configuration file path | `config.cfg` |
| `--platforms` | Yes | Comma-separated list of platforms | `twitter,instagram,youtube,bluesky,tiktok,telegram` |
| `--date-start` | Yes | Start date for data collection | `2024-01-01` |
| `--date-end` | Yes | End date for data collection | `2024-01-31` |
| `--query` | Yes | Search query/keywords | `"climate change"` |
| `--model` | No | Topic model type (default: bertopic) | `bertopic`, `toponymy` |
| `--output-path` | Yes | Directory for output files | `./output` |
| `--name` | Yes | Prefix for output files | `climate_analysis` |
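
For readers who want to see how these parameters fit together, here is a sketch of an `argparse` parser that mirrors the table. It is an illustration under assumptions, not the contents of `main.py`: the `str_to_bool` helper, the ISO date parsing, and the `False` default for `--save-embeds` are all hypothetical choices.

```python
import argparse
from datetime import date


def str_to_bool(value):
    """Parse 'True'/'False' style strings; bool('False') would wrongly be True."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def split_platforms(value):
    """Turn 'twitter, instagram' into ['twitter', 'instagram']."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser():
    parser = argparse.ArgumentParser(description="Topic modeling pipeline (illustrative)")
    parser.add_argument("--config", required=True)
    parser.add_argument("--platforms", required=True, type=split_platforms)
    # date.fromisoformat rejects malformed dates, which argparse reports as a usage error.
    parser.add_argument("--date-start", required=True, type=date.fromisoformat)
    parser.add_argument("--date-end", required=True, type=date.fromisoformat)
    parser.add_argument("--query", required=True)
    parser.add_argument("--model", default="bertopic", choices=["bertopic", "toponymy"])
    parser.add_argument("--output-path", required=True)
    parser.add_argument("--name", required=True)
    # Assumption: embeddings are not saved unless requested.
    parser.add_argument("--save-embeds", type=str_to_bool, default=False)
    return parser


args = build_parser().parse_args([
    "--config", "config.cfg",
    "--platforms", "twitter,instagram,youtube",
    "--date-start", "2024-01-01",
    "--date-end", "2024-01-31",
    "--query", "climate change",
    "--output-path", "./output",
    "--name", "climate_analysis",
    "--save-embeds", "False",
])
```

Parsing `--save-embeds` through an explicit converter matters because the command passes the literal string `False`, which Python would otherwise treat as truthy.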

BERTopic:
- Uses transformer-based embeddings for semantic understanding
- Supports multilingual content with `paraphrase-multilingual-MiniLM-L12-v2`
- Employs UMAP for dimensionality reduction and HDBSCAN for clustering
- Automatically determines optimal number of topics

Toponymy:
- Specialized for geographic and location-based topic modeling
- Identifies place names and geographic references in content
- Useful for analyzing location-specific trends and discussions
- Note: If you don't have an OpenAI API key, the toponymy model can also run on local language models
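
For the BERTopic path, the combination of sentence-transformer embeddings, UMAP, and HDBSCAN can be wired together roughly as follows. This is a sketch under assumptions rather than the pipeline's implementation: the helper names are hypothetical, the `metric` and `random_state` values are common choices that the source does not specify, and the defaults simply mirror the sample configuration. It requires the `bertopic`, `umap-learn`, and `hdbscan` packages only when `build_bertopic` is actually called.

```python
def reduction_and_clustering_params(n_neighbors=15, n_components=2, min_cluster_size=30):
    """Keyword arguments for UMAP and HDBSCAN; defaults mirror the sample config."""
    umap_params = {
        "n_neighbors": n_neighbors,
        "n_components": n_components,
        "metric": "cosine",   # assumption: a common choice for sentence embeddings
        "random_state": 42,   # assumption: fixed seed for repeatable runs
    }
    hdbscan_params = {
        "min_cluster_size": min_cluster_size,
        "metric": "euclidean",
        "prediction_data": True,
    }
    return umap_params, hdbscan_params


def build_bertopic(embedding_model="paraphrase-multilingual-MiniLM-L12-v2", **kwargs):
    """Construct a BERTopic model with explicit UMAP and HDBSCAN components."""
    # Imported lazily: these packages are heavy and only needed when a model is built.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    umap_params, hdbscan_params = reduction_and_clustering_params(**kwargs)
    return BERTopic(
        embedding_model=embedding_model,  # a model name is loaded via sentence-transformers
        umap_model=UMAP(**umap_params),
        hdbscan_model=HDBSCAN(**hdbscan_params),
    )
```

With a list of post texts in `docs`, `topics, probs = build_bertopic().fit_transform(docs)` would fit the model and return a topic id per document. The first call downloads the embedding model, so it needs an internet connection.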
The pipeline generates several output files in the specified output directory:
- Topic modeling results and visualizations
- Cluster assignments for each post
- Topic keywords and representations
- Statistical summaries and metrics
This tool integrates with the MEO Insights Hub API to access digital trace data. The API provides:
- Authenticated access to multi-platform social media data
- Real-time data collection capabilities
- Filtered querying by date range, keywords, and platforms
- Structured data output ready for analysis

Troubleshooting:
- Authentication Error: Verify your API credentials in `config.cfg`
- Empty Dataset: Check date range and query parameters
- Memory Issues: For large datasets, consider processing in smaller batches
- Model Loading: Ensure sentence-transformers models can download (internet connection required)
- Missing OpenAI API Key: Add your OpenAI API key to `config.cfg`. For the toponymy model, you can use local models instead if you don't have an OpenAI key.
- GPU Dependencies: If GPU acceleration fails, ensure CUDA drivers and cuml are properly installed. The system will automatically fall back to CPU processing.
- CUDA Version Mismatch: Ensure your CUDA version matches the cuml installation (use `cuml-cu11` for CUDA 11.x or `cuml-cu12` for CUDA 12.x)

- Cluster size automatically adjusts based on dataset size (min: 30, max: 80, or 1.5% of data)
- For very large datasets (>10K posts), consider filtering by specific platforms or shorter date ranges
- The multilingual embedding model requires significant memory; ensure adequate RAM
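
The cluster-size rule above can be read in more than one way. The sketch below assumes one plausible reading: take 1.5% of the number of documents and keep the result between 30 and 80. The function name and this interpretation are assumptions, not the pipeline's confirmed behavior.

```python
def auto_min_cluster_size(n_docs, fraction=0.015, lower=30, upper=80):
    """Clamp fraction * n_docs to the range [lower, upper]."""
    return max(lower, min(upper, round(n_docs * fraction)))


for n in (1_000, 4_000, 10_000):
    print(n, auto_min_cluster_size(n))
```

Under this reading, small datasets always use the floor of 30, and anything above roughly 5,300 documents reaches the cap of 80.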
This tool is designed for users with access to MEO digital trace data. For technical issues or feature requests, contact the development team.
This project is proprietary and intended for authorized users of the MEO digital trace data platform.