MEOMcGill/digitaltrace-topicmodeling

A topic modeling pipeline for analyzing social media content from digital trace data across multiple platforms including BlueSky, TikTok, Instagram, Twitter, YouTube, and Telegram.

Overview

This tool automates topic modeling for social media posts: it embeds each post with a sentence-transformer model, clusters the embeddings, and extracts topics and themes from the clusters. It supports two modeling approaches:

  • BERTopic: State-of-the-art neural topic modeling using transformer embeddings
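  • Toponymy: Language-model-assisted topic naming over hierarchical clusters of document embeddings (the name is borrowed from the study of place names; it is not specific to geographic content)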

Features

  • Multi-platform data collection from 6 social media platforms
  • Configurable topic modeling with BERTopic and Toponymy models
  • Automatic cluster size optimization based on dataset size
  • Multilingual support with sentence transformers
  • API-based data access with authentication
  • Flexible configuration system
  • Export capabilities for analysis results

Prerequisites

  • Python 3.11+ (required for some dependencies)
  • Access to MEO Insights Hub API (credentials required)
  • Virtual environment (recommended)
  • GPU (recommended): NVIDIA GPU with CUDA support for significant performance improvements
  • RAM: Minimum 8GB, 16GB+ recommended for large datasets

Installation

  1. Clone the repository:
git clone <repository-url>
cd digitaltrace-topicmodeling
  2. Create and activate virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

# Install toponymy from GitHub (required for toponymy model)
pip install git+https://github.com/TutteInstitute/toponymy.git

GPU Support (Optional but Recommended)

For significantly faster processing with NVIDIA GPUs, install additional GPU-accelerated libraries:

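# For GPU-accelerated UMAP and HDBSCAN (CUDA 11.x; RAPIDS wheels are served from NVIDIA's package index)
# On CUDA 12.x, install cuml-cu12 instead of cuml-cu11
pip install --extra-index-url=https://pypi.nvidia.com cuml-cu11==23.12.0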

# For GPU-accelerated PaCMAP (if using pacmap for dimensionality reduction)
pip install git+https://github.com/hyhuang00/ParamRepulsor.git

Performance Benefits with GPU:

  • Embedding generation: 3-5x faster with multi-GPU support
  • UMAP dimensionality reduction: 10-20x faster with cuml
  • HDBSCAN clustering: 5-10x faster with cuml
  • Overall pipeline: 3-8x faster end-to-end processing

GPU Requirements:

  • NVIDIA GPU with CUDA Compute Capability 7.0+ (Volta or newer, e.g. RTX 20 series and later)
  • CUDA 11.8 or 12.x
  • 8GB+ GPU memory recommended for large datasets (>50K documents)
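The compute capability requirement above can be checked before installing the GPU libraries. The following is a minimal sketch, assuming PyTorch is present (sentence-transformers depends on it); `meets_requirement` and `detect_capability` are hypothetical helper names, not part of this repository.

```python
def meets_requirement(capability, minimum=(7, 0)):
    """Compare a (major, minor) compute capability against the minimum."""
    return tuple(capability) >= tuple(minimum)


def detect_capability():
    """Return the (major, minor) capability of GPU 0, or None if unavailable."""
    try:
        import torch  # assumption: installed alongside sentence-transformers
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_capability()
if capability is None:
    print("No usable CUDA GPU detected; the pipeline would run on CPU.")
else:
    print(f"Compute capability {capability}: OK = {meets_requirement(capability)}")
```

An RTX 20 series card reports (7, 5), which passes the check; a GTX 10 series card reports (6, 1), which does not.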

Configuration

API Configuration

Edit config.cfg to set up your API credentials:

[fastapi]
api = https://api.meoinsightshub.net
username = your_username
password = your_password
timeout = 30

[model]
default_type = bertopic
n_neighbors = 15
n_components = 2
embedding_model = paraphrase-multilingual-MiniLM-L12-v2
dim_reduction = umap

[llm]
openai_api_key = your_openai_api_key

Important: Replace the default credentials with your actual API access credentials and OpenAI API key.
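For reference, a file in this layout can be read with Python's standard `configparser`. The sketch below parses the same sections from an inline string; `load_settings` is a hypothetical helper, and how `main.py` actually loads the file is not shown in this README.

```python
import configparser

# Sample text mirroring the config.cfg layout above (placeholder credentials).
SAMPLE_CFG = """
[fastapi]
api = https://api.meoinsightshub.net
username = your_username
password = your_password
timeout = 30

[model]
default_type = bertopic
n_neighbors = 15
n_components = 2
embedding_model = paraphrase-multilingual-MiniLM-L12-v2
dim_reduction = umap

[llm]
openai_api_key = your_openai_api_key
"""


def load_settings(text):
    """Parse config text and return typed values (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "api": parser.get("fastapi", "api"),
        "timeout": parser.getint("fastapi", "timeout"),
        "model_type": parser.get("model", "default_type"),
        "n_neighbors": parser.getint("model", "n_neighbors"),
        "n_components": parser.getint("model", "n_components"),
        "embedding_model": parser.get("model", "embedding_model"),
        "dim_reduction": parser.get("model", "dim_reduction"),
    }


settings = load_settings(SAMPLE_CFG)
print(settings["model_type"], settings["timeout"])
```

To read the real file, `parser.read("config.cfg")` would replace `read_string`.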

Usage

Run the topic modeling pipeline with the following command:

python main.py \
    --config config.cfg \
    --platforms "twitter,instagram,youtube" \
    --date-start "2024-01-01" \
    --date-end "2024-01-31" \
    --query "climate change" \
    --model bertopic \
    --output-path "./output" \
    --name "climate_analysis" \
    --save-embeds False

Parameters

| Parameter | Required | Description | Examples |
| --- | --- | --- | --- |
| --config | Yes | Configuration file path | config.cfg |
| --platforms | Yes | Comma-separated list of platforms | twitter,instagram,youtube,bluesky,tiktok,telegram |
| --date-start | Yes | Start date for data collection | 2024-01-01 |
| --date-end | Yes | End date for data collection | 2024-01-31 |
| --query | Yes | Search query/keywords | "climate change" |
| --model | No | Topic model type (default: bertopic) | bertopic, toponymy |
| --output-path | Yes | Directory for output files | ./output |
| --name | Yes | Prefix for output files | climate_analysis |
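As an illustration of how these parameters fit together, here is a minimal `argparse` sketch that accepts the same flags as the example command. It is not the parser used by `main.py`: the type conversions, the `str_to_bool` helper, and the default for `--save-embeds` are assumptions.

```python
import argparse
from datetime import date


def str_to_bool(value):
    """Interpret 'True'/'False'-style strings, as in `--save-embeds False`."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def split_platforms(value):
    """Turn 'twitter,instagram' into ['twitter', 'instagram']."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser():
    parser = argparse.ArgumentParser(description="Topic modeling pipeline (sketch)")
    parser.add_argument("--config", required=True)
    parser.add_argument("--platforms", required=True, type=split_platforms)
    parser.add_argument("--date-start", required=True, type=date.fromisoformat)
    parser.add_argument("--date-end", required=True, type=date.fromisoformat)
    parser.add_argument("--query", required=True)
    parser.add_argument("--model", default="bertopic", choices=["bertopic", "toponymy"])
    parser.add_argument("--output-path", required=True)
    parser.add_argument("--name", required=True)
    # Assumed default; the README does not document this flag.
    parser.add_argument("--save-embeds", type=str_to_bool, default=False)
    return parser


args = build_parser().parse_args([
    "--config", "config.cfg",
    "--platforms", "twitter,instagram,youtube",
    "--date-start", "2024-01-01",
    "--date-end", "2024-01-31",
    "--query", "climate change",
    "--output-path", "./output",
    "--name", "climate_analysis",
    "--save-embeds", "False",
])
print(args.platforms, args.model, args.save_embeds)
```

Parsing dates with `date.fromisoformat` rejects malformed values such as `2024-13-01` at the command line rather than later in the pipeline.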

Models

BERTopic

  • Uses transformer-based embeddings for semantic understanding
  • Supports multilingual content with paraphrase-multilingual-MiniLM-L12-v2
  • Employs UMAP for dimensionality reduction and HDBSCAN for clustering
  • Does not require the number of topics in advance; it follows from the clusters HDBSCAN finds
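The components listed above can be wired together roughly as follows. This is a sketch using the public BERTopic, sentence-transformers, umap-learn, and hdbscan APIs with the values from the sample config; the `build_topic_model` helper, the `min_dist`, `metric`, and `random_state` settings, and the default `min_cluster_size` are assumptions rather than this repository's actual code.

```python
def build_topic_model(
    embedding_model="paraphrase-multilingual-MiniLM-L12-v2",
    n_neighbors=15,
    n_components=2,
    min_cluster_size=30,
):
    """Assemble a BERTopic model from explicit sub-models (hypothetical helper)."""
    # Imported here so this module loads even when the heavy packages are absent.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    embedder = SentenceTransformer(embedding_model)
    reducer = UMAP(
        n_neighbors=n_neighbors,
        n_components=n_components,
        min_dist=0.0,
        metric="cosine",
        random_state=42,  # assumed: fixed seed for repeatable layouts
    )
    clusterer = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )
    return BERTopic(
        embedding_model=embedder,
        umap_model=reducer,
        hdbscan_model=clusterer,
    )


# Usage, given `docs` as a list of post texts:
#   topic_model = build_topic_model()
#   topics, probs = topic_model.fit_transform(docs)
#   print(topic_model.get_topic_info().head())
```

Posts that HDBSCAN treats as noise are assigned topic -1, so the topic count reported by `get_topic_info()` includes an outlier row.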

Toponymy

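  • Names and describes topics with a language model instead of relying on keyword lists alone (upstream project: TutteInstitute/toponymy)
  • Works over hierarchical clusters of document embeddings, producing topic names at several levels of detail
  • Not specific to geographic content, despite the name, which is borrowed from the study of place names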
  • Note: If you don't have an OpenAI API key, the toponymy model can also run on local language models

Output

The pipeline generates several output files in the specified output directory:

  • Topic modeling results and visualizations
  • Cluster assignments for each post
  • Topic keywords and representations
  • Statistical summaries and metrics

API Integration

This tool integrates with the MEO Insights Hub API to access digital trace data. The API provides:

  • Authenticated access to multi-platform social media data
  • Real-time data collection capabilities
  • Filtered querying by date range, keywords, and platforms
  • Structured data output ready for analysis

Troubleshooting

Common Issues

  1. Authentication Error: Verify your API credentials in config.cfg
  2. Empty Dataset: Check date range and query parameters
  3. Memory Issues: For large datasets, consider processing in smaller batches
  4. Model Loading: Ensure sentence-transformers models can download (internet connection required)
  5. Missing OpenAI API Key: Add your OpenAI API key to config.cfg. For toponymy model, you can use local models instead if you don't have an OpenAI key.
  6. GPU Dependencies: If GPU acceleration fails, ensure CUDA drivers and cuml are properly installed. The system will automatically fall back to CPU processing.
  7. CUDA Version Mismatch: Ensure your CUDA version matches the cuml installation (use cuml-cu11 for CUDA 11.x or cuml-cu12 for CUDA 12.x)
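The CPU fallback mentioned in item 6 can be pictured with the sketch below, which prefers the cuml implementations of UMAP and HDBSCAN and falls back to umap-learn and hdbscan. The helper names are hypothetical, and the repository's own fallback logic may differ.

```python
import importlib.util


def gpu_backend_available():
    """Return True if the cuml package can be found (it may still fail to load)."""
    return importlib.util.find_spec("cuml") is not None


def load_reducer_and_clusterer():
    """Return (backend, UMAP class, HDBSCAN class), preferring cuml."""
    if gpu_backend_available():
        try:
            from cuml.cluster import HDBSCAN
            from cuml.manifold import UMAP
            return "gpu", UMAP, HDBSCAN
        except Exception as exc:  # e.g. driver or CUDA version mismatch
            print(f"cuml import failed ({exc}); falling back to CPU")
    # CPU implementations with a similar constructor interface.
    from hdbscan import HDBSCAN
    from umap import UMAP
    return "cpu", UMAP, HDBSCAN


print("cuml found:", gpu_backend_available())
```

Running the last line alone is a quick way to tell whether a GPU failure comes from a missing package or from a driver or CUDA version problem at import time.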

Performance Optimization

  • The minimum cluster size scales with the dataset: 1.5% of the number of posts, bounded below by 30 and above by 80
  • For very large datasets (>10K posts), consider filtering by specific platforms or shorter date ranges
  • The multilingual embedding model requires significant memory; ensure adequate RAM
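One reading of the cluster-size rule above (1.5% of the data, bounded by 30 and 80) is a simple clamp. The function below is a sketch of that interpretation; its name and the rounding choice are assumptions, not taken from the repository.

```python
def min_cluster_size(n_docs, fraction=0.015, lower=30, upper=80):
    """Scale the minimum cluster size with the dataset, clamped to [lower, upper]."""
    scaled = round(n_docs * fraction)
    return max(lower, min(upper, scaled))


for n_docs in (1_000, 4_000, 10_000):
    print(n_docs, min_cluster_size(n_docs))
```

Under this reading, datasets below 2,000 posts all use 30, and datasets above roughly 5,333 posts all use 80.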

Contributing

This tool is designed for users with access to MEO digital trace data. For technical issues or feature requests, contact the development team.

License

This project is proprietary and intended for authorized users of the MEO digital trace data platform.
