Digital Trace Topic Modeling

A topic modeling pipeline for analyzing social media content from digital trace data across six platforms: Bluesky, TikTok, Instagram, Twitter, YouTube, and Telegram.

Overview

This tool automates topic modeling of social media posts: it embeds the posts with a sentence-transformer model, clusters the embeddings, and extracts the topics and themes that characterize each cluster. It supports two modeling approaches:

BERTopic: State-of-the-art neural topic modeling using transformer embeddings
Toponymy: Topic modeling that uses a language model to name clusters of document embeddings

Features

Multi-platform data collection from 6 social media platforms
Configurable topic modeling with BERTopic and Toponymy models
Automatic cluster size optimization based on dataset size
Multilingual support with sentence transformers
API-based data access with authentication
Flexible configuration system
Export capabilities for analysis results

Prerequisites

Python 3.11+ (required for some dependencies)
Access to MEO Insights Hub API (credentials required)
Virtual environment (recommended)
GPU (recommended): NVIDIA GPU with CUDA support for significant performance improvements
RAM: Minimum 8GB, 16GB+ recommended for large datasets

Installation

Clone the repository:

git clone <repository-url>
cd digitaltrace-topicmodeling

Create and activate virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

# Install toponymy from GitHub (required for toponymy model)
pip install git+https://github.com/TutteInstitute/toponymy.git

GPU Support (Optional but Recommended)

For significantly faster processing with NVIDIA GPUs, install additional GPU-accelerated libraries:

# For GPU-accelerated UMAP and HDBSCAN (RAPIDS wheels are served from NVIDIA's package index)
pip install --extra-index-url=https://pypi.nvidia.com cuml-cu11==23.12.0

# For GPU-accelerated PaCMAP (if using pacmap for dimensionality reduction)
pip install git+https://github.com/hyhuang00/ParamRepulsor.git

Performance Benefits with GPU:

Embedding generation: 3-5x faster with multi-GPU support
UMAP dimensionality reduction: 10-20x faster with cuml
HDBSCAN clustering: 5-10x faster with cuml
Overall pipeline: 3-8x faster end-to-end processing

GPU Requirements:

NVIDIA GPU with CUDA Compute Capability 7.0+ (Volta, or RTX 20 series and newer)
CUDA 11.8 or 12.x
8GB+ GPU memory recommended for large datasets (>50K documents)

Configuration

API Configuration

Edit config.cfg to set up your API credentials:

[fastapi]
api = https://api.meoinsightshub.net
username = your_username
password = your_password
timeout = 30

[model]
default_type = bertopic
n_neighbors = 15
n_components = 2
embedding_model = paraphrase-multilingual-MiniLM-L12-v2
dim_reduction = umap

[llm]
openai_api_key = your_openai_api_key

Important: Replace the placeholder values above with your actual API credentials and OpenAI API key.
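
As a rough illustration of how these settings can be read, the sketch below parses the same sections with Python's standard configparser module. The helper name load_settings and the fallback values are assumptions made for this example; the actual loading code in main.py may differ.

```python
import configparser


def load_settings(text: str) -> dict:
    """Parse config text and return typed settings (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        "api": parser.get("fastapi", "api"),
        "timeout": parser.getint("fastapi", "timeout", fallback=30),
        "model_type": parser.get("model", "default_type", fallback="bertopic"),
        "n_neighbors": parser.getint("model", "n_neighbors", fallback=15),
        "n_components": parser.getint("model", "n_components", fallback=2),
        "embedding_model": parser.get("model", "embedding_model"),
        "dim_reduction": parser.get("model", "dim_reduction", fallback="umap"),
        # The OpenAI key is optional here; None means "not configured".
        "openai_api_key": parser.get("llm", "openai_api_key", fallback=None),
    }


# Example input without an [llm] section, to show the fallback behaviour.
EXAMPLE = """
[fastapi]
api = https://api.meoinsightshub.net
username = your_username
password = your_password
timeout = 30

[model]
default_type = bertopic
n_neighbors = 15
n_components = 2
embedding_model = paraphrase-multilingual-MiniLM-L12-v2
dim_reduction = umap
"""

settings = load_settings(EXAMPLE)
print(settings["embedding_model"], settings["timeout"])
```

Reading numeric options with getint catches a malformed value such as "timeout = thirty" when the file is loaded rather than in the middle of a run.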

Usage

Run the topic modeling pipeline with the following command:

python main.py \
    --config config.cfg \
    --platforms "twitter,instagram,youtube" \
    --date-start "2024-01-01" \
    --date-end "2024-01-31" \
    --query "climate change" \
    --model bertopic \
    --output-path "./output" \
    --name "climate_analysis" \
    --save-embeds False

Parameters

| Parameter | Required | Description | Examples |
|---|---|---|---|
| `--config` | Yes | Configuration file path | `config.cfg` |
| `--platforms` | Yes | Comma-separated list of platforms | `twitter,instagram,youtube,bluesky,tiktok,telegram` |
| `--date-start` | Yes | Start date for data collection | `2024-01-01` |
| `--date-end` | Yes | End date for data collection | `2024-01-31` |
| `--query` | Yes | Search query/keywords | `"climate change"` |
| `--model` | No | Topic model type (default: bertopic) | `bertopic`, `toponymy` |
| `--output-path` | Yes | Directory for output files | `./output` |
| `--name` | Yes | Prefix for output files | `climate_analysis` |
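
For readers who want to see how the parameters in the table fit together, here is a minimal argparse sketch. It is an illustration only and not the parser used in main.py: the helper names parse_platforms and build_parser are hypothetical, the validation rules (platform whitelist, ISO dates) are assumptions, and only the parameters listed in the table are covered.

```python
import argparse
from datetime import date

# The six platforms named in this README.
SUPPORTED_PLATFORMS = {
    "bluesky", "tiktok", "instagram", "twitter", "youtube", "telegram",
}


def parse_platforms(value: str) -> list[str]:
    """Split a comma-separated platform list and reject unknown names."""
    names = [p.strip().lower() for p in value.split(",") if p.strip()]
    unknown = sorted(set(names) - SUPPORTED_PLATFORMS)
    if not names or unknown:
        raise argparse.ArgumentTypeError(
            f"unsupported platforms: {unknown or value!r}"
        )
    return names


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented parameters (hypothetical)."""
    parser = argparse.ArgumentParser(description="Topic modeling pipeline")
    parser.add_argument("--config", required=True)
    parser.add_argument("--platforms", required=True, type=parse_platforms)
    # date.fromisoformat raises ValueError on bad input, which argparse
    # turns into a normal usage error.
    parser.add_argument("--date-start", required=True, type=date.fromisoformat)
    parser.add_argument("--date-end", required=True, type=date.fromisoformat)
    parser.add_argument("--query", required=True)
    parser.add_argument("--model", default="bertopic",
                        choices=["bertopic", "toponymy"])
    parser.add_argument("--output-path", required=True)
    parser.add_argument("--name", required=True)
    return parser


args = build_parser().parse_args([
    "--config", "config.cfg",
    "--platforms", "twitter,instagram,youtube",
    "--date-start", "2024-01-01",
    "--date-end", "2024-01-31",
    "--query", "climate change",
    "--output-path", "./output",
    "--name", "climate_analysis",
])
print(args.platforms, args.model)
```

Validating platforms and dates at parse time means a typo is reported before any API request is made.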

Models

BERTopic

Uses transformer-based embeddings for semantic understanding
Supports multilingual content with paraphrase-multilingual-MiniLM-L12-v2
Employs UMAP for dimensionality reduction and HDBSCAN for clustering
Automatically determines optimal number of topics
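
The components listed above can be wired together roughly as follows. This is a minimal sketch, not the contents of bert_model.py: the function names component_params and build_topic_model, the min_cluster_size default, and the min_dist, metric, and cluster_selection_method values are assumptions chosen for illustration. The heavy imports (bertopic, umap-learn, hdbscan, sentence-transformers) are deferred so the parameter helper can be used without them.

```python
def component_params(n_neighbors=15, n_components=2, min_cluster_size=30):
    """Collect keyword arguments for UMAP and HDBSCAN (illustrative values)."""
    return {
        "umap": {
            "n_neighbors": n_neighbors,
            "n_components": n_components,
            "min_dist": 0.0,      # assumption: tight packing helps clustering
            "metric": "cosine",   # assumption: common choice for text embeddings
        },
        "hdbscan": {
            "min_cluster_size": min_cluster_size,
            "metric": "euclidean",
            "cluster_selection_method": "eom",
            "prediction_data": True,
        },
    }


def build_topic_model(
    embedding_model_name="paraphrase-multilingual-MiniLM-L12-v2", **kwargs
):
    """Assemble a BERTopic model from explicit sub-models (hypothetical helper)."""
    # Imported lazily: these packages are large and may download model weights.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    params = component_params(**kwargs)
    return BERTopic(
        embedding_model=SentenceTransformer(embedding_model_name),
        umap_model=UMAP(**params["umap"]),
        hdbscan_model=HDBSCAN(**params["hdbscan"]),
    )


params = component_params()
print(params["umap"], params["hdbscan"])

# Usage (requires the libraries above and an internet connection):
# topic_model = build_topic_model(min_cluster_size=30)
# topics, probabilities = topic_model.fit_transform(list_of_post_texts)
```

The n_neighbors and n_components defaults mirror the [model] section of config.cfg shown earlier.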

Toponymy

Built on the Toponymy library, which uses a large language model to name clusters of document embeddings
Produces human-readable topic names at several levels of granularity
Despite its name, it is not specific to geographic content or place names
Note: If you don't have an OpenAI API key, the toponymy model can also run on local language models

Output

The pipeline generates several output files in the specified output directory:

Topic modeling results and visualizations
Cluster assignments for each post
Topic keywords and representations
Statistical summaries and metrics

API Integration

This tool integrates with the MEO Insights Hub API to access digital trace data. The API provides:

Authenticated access to multi-platform social media data
Real-time data collection capabilities
Filtered querying by date range, keywords, and platforms
Structured data output ready for analysis

Troubleshooting

Common Issues

Authentication Error: Verify your API credentials in config.cfg
Empty Dataset: Check date range and query parameters
Memory Issues: For large datasets, consider processing in smaller batches
Model Loading: Ensure sentence-transformers models can download (internet connection required)
Missing OpenAI API Key: Add your OpenAI API key to config.cfg. For toponymy model, you can use local models instead if you don't have an OpenAI key.
GPU Dependencies: If GPU acceleration fails, ensure CUDA drivers and cuml are properly installed. The system will automatically fall back to CPU processing.
CUDA Version Mismatch: Ensure your CUDA version matches the cuml installation (use cuml-cu11 for CUDA 11.x or cuml-cu12 for CUDA 12.x)
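
The CPU fallback described under GPU Dependencies can be pictured with a sketch like the one below. It is an assumption about how such a fallback might be written, not code from this repository: the helper first_importable is hypothetical, and it simply tries the cuml modules before the CPU packages umap and hdbscan.

```python
import importlib


def first_importable(module_names):
    """Return the first module name that imports cleanly, or None."""
    for name in module_names:
        try:
            importlib.import_module(name)
            return name
        except Exception:
            # Covers a missing package as well as errors raised while the
            # package initialises (for example a CUDA version mismatch).
            continue
    return None


# Prefer the GPU implementations from cuml, then fall back to CPU packages.
umap_backend = first_importable(["cuml.manifold", "umap"])
hdbscan_backend = first_importable(["cuml.cluster", "hdbscan"])
print("UMAP backend:", umap_backend)
print("HDBSCAN backend:", hdbscan_backend)
```

If either line prints the CPU package while a GPU is present, the cuml installation is the first thing to check.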

Performance Optimization

Cluster size adjusts automatically with dataset size: 1.5% of the data, kept between a minimum of 30 and a maximum of 80
For very large datasets (>10K posts), consider filtering by specific platforms or shorter date ranges
The multilingual embedding model requires significant memory; ensure adequate RAM
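
One plausible reading of the cluster size rule above is "1.5% of the documents, clamped to the range 30 to 80". The sketch below encodes that reading. The function name and the rounding behaviour are assumptions; the pipeline's own calculation may differ in detail.

```python
def adaptive_cluster_size(n_docs, fraction=0.015, lower=30, upper=80):
    """Scale cluster size with dataset size, clamped to [lower, upper]."""
    target = round(n_docs * fraction)
    return max(lower, min(upper, target))


for n in (1_000, 4_000, 10_000):
    print(n, adaptive_cluster_size(n))
```

Under this reading, datasets below 2,000 posts all use the minimum of 30, and anything above roughly 5,300 posts uses the maximum of 80.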

Contributing

This tool is designed for users with access to MEO digital trace data. For technical issues or feature requests, contact the development team.

License

This project is proprietary and intended for authorized users of the MEO digital trace data platform.

