worldLab_2026
This project extracts and processes "Then vs Now" image datasets from web articles, downloads images, geocodes locations, and splits images into past and current views.
- Install uv:
    curl -LsSf https://astral.sh/uv/install.sh | sh
- Create virtual environment:
    uv venv
- Activate and install dependencies:
    source .venv/bin/activate && uv pip install requests beautifulsoup4 pillow geopy
- Place the web page content in dataset_assets/data/page_content.txt
- Run the parser:
    cd dataset_assets/scripts && python parse_data.py
- Download images:
    python downloader.py
- Split images:
    python splitter.py
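
The split step can be pictured with a minimal sketch. It assumes each downloaded image is a side-by-side composite with the past view on the left and the current view on the right, which this README does not confirm; `splitter.py` may work differently. The names `split_boxes` and `split_image` are hypothetical and exist only for this illustration.

```python
from pathlib import Path


def split_boxes(width, height):
    """Return (left, right) crop boxes that halve an image at its vertical midline."""
    mid = width // 2
    return (0, 0, mid, height), (mid, 0, width, height)


def split_image(src, out_dir):
    """Save the left and right halves of src as <stem>_past and <stem>_now.

    Assumption: the left half is the past view and the right half the current view.
    """
    from PIL import Image  # Pillow, installed in the setup steps

    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with Image.open(src) as img:
        past_box, now_box = split_boxes(*img.size)  # img.size is (width, height)
        past_path = out_dir / f"{src.stem}_past{src.suffix}"
        now_path = out_dir / f"{src.stem}_now{src.suffix}"
        img.crop(past_box).save(past_path)
        img.crop(now_box).save(now_path)
    return past_path, now_path
```

Keeping the box arithmetic in its own function makes it easy to swap in a different split rule (for example a horizontal split) without touching the file handling.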
The processed dataset is stored in dataset_assets/data/dataset.json with the following format:
- title: Section title
- location: Extracted location
- date: Year
- image_url: URL of the image
- description: Text description
- geolocation: Latitude, longitude, and address (if available)
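
As a rough illustration of that format, the sketch below loads the file and checks that a record carries the listed keys. It assumes `dataset.json` holds a JSON list of objects and that `geolocation` may be null when no result is available; neither is confirmed here. `missing_keys` and `load_dataset` are hypothetical helpers, and the sample record uses placeholder values only.

```python
import json

EXPECTED_KEYS = ("title", "location", "date", "image_url", "description", "geolocation")


def missing_keys(record):
    """Return the expected keys that are absent from one dataset record."""
    return [key for key in EXPECTED_KEYS if key not in record]


def load_dataset(path="dataset_assets/data/dataset.json"):
    """Load the dataset, assuming it is a JSON list of record objects."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Placeholder record for illustration; values are not taken from the real dataset.
sample = {
    "title": "Example section",
    "location": "Example City",
    "date": "1950",
    "image_url": "https://example.com/image.jpg",
    "description": "Example description.",
    "geolocation": None,  # assumed representation of "not available"
}
```

Running `missing_keys` over every record returned by `load_dataset()` is a quick way to spot entries the parser could not fill in completely.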
Split images are saved in dataset_assets/images/ and dataset_assets/split_images/.
Here are some examples from the dataset:


