Add Toronto Open Data pipeline and dataset documentation by Kingsolima · Pull Request #1 · Kingsolima/Nvidia-Hackathon

Kingsolima · 2026-05-30T03:05:35Z

ml/fetch.py: CKAN download helpers with SSL bypass (Windows cert issue), ZIP/shapefile extraction, case-insensitive CSV lat/lon conversion, and GTFS stop parsing for TTC
ml/data_pipeline.py: downloads all 17 datasets to data/ as GeoParquet/Parquet; caches on disk, filters permits to 2020+, joins neighbourhood profiles (income/density) onto polygon boundaries
ml/requirements.txt: geopandas, xgboost, rasterio, python-dotenv
data/data.md: full dataset guide — buckets, column specs, handoff notes
.env.example: all required keys (WAQI, Mapbox, Anthropic, Ollama)
.gitignore: excludes the data files themselves (.parquet files); they are regenerated by running python ml/data_pipeline.py.
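
For reviewers unfamiliar with the "case-insensitive CSV lat/lon conversion" mentioned above, here is a minimal sketch of the idea using only the standard library. The candidate header names and both function names are assumptions for illustration; the actual list in ml/fetch.py is not shown in this excerpt.

```python
import csv
import io

# Hypothetical candidate header names; ml/fetch.py may accept a different set.
LAT_NAMES = {"lat", "latitude"}
LON_NAMES = {"lon", "lng", "long", "longitude"}


def find_lat_lon_columns(fieldnames):
    """Return the (lat, lon) header names as they appear in the file, or None."""
    lat = next((f for f in fieldnames if f.strip().lower() in LAT_NAMES), None)
    lon = next((f for f in fieldnames if f.strip().lower() in LON_NAMES), None)
    return (lat, lon) if lat and lon else None


def rows_to_points(csv_text):
    """Parse CSV text into (lon, lat) float pairs, skipping rows that fail to parse."""
    reader = csv.DictReader(io.StringIO(csv_text))
    cols = find_lat_lon_columns(reader.fieldnames or [])
    if cols is None:
        return []
    lat_col, lon_col = cols
    points = []
    for row in reader:
        try:
            points.append((float(row[lon_col]), float(row[lat_col])))
        except (TypeError, ValueError):
            continue  # blank or malformed coordinate
    return points
```

The pairs are returned in (lon, lat) order because that is the x, y order point geometries expect.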

- ml/fetch.py: CKAN download helpers with SSL bypass (Windows cert issue), ZIP/shapefile extraction, case-insensitive CSV lat/lon conversion, and GTFS stop parsing for TTC
- ml/data_pipeline.py: downloads all 17 datasets to data/ as GeoParquet/Parquet; caches on disk, filters permits to 2020+, joins neighbourhood profiles (income/density) onto polygon boundaries
- ml/requirements.txt: geopandas, xgboost, rasterio, python-dotenv
- data/data.md: full dataset guide (buckets, column specs, handoff notes)
- data/coefficients/: ITE trip generation rates + StatsCan I-O multipliers
- .env.example: all required keys (WAQI, Mapbox, Anthropic, Ollama)
- .gitignore: exclude data/*.parquet, data/*.tif, backend/models/, .venv
- .vscode/settings.json: Python interpreter + ml/ extra path for team

Copilot

Pull request overview

Adds an ML-oriented Toronto Open Data ingestion pipeline plus accompanying project/data documentation to support downstream model training and spatial lookups.

Changes:

Introduces ml/fetch.py CKAN download helpers (GeoJSON/ZIP/CSV/GTFS + raster download).
Adds ml/data_pipeline.py to download/cache multiple Toronto datasets into data/ as Parquet/GeoParquet (and a GeoTIFF), including a permits date filter and neighbourhood profile join.
Adds dataset guide + coefficient CSVs, along with updated README, env template, editor settings, and ignore rules.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 27 comments.

Show a summary per file

| File | Description |
| --- | --- |
| README.md | Replaces placeholder README with Quick Start + architecture overview and dataset references. |
| ml/requirements.txt | Adds pinned geospatial/ML dependencies for the pipeline. |
| ml/fetch.py | Implements CKAN resource discovery + download/parsing helpers (geo/tabular/GTFS/raster). |
| ml/data_pipeline.py | Orchestrates dataset downloads, caching, light transformations, and writes to `data/`. |
| data/data.md | Adds comprehensive dataset documentation and intended runtime/training usage notes. |
| data/coefficients/statscan_io_multipliers.csv | Adds StatsCan multiplier lookup table for job/GDP estimates. |
| data/coefficients/ite_trip_rates.csv | Adds trip-generation lookup table (needs license/compliance review). |
| .vscode/settings.json | Adds workspace Python settings (currently Windows-specific interpreter path). |
| .gitignore | Ignores generated data/model artifacts and frontend build outputs. |
| .env.example | Documents required environment variables for data pipeline and runtime APIs. |


+def list_formats(ckan_id: str) -> list[tuple[str, str, str]]:
+    """Debug helper: return (format, name, truncated URL) for each resource in a dataset."""
+    pkg = _package(ckan_id)
+    # `or "?"` also covers resources whose "url" key is present but empty/None.
+    return [(r.get("format", "?"), r.get("name", "?"), (r.get("url") or "?")[:80])
+            for r in pkg["resources"]]


+            shp_files = [n for n in z.namelist() if n.lower().endswith(".shp")]
+            if shp_files:
+                with tempfile.TemporaryDirectory() as tmpdir:
+                    z.extractall(tmpdir)
+                    return gpd.read_file(os.path.join(tmpdir, shp_files[0]))
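
The hunk above extracts the whole archive before reading the first `.shp`. As one possible alternative, the sketch below extracts only that shapefile and its sidecar files. The function name, the extension list, and the choice to let the caller own the destination directory are all assumptions, not code from this PR.

```python
import io
import os
import tempfile
import zipfile

# Common shapefile sidecars; this list is an assumption and may need extending.
SHAPEFILE_EXTS = (".shp", ".shx", ".dbf", ".prj", ".cpg")


def extract_shapefile(zip_bytes, dest_dir):
    """Extract the first shapefile plus its sidecars; return the .shp path or None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        shp_names = [n for n in z.namelist() if n.lower().endswith(".shp")]
        if not shp_names:
            return None
        stem = os.path.splitext(shp_names[0])[0]
        for name in z.namelist():
            base, ext = os.path.splitext(name)
            # Keep only members that share the .shp file's path and base name.
            if base == stem and ext.lower() in SHAPEFILE_EXTS:
                z.extract(name, dest_dir)
        return os.path.join(dest_dir, shp_names[0])
```

A caller would wrap this in `tempfile.TemporaryDirectory()` and read the returned path (for example with `gpd.read_file`) before the directory is cleaned up, mirroring the structure of the original hunk.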


+# Windows (Python 3.13) does not include the city's CA cert in its bundle.
+# All requests to the Toronto Open Data API use verify=False.
+# This is safe: we're reading public government data, not transmitting secrets.
+urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
+
+BASE = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/"
+_SSL = False  # set to True if you install the cert chain manually
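
Since `verify=False` also drops the integrity guarantee on whatever is downloaded, one option is to keep verification on by default and make the bypass an explicit opt-in. The sketch below relies on the `verify=` argument of `requests` accepting `True`, `False`, or a path to a CA bundle; the two environment variable names are hypothetical and do not appear in this PR.

```python
import os


def resolve_verify(env=None):
    """Return a value suitable for the `verify=` argument of requests.get()."""
    env = os.environ if env is None else env
    bundle = env.get("TORONTO_CA_BUNDLE", "").strip()  # hypothetical variable name
    if bundle:
        return bundle  # path to a PEM file containing the missing cert chain
    if env.get("TORONTO_INSECURE_SSL", "").strip() == "1":  # hypothetical opt-in flag
        return False  # explicit bypass, e.g. on a Windows machine without the chain
    return True  # default: normal certificate verification
```

With this in place, `_SSL = resolve_verify()` would let teammates who have installed the cert chain use it without editing the module.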


+        if res.get("format", "").lower() == "zip" and res.get("url"):
+            raw = requests.get(res["url"], timeout=120, verify=_SSL).content
+            with zipfile.ZipFile(io.BytesIO(raw)) as z:


+            if date_col and name == "building_permits_cleared":
+                df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
+                df = df[df[date_col] >= PERMITS_FILTER_DATE]
+                print(f"    [{name}] filtered to {PERMITS_FILTER_DATE}+ -> {len(df):,} rows")
+            _save(name, df, lm)
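
To illustrate what the filter in this hunk does to rows with bad dates, here is a small standalone sketch. The cutoff value is assumed from the "2020+" wording in the PR description, and the column names and sample rows are made up.

```python
import pandas as pd

# Assumed value; the PR description only says permits are filtered to "2020+".
PERMITS_FILTER_DATE = pd.Timestamp("2020-01-01")


def filter_recent(df, date_col, cutoff=PERMITS_FILTER_DATE):
    """Parse date_col and keep rows on or after cutoff, without mutating df."""
    out = df.copy()
    # errors="coerce" turns unparseable values into NaT instead of raising.
    out[date_col] = pd.to_datetime(out[date_col], errors="coerce")
    # Comparisons against NaT are False, so those rows are dropped as well.
    return out[out[date_col] >= cutoff]


permits = pd.DataFrame(
    {
        "permit_num": ["A", "B", "C", "D"],  # made-up rows for illustration
        "issued_date": ["2019-06-01", "2021-03-15", "not a date", None],
    }
)
recent = filter_recent(permits, "issued_date")
```

Only permit B survives here. Rows with missing or malformed dates disappear silently, so the logged row count covers both pre-cutoff permits and unparseable ones.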


+### 2. Train the ML models (requires NVIDIA GPU)
+```bash
+python train_models.py
+```
+
+### 3. Start Ollama + FastAPI backend
+```bash
+# From the project root
+docker-compose up -d
+# Pull the LLM model on first run
+docker exec -it nvidia-hackathon-ollama-1 ollama pull llama3.1:8b


+{
+  "python.defaultInterpreterPath": "${workspaceFolder}/.venv/Scripts/python.exe",
+  "python.analysis.extraPaths": ["${workspaceFolder}/ml"]
+}


+# Toronto Open Data CKAN API (no key required)
+TORONTO_CKAN_BASE=https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/
+


+building_type,ite_code,daily_trips_per_1000sqft,peak_am_per_1000sqft,peak_pm_per_1000sqft,source
+residential,220,6.65,0.35,0.44,ITE Trip Generation 11th Ed (per unit: 6.65/unit residential mid-rise)
+commercial_office,710,11.03,1.56,1.49,ITE Trip Generation 11th Ed
+retail_general,820,42.70,1.03,4.24,ITE Trip Generation 11th Ed


+                else:
+                    return (
+                        pd.read_csv(io.BytesIO(resp.content), encoding="latin-1", on_bad_lines="skip", low_memory=False),
+                        lm,
+                    )


Copilot AI review requested due to automatic review settings May 30, 2026 03:05

Copilot started reviewing on behalf of Kingsolima May 30, 2026 03:05

Copilot AI reviewed May 30, 2026

Kingsolima merged commit e9cd0ba into main May 30, 2026
1 check passed

Kingsolima merged 1 commit into main from omar/data


2 participants
