Add Toronto Open Data pipeline and dataset documentation #1
Merged
Kingsolima (Owner) commented on May 30, 2026
- ml/fetch.py: CKAN download helpers with SSL bypass (Windows cert issue), ZIP/shapefile extraction, case-insensitive CSV lat/lon conversion, and GTFS stop parsing for TTC
- ml/data_pipeline.py: downloads all 17 datasets to data/ as GeoParquet/Parquet; caches on disk, filters permits to 2020+, joins neighbourhood profiles (income/density) onto polygon boundaries
- ml/requirements.txt: geopandas, xgboost, rasterio, python-dotenv
- data/data.md: full dataset guide: buckets, column specs, handoff notes
- data/coefficients/: ITE trip generation rates + StatsCan I-O multipliers
- .env.example: all required keys (WAQI, Mapbox, Anthropic, Ollama)
- .gitignore: excludes data/*.parquet, data/*.tif, backend/models/, and .venv; the data files themselves are not committed and are regenerated by running python ml/data_pipeline.py
- .vscode/settings.json: Python interpreter + ml/ extra path for team
Contributor
Pull request overview
Adds an ML-oriented Toronto Open Data ingestion pipeline plus accompanying project/data documentation to support downstream model training and spatial lookups.
Changes:
- Introduces `ml/fetch.py` CKAN download helpers (GeoJSON/ZIP/CSV/GTFS + raster download).
- Adds `ml/data_pipeline.py` to download/cache multiple Toronto datasets into `data/` as Parquet/GeoParquet (and a GeoTIFF), including a permits date filter and neighbourhood profile join.
- Adds dataset guide + coefficient CSVs, along with updated README, env template, editor settings, and ignore rules.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 27 comments.
| File | Description |
|---|---|
| README.md | Replaces placeholder README with Quick Start + architecture overview and dataset references. |
| ml/requirements.txt | Adds pinned geospatial/ML dependencies for the pipeline. |
| ml/fetch.py | Implements CKAN resource discovery + download/parsing helpers (geo/tabular/GTFS/raster). |
| ml/data_pipeline.py | Orchestrates dataset downloads, caching, light transformations, and writes to data/. |
| data/data.md | Adds comprehensive dataset documentation and intended runtime/training usage notes. |
| data/coefficients/statscan_io_multipliers.csv | Adds StatsCan multiplier lookup table for job/GDP estimates. |
| data/coefficients/ite_trip_rates.csv | Adds trip-generation lookup table (needs license/compliance review). |
| .vscode/settings.json | Adds workspace Python settings (currently Windows-specific interpreter path). |
| .gitignore | Ignores generated data/model artifacts and frontend build outputs. |
| .env.example | Documents required environment variables for data pipeline and runtime APIs. |
Comment on lines +41 to +45

    def list_formats(ckan_id: str) -> list[tuple[str, str]]:
        """Debug helper: print available resource formats for a dataset."""
        pkg = _package(ckan_id)
        return [(r.get("format", "?"), r.get("name", "?"), r.get("url", "?")[:80])
                for r in pkg["resources"]]
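
One thing worth checking in this hunk: the annotation promises 2-tuples while the comprehension builds 3-tuples, and the docstring says "print" although the function returns a list. A minimal sketch of one way to align them is below. `_package` is stubbed with made-up data so the block runs offline, and the `or "?"` guard assumes a resource may carry a null `url`, which the hunk does not confirm.

```python
from typing import Any


def _package(ckan_id: str) -> dict[str, Any]:
    """Stand-in for the real CKAN lookup (hypothetical data, no network)."""
    return {
        "resources": [
            {"format": "CSV", "name": "permits", "url": "https://example.org/permits.csv"},
            {"format": "ZIP", "name": "boundaries", "url": None},
        ]
    }


def list_formats(ckan_id: str) -> list[tuple[str, str, str]]:
    """Debug helper: return (format, name, url[:80]) for each resource."""
    pkg = _package(ckan_id)
    return [
        # `or "?"` also covers keys that exist but hold None or "".
        (r.get("format") or "?", r.get("name") or "?", (r.get("url") or "?")[:80])
        for r in pkg["resources"]
    ]
```

With `r.get("url", "?")[:80]` as written in the hunk, a resource whose `url` key is present but `None` would raise `TypeError`, because the default only applies when the key is missing.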
Comment on lines +56 to +60

    shp_files = [n for n in z.namelist() if n.lower().endswith(".shp")]
    if shp_files:
        with tempfile.TemporaryDirectory() as tmpdir:
            z.extractall(tmpdir)
            return gpd.read_file(os.path.join(tmpdir, shp_files[0]))
Comment on lines +21 to +27

    # Windows (Python 3.13) does not include the city's CA cert in its bundle.
    # All requests to the Toronto Open Data API use verify=False.
    # This is safe: we're reading public government data, not transmitting secrets.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    BASE = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/"
    _SSL = False  # set to True if you install the cert chain manually
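
Disabling verification affects more than secrecy: without it the client cannot confirm that the downloaded data really came from the city's server. The sketch below shows one way to make the bypass an explicit opt-in instead of the default. Both environment variable names (`TORONTO_CA_BUNDLE`, `TORONTO_INSECURE_SSL`) are hypothetical; the only library behaviour relied on is that the `verify` argument of `requests` accepts either a boolean or a path to a CA bundle.

```python
import os
from typing import Mapping, Optional, Union


def resolve_verify(env: Optional[Mapping[str, str]] = None) -> Union[bool, str]:
    """Pick the value to pass as requests' `verify` argument.

    Both variable names below are hypothetical, not part of the PR.
    """
    env = os.environ if env is None else env
    bundle = env.get("TORONTO_CA_BUNDLE")
    if bundle:
        return bundle  # path to a PEM file containing the missing chain
    if env.get("TORONTO_INSECURE_SSL") == "1":
        return False  # verification disabled only on explicit request
    return True  # default: normal certificate verification
```

Under this arrangement `_SSL = resolve_verify()` would keep the Windows workaround available while leaving verification on for everyone else.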
Comment on lines +179 to +181

    if res.get("format", "").lower() == "zip" and res.get("url"):
        raw = requests.get(res["url"], timeout=120, verify=_SSL).content
        with zipfile.ZipFile(io.BytesIO(raw)) as z:
Comment on lines +251 to +255

    if date_col and name == "building_permits_cleared":
        df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
        df = df[df[date_col] >= PERMITS_FILTER_DATE]
        print(f" [{name}] filtered to {PERMITS_FILTER_DATE}+ -> {len(df):,} rows")
    _save(name, df, lm)
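
With `errors="coerce"`, unparseable dates become `NaT`, and `NaT >= cutoff` is false, so those rows are removed by the filter without any trace in the log. The sketch below reports that count alongside the filtered frame. The cutoff value and the `ISSUED_DATE` column name are assumptions for illustration (the PR description only says permits are filtered to 2020+), and `filter_permits` is a hypothetical helper.

```python
import pandas as pd

# Assumed cutoff; the PR description only says permits are filtered to 2020+.
PERMITS_FILTER_DATE = pd.Timestamp("2020-01-01")


def filter_permits(df: pd.DataFrame, date_col: str) -> tuple[pd.DataFrame, int]:
    """Keep rows on/after the cutoff and report how many dates failed to parse."""
    parsed = pd.to_datetime(df[date_col], errors="coerce")
    unparsed = int(parsed.isna().sum())
    # Replace the column with parsed values, then apply the date mask.
    kept = df.assign(**{date_col: parsed})[parsed >= PERMITS_FILTER_DATE]
    return kept, unparsed


# Hypothetical sample: one old permit, one recent permit, one bad date.
sample = pd.DataFrame({"ISSUED_DATE": ["2019-05-01", "2021-03-15", "not a date"]})
kept, unparsed = filter_permits(sample, "ISSUED_DATE")
```

Printing `unparsed` next to the row count would make it visible when a schema or format change upstream starts discarding permits.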
Comment on lines +17 to +27

    ### 2. Train the ML models (requires NVIDIA GPU)
    ```bash
    python train_models.py
    ```

    ### 3. Start Ollama + FastAPI backend
    ```bash
    # From the project root
    docker-compose up -d
    # Pull the LLM model on first run
    docker exec -it nvidia-hackathon-ollama-1 ollama pull llama3.1:8b
Comment on lines +1 to +4

    {
      "python.defaultInterpreterPath": "${workspaceFolder}/.venv/Scripts/python.exe",
      "python.analysis.extraPaths": ["${workspaceFolder}/ml"]
    }
Comment on lines +1 to +3

    # Toronto Open Data CKAN API (no key required)
    TORONTO_CKAN_BASE=https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/
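
The env template defines `TORONTO_CKAN_BASE`, while the `ml/fetch.py` hunk shown earlier hard-codes the same URL in `BASE`. If the module does not already read the variable elsewhere, a sketch like the following would let the template value take effect. `ckan_base` is a hypothetical helper, and the trailing-slash normalisation is an assumption about how the URL is joined with action names.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/"


def ckan_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the CKAN action URL, always ending in exactly one slash."""
    env = os.environ if env is None else env
    # Fall back to the default when the variable is unset or blank.
    base = env.get("TORONTO_CKAN_BASE", "").strip() or DEFAULT_BASE
    return base.rstrip("/") + "/"
```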
Comment on lines +1 to +4

    building_type,ite_code,daily_trips_per_1000sqft,peak_am_per_1000sqft,peak_pm_per_1000sqft,source
    residential,220,6.65,0.35,0.44,ITE Trip Generation 11th Ed (per unit: 6.65/unit residential mid-rise)
    commercial_office,710,11.03,1.56,1.49,ITE Trip Generation 11th Ed
    retail_general,820,42.70,1.03,4.24,ITE Trip Generation 11th Ed
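
To show how a per-1000-sqft rate from this table would be applied, here is a small worked sketch. The two rows are copied from the hunk; the 50,000 sqft floor area and the `daily_trips` helper are made up for illustration. The residential row is left out on purpose, because its source note describes the 6.65 figure as per unit, so multiplying it by floor area would mix units.

```python
import csv
import io

# Rows copied from the hunk above; the floor area used below is a made-up example.
ITE_CSV = """building_type,ite_code,daily_trips_per_1000sqft,peak_am_per_1000sqft,peak_pm_per_1000sqft,source
commercial_office,710,11.03,1.56,1.49,ITE Trip Generation 11th Ed
retail_general,820,42.70,1.03,4.24,ITE Trip Generation 11th Ed
"""


def daily_trips(building_type: str, floor_area_sqft: float) -> float:
    """Estimate daily trips as rate-per-1000-sqft times (area / 1000)."""
    rates = {
        row["building_type"]: float(row["daily_trips_per_1000sqft"])
        for row in csv.DictReader(io.StringIO(ITE_CSV))
    }
    return rates[building_type] * floor_area_sqft / 1000.0


# 11.03 trips per 1000 sqft on 50,000 sqft gives roughly 551.5 daily trips.
office_trips = daily_trips("commercial_office", 50_000)
```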
Comment on lines +119 to +123

    else:
        return (
            pd.read_csv(io.BytesIO(resp.content), encoding="latin-1", on_bad_lines="skip", low_memory=False),
            lm,
        )
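
`on_bad_lines="skip"` drops malformed rows silently, and `latin-1` decoding never raises, so a UTF-8 file read this way yields garbled characters instead of an error. A sketch of collecting the skipped rows instead is below. It relies on pandas accepting a callable for `on_bad_lines`, which requires `engine="python"` and pandas 1.4 or newer; `low_memory` is omitted because it applies to the C engine. `read_csv_counting_bad_lines` is a hypothetical helper.

```python
import io

import pandas as pd


def read_csv_counting_bad_lines(raw: bytes) -> tuple[pd.DataFrame, list[list[str]]]:
    """Parse CSV bytes, collecting malformed rows instead of dropping them silently."""
    skipped: list[list[str]] = []

    def _record(bad_line: list[str]) -> None:
        skipped.append(bad_line)  # returning None tells pandas to skip the row

    df = pd.read_csv(
        io.BytesIO(raw),
        encoding="latin-1",
        engine="python",       # a callable on_bad_lines needs the python engine
        on_bad_lines=_record,  # pandas >= 1.4
    )
    return df, skipped


# The third line has three fields under a two-column header, so it is recorded.
df, skipped = read_csv_counting_bad_lines(b"a,b\n1,2\n3,4,5\n6,7\n")
```

Logging `len(skipped)` per dataset would show when an upstream file starts losing rows; the python engine is slower, so this may suit a validation pass better than every run.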