Add Toronto Open Data pipeline and dataset documentation#1

Merged
Kingsolima merged 1 commit into main from omar/data
May 30, 2026

@Kingsolima

Owner
  • ml/fetch.py: CKAN download helpers with SSL bypass (Windows cert issue), ZIP/shapefile extraction, case-insensitive CSV lat/lon conversion, and GTFS stop parsing for TTC
  • ml/data_pipeline.py: downloads all 17 datasets to data/ as GeoParquet/Parquet; caches on disk, filters permits to 2020+, joins neighbourhood profiles (income/density) onto polygon boundaries
  • ml/requirements.txt: geopandas, xgboost, rasterio, python-dotenv
  • data/data.md: full dataset guide — buckets, column specs, handoff notes
  • .env.example: all required keys (WAQI, Mapbox, Anthropic, Ollama)
  • .gitignore: excludes the data files themselves (.parquet files); they are regenerated by running python ml/data_pipeline.py.
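
The "case-insensitive CSV lat/lon conversion" item for ml/fetch.py can be illustrated with a minimal standard-library sketch. This is not the code in the PR: the function names and the accepted header spellings below are assumptions for illustration, and the real helper presumably builds a GeoDataFrame rather than a list of tuples.

```python
import csv
import io

# Assumed header spellings; the real helper may accept a different set.
LAT_NAMES = {"lat", "latitude"}
LON_NAMES = {"lon", "long", "lng", "longitude"}


def find_lat_lon_columns(fieldnames):
    """Return the (lat, lon) header names, matching case-insensitively."""
    lat = lon = None
    for name in fieldnames:
        key = name.strip().lower()
        if lat is None and key in LAT_NAMES:
            lat = name
        elif lon is None and key in LON_NAMES:
            lon = name
    if lat is None or lon is None:
        raise ValueError(f"no lat/lon columns in {list(fieldnames)}")
    return lat, lon


def csv_to_points(text):
    """Parse CSV text into (lon, lat) float pairs, skipping unparseable rows."""
    reader = csv.DictReader(io.StringIO(text))
    lat_col, lon_col = find_lat_lon_columns(reader.fieldnames or [])
    points = []
    for row in reader:
        try:
            points.append((float(row[lon_col]), float(row[lat_col])))
        except (TypeError, ValueError):
            continue  # blank or non-numeric coordinate
    return points
```

Returning the original header names (not the lowercased keys) means the caller can still index rows with them.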

- ml/fetch.py: CKAN download helpers with SSL bypass (Windows cert issue),
  ZIP/shapefile extraction, case-insensitive CSV lat/lon conversion, and
  GTFS stop parsing for TTC
- ml/data_pipeline.py: downloads all 17 datasets to data/ as GeoParquet/
  Parquet; caches on disk, filters permits to 2020+, joins neighbourhood
  profiles (income/density) onto polygon boundaries
- ml/requirements.txt: geopandas, xgboost, rasterio, python-dotenv
- data/data.md: full dataset guide — buckets, column specs, handoff notes
- data/coefficients/: ITE trip generation rates + StatsCan I-O multipliers
- .env.example: all required keys (WAQI, Mapbox, Anthropic, Ollama)
- .gitignore: exclude data/*.parquet, data/*.tif, backend/models/, .venv
- .vscode/settings.json: Python interpreter + ml/ extra path for team
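
For the "caches on disk" behaviour of ml/data_pipeline.py, a rough sketch of the pattern, assuming a cache keyed only on whether the output file exists. `cached_bytes` and its `force` flag are hypothetical names, not taken from the PR.

```python
from pathlib import Path
from typing import Callable


def cached_bytes(path: Path, fetch: Callable[[], bytes], force: bool = False) -> bytes:
    """Return the file at `path` if present, otherwise fetch and store it."""
    if path.exists() and not force:
        return path.read_bytes()
    data = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a sibling temp file, then rename into place, so an
    # interrupted download does not leave a partial file at `path`.
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_bytes(data)
    tmp.replace(path)
    return data
```

A second call with the same path skips the download; passing `force=True` refreshes the cached copy.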
Copilot AI review requested due to automatic review settings May 30, 2026 03:05

Copilot AI left a comment

Contributor

Pull request overview

Adds an ML-oriented Toronto Open Data ingestion pipeline plus accompanying project/data documentation to support downstream model training and spatial lookups.

Changes:

  • Introduces ml/fetch.py CKAN download helpers (GeoJSON/ZIP/CSV/GTFS + raster download).
  • Adds ml/data_pipeline.py to download/cache multiple Toronto datasets into data/ as Parquet/GeoParquet (and a GeoTIFF), including a permits date filter and neighbourhood profile join.
  • Adds dataset guide + coefficient CSVs, along with updated README, env template, editor settings, and ignore rules.
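
As a sketch of what the GTFS part of the download helpers involves: a GTFS feed is a ZIP archive whose `stops.txt` carries `stop_id`, `stop_name`, `stop_lat` and `stop_lon` columns. The function below is an illustration using only the standard library; its name and return shape are assumptions and do not come from ml/fetch.py.

```python
import csv
import io
import zipfile


def parse_gtfs_stops(raw_zip: bytes) -> list[dict]:
    """Read stops.txt from a GTFS feed and return id, name, lat, lon per stop."""
    stops = []
    with zipfile.ZipFile(io.BytesIO(raw_zip)) as z:
        # Match on the file name so a feed zipped inside a folder still works.
        member = next((n for n in z.namelist() if n.lower().endswith("stops.txt")), None)
        if member is None:
            raise ValueError("stops.txt not found in GTFS archive")
        with z.open(member) as fh:
            # utf-8-sig drops a leading byte-order mark if the file has one.
            text = io.TextIOWrapper(fh, encoding="utf-8-sig", newline="")
            for row in csv.DictReader(text):
                try:
                    lat, lon = float(row["stop_lat"]), float(row["stop_lon"])
                except (KeyError, TypeError, ValueError):
                    continue  # skip stops without usable coordinates
                stops.append({
                    "stop_id": row["stop_id"],
                    "stop_name": row.get("stop_name", ""),
                    "lat": lat,
                    "lon": lon,
                })
    return stops
```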

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 27 comments.

| File | Description |
| --- | --- |
| README.md | Replaces placeholder README with Quick Start + architecture overview and dataset references. |
| ml/requirements.txt | Adds pinned geospatial/ML dependencies for the pipeline. |
| ml/fetch.py | Implements CKAN resource discovery + download/parsing helpers (geo/tabular/GTFS/raster). |
| ml/data_pipeline.py | Orchestrates dataset downloads, caching, light transformations, and writes to data/. |
| data/data.md | Adds comprehensive dataset documentation and intended runtime/training usage notes. |
| data/coefficients/statscan_io_multipliers.csv | Adds StatsCan multiplier lookup table for job/GDP estimates. |
| data/coefficients/ite_trip_rates.csv | Adds trip-generation lookup table (needs license/compliance review). |
| .vscode/settings.json | Adds workspace Python settings (currently Windows-specific interpreter path). |
| .gitignore | Ignores generated data/model artifacts and frontend build outputs. |
| .env.example | Documents required environment variables for data pipeline and runtime APIs. |

Comment thread ml/fetch.py
Comment on lines +41 to +45
def list_formats(ckan_id: str) -> list[tuple[str, str]]:
    """Debug helper: print available resource formats for a dataset."""
    pkg = _package(ckan_id)
    return [(r.get("format", "?"), r.get("name", "?"), r.get("url", "?")[:80])
            for r in pkg["resources"]]
Comment thread ml/fetch.py
Comment on lines +56 to +60
shp_files = [n for n in z.namelist() if n.lower().endswith(".shp")]
if shp_files:
    with tempfile.TemporaryDirectory() as tmpdir:
        z.extractall(tmpdir)
        return gpd.read_file(os.path.join(tmpdir, shp_files[0]))
Comment thread ml/fetch.py
Comment on lines +21 to +27
# Windows (Python 3.13) does not include the city's CA cert in its bundle.
# All requests to the Toronto Open Data API use verify=False.
# This is safe: we're reading public government data, not transmitting secrets.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

BASE = "https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/"
_SSL = False # set to True if you install the cert chain manually
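
One possible way to keep certificate verification on by default while still covering the Windows case described in the code comment above is to resolve the `verify=` value from the environment. `requests` accepts either a boolean or a path to a CA bundle for `verify`. The function name and both environment variable names below are hypothetical; nothing like them appears in the PR.

```python
import os


def resolve_verify(env=None):
    """Pick the value to pass as `verify=` to requests.get()."""
    env = os.environ if env is None else env
    # Hypothetical variable: path to a PEM file holding the missing CA chain.
    bundle = env.get("TORONTO_CA_BUNDLE")
    if bundle:
        return bundle
    # Hypothetical opt-out flag; verification stays on unless it is set.
    insecure = env.get("TORONTO_INSECURE_SSL", "").lower() in {"1", "true", "yes"}
    return not insecure
```

With this shape, `_SSL = resolve_verify()` would default to `True`, and disabling verification becomes an explicit per-machine choice.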
Comment thread ml/fetch.py
Comment on lines +179 to +181
if res.get("format", "").lower() == "zip" and res.get("url"):
    raw = requests.get(res["url"], timeout=120, verify=_SSL).content
    with zipfile.ZipFile(io.BytesIO(raw)) as z:
Comment thread ml/data_pipeline.py
Comment on lines +251 to +255
if date_col and name == "building_permits_cleared":
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
    df = df[df[date_col] >= PERMITS_FILTER_DATE]
    print(f" [{name}] filtered to {PERMITS_FILTER_DATE}+ -> {len(df):,} rows")
_save(name, df, lm)
Comment thread README.md
Comment on lines +17 to +27
### 2. Train the ML models (requires NVIDIA GPU)
```bash
python train_models.py
```

### 3. Start Ollama + FastAPI backend
```bash
# From the project root
docker-compose up -d
# Pull the LLM model on first run
docker exec -it nvidia-hackathon-ollama-1 ollama pull llama3.1:8b
```
Comment thread .vscode/settings.json
Comment on lines +1 to +4
{
  "python.defaultInterpreterPath": "${workspaceFolder}/.venv/Scripts/python.exe",
  "python.analysis.extraPaths": ["${workspaceFolder}/ml"]
}
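
The interpreter path in this settings file follows the Windows venv layout; on macOS and Linux a venv places the interpreter at `.venv/bin/python`. A small sketch of the difference, with a hypothetical helper name:

```python
import os
from pathlib import Path


def venv_python(root, os_name=os.name):
    """Return the interpreter path inside `<root>/.venv` for the given OS."""
    venv = Path(root) / ".venv"
    if os_name == "nt":
        # Windows venvs keep executables under Scripts\
        return venv / "Scripts" / "python.exe"
    # POSIX venvs (macOS, Linux) keep them under bin/
    return venv / "bin" / "python"
```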
Comment thread .env.example
Comment on lines +1 to +3
# Toronto Open Data CKAN API (no key required)
TORONTO_CKAN_BASE=https://ckan0.cf.opendata.inter.prod-toronto.ca/api/3/action/

Comment thread data/coefficients/ite_trip_rates.csv
Comment on lines +1 to +4
building_type,ite_code,daily_trips_per_1000sqft,peak_am_per_1000sqft,peak_pm_per_1000sqft,source
residential,220,6.65,0.35,0.44,ITE Trip Generation 11th Ed (per unit: 6.65/unit residential mid-rise)
commercial_office,710,11.03,1.56,1.49,ITE Trip Generation 11th Ed
retail_general,820,42.70,1.03,4.24,ITE Trip Generation 11th Ed
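
To show how a lookup table of this shape might be applied, here is a worked sketch that scales the per-1000-sqft rates by floor area. The function names are hypothetical, and only the office and retail rows quoted above are used; the residential row is left out because its source note says that rate is per unit, not per 1000 sqft.

```python
import csv
import io

# Rows copied from the quoted hunk (residential omitted, see note above).
RATES_CSV = """building_type,ite_code,daily_trips_per_1000sqft,peak_am_per_1000sqft,peak_pm_per_1000sqft,source
commercial_office,710,11.03,1.56,1.49,ITE Trip Generation 11th Ed
retail_general,820,42.70,1.03,4.24,ITE Trip Generation 11th Ed
"""


def load_rates(text):
    """Index the rate rows by building_type."""
    return {row["building_type"]: row for row in csv.DictReader(io.StringIO(text))}


def estimate_trips(rates, building_type, floor_area_sqft):
    """Scale the per-1000-sqft rates by the building's floor area."""
    row = rates[building_type]
    scale = floor_area_sqft / 1000.0
    return {
        "daily": round(float(row["daily_trips_per_1000sqft"]) * scale, 2),
        "peak_am": round(float(row["peak_am_per_1000sqft"]) * scale, 2),
        "peak_pm": round(float(row["peak_pm_per_1000sqft"]) * scale, 2),
    }


rates = load_rates(RATES_CSV)
office = estimate_trips(rates, "commercial_office", 50_000)
# 50,000 sqft office: 11.03 * 50 = 551.5 daily trips
```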
Comment thread ml/fetch.py
Comment on lines +119 to +123
else:
    return (
        pd.read_csv(io.BytesIO(resp.content), encoding="latin-1", on_bad_lines="skip", low_memory=False),
        lm,
    )
@Kingsolima Kingsolima merged commit e9cd0ba into main May 30, 2026
1 check passed