SkyCast pulls hourly weather forecasts for a set of cities from the free Open-Meteo API (no API key), lands the raw JSON in BigQuery, and transforms it with dbt into clean analytics marts. It is a self-contained portfolio project demonstrating the canonical scheduled ELT pattern on GCP.
Standalone learning project. No private data, no secrets, no external dependencies — the only data source is a public, keyless API.
Cloud Scheduler (hourly)
│ POST execution
▼
Cloud Workflow ──► Cloud Function (Python) ──► BigQuery inbound.raw_forecasts (raw JSON)
│ │
└──────────► Cloud Run job (dbt) ──────────────────────────┘
│ dbt run + test (tag:weather)
▼
stage.weather_typed ──► marts.daily_weather
marts.forecast_accuracy
Observability: log-based error metric + Cloud Monitoring alert
Flow: Scheduler triggers a Workflow → the Workflow invokes the ingestion Function
(raw JSON → inbound) → then runs the dbt Cloud Run job (inbound → stage → marts).
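The ingestion half of this flow can be sketched roughly as below. This is an illustration under stated assumptions, not the project's actual code: the helper names `build_forecast_url` and `to_raw_row`, the row fields `city`, `fetched_at` and `data`, and the choice of `temperature_2m` as the hourly variable are all hypothetical. The HTTP request and the BigQuery write are left out so the sketch runs offline.

```python
import json
from datetime import datetime, timezone
from urllib.parse import urlencode

# Public Open-Meteo forecast endpoint (no API key required).
OPEN_METEO_URL = "https://api.open-meteo.com/v1/forecast"


def build_forecast_url(latitude: float, longitude: float) -> str:
    """Build an hourly-forecast request URL for one city."""
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "hourly": "temperature_2m",  # assumed variable; the real list may differ
    }
    return f"{OPEN_METEO_URL}?{urlencode(params)}"


def to_raw_row(city: str, payload: dict, fetched_at: datetime) -> dict:
    """Wrap an API response verbatim so all typing can happen later in dbt."""
    return {
        "city": city,                         # assumed column name
        "fetched_at": fetched_at.isoformat(),  # assumed column name
        "data": json.dumps(payload),           # raw JSON, stored unmodified
    }


if __name__ == "__main__":
    print(build_forecast_url(52.52, 13.41))
    sample = {"hourly": {"time": ["2024-01-01T00:00"], "temperature_2m": [1.5]}}
    print(to_raw_row("Berlin", sample, datetime.now(timezone.utc)))
```

Keeping the row this thin is the point of the pattern: the function only fetches and wraps, so a change in the API's response shape does not require redeploying the loader.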
| Layer | Technology |
|---|---|
| Ingestion | Python 3.12, functions-framework, Cloud Functions (gen2) |
| Warehouse | BigQuery (inbound / stage / marts) |
| Transformation | dbt (dbt-bigquery), Cloud Run job |
| Orchestration | Cloud Workflows + Cloud Scheduler |
| IaC | Terraform (GCS remote state) |
| CI/CD | GitHub Actions + Workload Identity Federation |
| Tooling | uv, ruff, pytest |
skycast/
├── ingestion/ # Cloud Function: Open-Meteo -> BigQuery inbound
│ ├── main.py
│ ├── skycast/ # client, backend, config, logger
│ ├── config.yaml # cities + BQ target
│ └── tests/ # pytest, all clients mocked
├── dbt/ # stage + marts models, macros, runner.sh
├── infra/terraform/ # BQ, function, dbt job, workflow, scheduler, monitoring, IAM
├── Dockerfile # dbt image for the Cloud Run job
└── .github/workflows # ci.yaml (lint/test/validate), deploy.yaml (WIF)
Ingestion function (writes to a real BigQuery project):
cd ingestion
uv sync --dev
uv run ruff check .
uv run pytest -m "not integration"
# run the function locally against your own GCP project
export GCP_PROJECT=your-gcp-project
uv run functions-framework --target=ingest_weather --debug
curl http://localhost:8080

dbt models (needs gcloud auth application-default login):
cd dbt
export GCP_PROJECT=your-gcp-project
dbt deps --profiles-dir .
dbt run --profiles-dir . --select tag:weather
dbt test --profiles-dir . --select tag:weather

- Create a GCP project, enable APIs (BigQuery, Cloud Functions, Cloud Run, Workflows, Scheduler, Artifact Registry), and create: an Artifact Registry repo skycast, a function-source GCS bucket, a Terraform-state GCS bucket, and a Workload Identity Federation pool bound to this GitHub repo.
- Set repo Actions variables: GCP_PROJECT_ID, WIF_PROVIDER, DEPLOYER_SA, FUNCTION_SOURCE_BUCKET, TFSTATE_BUCKET.
- Push to main: deploy.yaml builds the dbt image, packages the function, and runs terraform plan/apply.
- ELT, not ETL: raw API JSON is stored verbatim in a data JSON column; all typing and shaping happens in dbt, so the warehouse keeps the source of truth.
- Idempotent + deduplicated: re-runs append snapshots; weather_typed keeps the latest row per (city, forecast_ts) via a reusable dedup macro.
- Keyless: no service-account keys anywhere; CI uses Workload Identity Federation and dbt uses OAuth / Application Default Credentials.
- Cost-aware: marts are partitioned by date and clustered by city; datasets live in one region; pause the scheduler when idle and terraform destroy between sessions.
- Observable from day one: a log-based error metric and an alert policy ship with the infrastructure.
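The deduplication rule in the list above (latest row per `(city, forecast_ts)`) can be illustrated in plain Python. This is only a sketch of the logic, not the dbt macro itself, which runs as SQL in BigQuery; the function name `dedup_latest` and the `loaded_at` field used to order snapshots are assumptions.

```python
from typing import Iterable


def dedup_latest(rows: Iterable[dict]) -> list[dict]:
    """Keep the most recently loaded row for each (city, forecast_ts) key."""
    latest: dict[tuple[str, str], dict] = {}
    for row in rows:
        key = (row["city"], row["forecast_ts"])
        current = latest.get(key)
        # ISO-8601 timestamps in one consistent format compare correctly as strings.
        if current is None or row["loaded_at"] > current["loaded_at"]:
            latest[key] = row
    return list(latest.values())


if __name__ == "__main__":
    snapshots = [
        {"city": "Berlin", "forecast_ts": "2024-01-01T12:00", "loaded_at": "2024-01-01T00:05", "temp": 1.0},
        {"city": "Berlin", "forecast_ts": "2024-01-01T12:00", "loaded_at": "2024-01-01T01:05", "temp": 1.4},
        {"city": "Oslo", "forecast_ts": "2024-01-01T12:00", "loaded_at": "2024-01-01T00:05", "temp": -3.0},
    ]
    for row in dedup_latest(snapshots):
        print(row["city"], row["temp"])
```

Because the rule depends only on the key and the load time, re-running ingestion for the same hour adds rows to the raw table but does not change what the deduplicated view returns, other than picking up the newer forecast.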
- ELT, not ETL. Ingestion stores raw API JSON verbatim and dbt does all typing/shaping. Why: schema changes never break ingestion and history can be reprocessed by re-running dbt. Lesson: keep the loader dumb; put logic in the warehouse.
- Cloud Workflows to sequence ingest → dbt rather than gluing functions with Pub/Sub. Tradeoff: one more service, but built-in retries and a clear run history.
- Idempotent by design. Re-runs append snapshots; weather_typed keeps the latest row per (city, forecast_ts) via ROW_NUMBER() … QUALIFY 1. Lesson: design for at-least-once delivery and deduplicate downstream instead of chasing exactly-once.
- Keyless everywhere. GitHub OIDC → Workload Identity Federation for CI, OAuth/ADC for dbt; no service-account keys in the repo. Learned: how to wire WIF end-to-end.
- source() + dbt tests over hardcoded table strings, so lineage and source freshness actually work (a deliberate fix to a pattern I saw drift in the reference codebase).
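To make the "keep the loader dumb" decision concrete, here is a rough Python rendering of the kind of typing work that is deferred to the warehouse. In the project this happens in dbt SQL, so treat this purely as an illustration: the function name `explode_hourly`, the output field `temperature_c`, and the use of `temperature_2m` are assumptions. It relies only on Open-Meteo returning hourly values as parallel arrays under an `hourly` key.

```python
import json
from datetime import datetime


def explode_hourly(city: str, raw_json: str) -> list[dict]:
    """Turn one stored raw response into typed rows, one per forecast hour."""
    payload = json.loads(raw_json)
    hourly = payload["hourly"]
    rows = []
    # "time" and each variable are parallel arrays of equal length.
    for ts, temp in zip(hourly["time"], hourly["temperature_2m"]):
        rows.append(
            {
                "city": city,
                "forecast_ts": datetime.fromisoformat(ts),
                # Missing values stay missing rather than becoming 0.0.
                "temperature_c": None if temp is None else float(temp),
            }
        )
    return rows


if __name__ == "__main__":
    raw = json.dumps(
        {"hourly": {"time": ["2024-01-01T00:00", "2024-01-01T01:00"], "temperature_2m": [1, None]}}
    )
    for row in explode_hourly("Berlin", raw):
        print(row)
```

Since the raw response is kept verbatim, a change to this typing logic can be applied to all history by re-running the transformation, without calling the API again.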
GCP serverless data engineering (Cloud Functions, Cloud Run jobs, Workflows, Scheduler) · BigQuery ELT with layered datasets · dbt modelling, testing, macros · partitioning & clustering · Terraform with remote state · GitHub Actions CI/CD with Workload Identity Federation · structured logging & alerting · tested, linted, reproducible Python.
cd infra/terraform && terraform destroy