TripLake ingests large public NYC TLC trip-record Parquet files into a Cloud Storage landing zone, exposes them as BigQuery external tables (schema-on-read), and builds incremental, partitioned native tables with dbt — producing marts like trips & revenue by zone/day and tipping behaviour by hour. A self-contained portfolio project focused on batch ingestion at scale and BigQuery cost/performance engineering.
Standalone learning project. The only source is the public TLC dataset (no auth).
NYC TLC Parquet (monthly, public)
│ Cloud Scheduler (monthly) ─► Cloud Workflow
▼
LOADER (Cloud Run job, argparse CLI)
• idempotent download → GCS landing (Hive layout: year=/month=)
│
▼
GCS landing gs://…-triplake-landing/yellow/year=YYYY/month=MM/*.parquet
│
▼
BigQuery EXTERNAL table ext.yellow_trips (schema-on-read, partition pruning)
│ dbt (Cloud Run job): incremental MERGE
▼
stage.trips_stage (native, partitioned by pickup_date, clustered by payment_type)
│
▼
marts.trips_daily · marts.tips_by_hour
Observability: dbt data-quality tests (tag:trips) + alert on job error logs
Two batch jobs, one workflow:
- loader —
--month YYYY-MMor--start/--endbackfill; skips existing objects (idempotent); streams Parquet straight to GCS. - dbt — external table → typed/deduped incremental
trips_stage→ marts; runs and teststag:trips.
| Layer | Technology |
|---|---|
| Ingestion | Python 3.12 Cloud Run job (batch, argparse CLI) |
| Lake | Cloud Storage (Hive-partitioned Parquet, lifecycle tiering) |
| Warehouse | BigQuery external + native partitioned/clustered tables |
| Transformation | dbt (incremental MERGE), Cloud Run job |
| Orchestration | Cloud Workflows + Cloud Scheduler |
| IaC | Terraform (GCS remote state) |
| CI/CD | GitHub Actions (matrix build) + Workload Identity Federation |
| Tooling | ruff, pytest |
triplake/
├── loader/ # Cloud Run job: TLC Parquet -> GCS landing (idempotent, backfill)
├── dbt/ # external source -> incremental stage -> marts, runner.sh, Dockerfile
├── infra/terraform/ # bucket+lifecycle, external table, run jobs, workflow, monitoring
└── .github/ # CI (lint/test + dbt parse + tf validate), deploy (WIF), dependabot
# loader
cd loader && pip install -e . pytest ruff && ruff check . && pytest -q
export LANDING_BUCKET=your-bucket
python main.py --dataset yellow --month 2024-01 # one month
python main.py --dataset yellow --start 2023-01 --end 2023-12 # backfill
# dbt (needs: gcloud auth application-default login)
cd dbt && export GCP_PROJECT=your-gcp-project
dbt deps --profiles-dir .
dbt build --profiles-dir . --select tag:trips # run + test- Create a GCP project; enable Cloud Run, BigQuery, Cloud Storage, Workflows, Scheduler,
Artifact Registry. Create an Artifact Registry repo
triplake, a Terraform-state bucket, and a Workload Identity Federation pool bound to this repo. - Set repo Actions variables:
GCP_PROJECT_ID,WIF_PROVIDER,DEPLOYER_SA,TFSTATE_BUCKET. - Push to
main—deploy.yamlbuilds both images (matrix) and runsterraform plan/apply. Run a backfill via the loader job with--start/--endoverrides.
- Schema-on-read first — land raw Parquet and query it via an external table; promote to native tables only where performance/cost warrants it.
- Hive partitioning —
year=/month=layout lets BigQuery prune external scans. - Incremental MERGE —
trips_stageonly scans partitions newer than its current max and upserts on a surrogatetrip_key; safe to re-run. - Partition + cluster — native tables are partitioned by
pickup_dateand clustered, so mart queries scan minimal bytes. - Idempotent backfill — the loader skips objects already present; re-running a range is free and safe.
- Cost-aware — GCS lifecycle tiers data to Nearline/Coldline; dirty out-of-range
timestamps are filtered to avoid junk partitions;
terraform destroywhen idle. - Quality gates — dbt tests (uniqueness, ranges, combinations) fail the job on bad data.
- Schema-on-read first. Land raw Parquet and query it via an external table; promote to native partitioned tables only where performance/cost justifies it. Lesson: don't pay to materialize data you might not query.
- Hive-partitioned layout (
year=/month=) so BigQuery prunes external scans by date — partitioning isn't only for native tables. - Incremental MERGE on a surrogate key with a "scan only newer partitions" predicate, so re-runs over millions of rows stay cheap. Lesson: incremental dbt at scale is about bounding the scan, not just the write.
- Idempotent backfill. The loader skips objects already in GCS, so re-running any month range is free and safe.
- Filter dirty data early. Real TLC files carry out-of-range timestamps that would create thousands of junk partitions — I caught this by inspecting the actual Parquet (2.96 M rows/month, column names verified) and added a sane date guard.
Lakehouse design · GCS lifecycle/tiering · BigQuery external & partitioned/clustered tables · partition pruning & query-cost optimization · dbt incremental models (MERGE) · batch processing on Cloud Run jobs · idempotent backfills · data-quality testing · Terraform with remote state · matrix CI/CD with Workload Identity Federation · observability.
cd infra/terraform && terraform destroy