TripLake

A cost-efficient BigQuery lakehouse over public NYC taxi data

TripLake ingests large public NYC TLC trip-record Parquet files into a Cloud Storage landing zone, exposes them as BigQuery external tables (schema-on-read), and builds incremental, partitioned native tables with dbt — producing marts like trips & revenue by zone/day and tipping behaviour by hour. A self-contained portfolio project focused on batch ingestion at scale and BigQuery cost/performance engineering.

Standalone learning project. The only source is the public TLC dataset (no auth).

Architecture

NYC TLC Parquet (monthly, public)
        │   Cloud Scheduler (monthly) ─► Cloud Workflow
        ▼
  LOADER (Cloud Run job, argparse CLI)
   • idempotent download → GCS landing (Hive layout: year=/month=)
        │
        ▼
  GCS landing  gs://…-triplake-landing/yellow/year=YYYY/month=MM/*.parquet
        │
        ▼
  BigQuery EXTERNAL table ext.yellow_trips   (schema-on-read, partition pruning)
        │   dbt (Cloud Run job): incremental MERGE
        ▼
  stage.trips_stage  (native, partitioned by pickup_date, clustered by payment_type)
        │
        ▼
  marts.trips_daily   ·   marts.tips_by_hour
Observability: dbt data-quality tests (tag:trips) + alert on job error logs

Two batch jobs, one workflow:

loader — --month YYYY-MM or --start/--end backfill; skips existing objects (idempotent); streams Parquet straight to GCS.
dbt — external table → typed/deduped incremental trips_stage → marts; runs and tests tag:trips.

Tech stack

Layer	Technology
Ingestion	Python 3.12 Cloud Run job (batch, argparse CLI)
Lake	Cloud Storage (Hive-partitioned Parquet, lifecycle tiering)
Warehouse	BigQuery external + native partitioned/clustered tables
Transformation	dbt (incremental MERGE), Cloud Run job
Orchestration	Cloud Workflows + Cloud Scheduler
IaC	Terraform (GCS remote state)
CI/CD	GitHub Actions (matrix build) + Workload Identity Federation
Tooling	ruff, pytest

Repository layout

triplake/
├── loader/   # Cloud Run job: TLC Parquet -> GCS landing (idempotent, backfill)
├── dbt/      # external source -> incremental stage -> marts, runner.sh, Dockerfile
├── infra/terraform/  # bucket+lifecycle, external table, run jobs, workflow, monitoring
└── .github/  # CI (lint/test + dbt parse + tf validate), deploy (WIF), dependabot

Run locally

# loader
cd loader && pip install -e . pytest ruff && ruff check . && pytest -q
export LANDING_BUCKET=your-bucket
python main.py --dataset yellow --month 2024-01           # one month
python main.py --dataset yellow --start 2023-01 --end 2023-12   # backfill

# dbt (needs: gcloud auth application-default login)
cd dbt && export GCP_PROJECT=your-gcp-project
dbt deps --profiles-dir .
dbt build --profiles-dir . --select tag:trips             # run + test

Deploy

Create a GCP project; enable Cloud Run, BigQuery, Cloud Storage, Workflows, Scheduler, Artifact Registry. Create an Artifact Registry repo triplake, a Terraform-state bucket, and a Workload Identity Federation pool bound to this repo.
Set repo Actions variables: GCP_PROJECT_ID, WIF_PROVIDER, DEPLOYER_SA, TFSTATE_BUCKET.
Push to main — deploy.yaml builds both images (matrix) and runs terraform plan/apply. Run a backfill via the loader job with --start/--end overrides.

Design notes

Schema-on-read first — land raw Parquet and query it via an external table; promote to native tables only where performance/cost warrants it.
Hive partitioning — year=/month= layout lets BigQuery prune external scans.
Incremental MERGE — trips_stage only scans partitions newer than its current max and upserts on a surrogate trip_key; safe to re-run.
Partition + cluster — native tables are partitioned by pickup_date and clustered, so mart queries scan minimal bytes.
Idempotent backfill — the loader skips objects already present; re-running a range is free and safe.
Cost-aware — GCS lifecycle tiers data to Nearline/Coldline; dirty out-of-range timestamps are filtered to avoid junk partitions; terraform destroy when idle.
Quality gates — dbt tests (uniqueness, ranges, combinations) fail the job on bad data.

Key decisions & what I learned

Schema-on-read first. Land raw Parquet and query it via an external table; promote to native partitioned tables only where performance/cost justifies it. Lesson: don't pay to materialize data you might not query.
Hive-partitioned layout (year=/month=) so BigQuery prunes external scans by date — partitioning isn't only for native tables.
Incremental MERGE on a surrogate key with a "scan only newer partitions" predicate, so re-runs over millions of rows stay cheap. Lesson: incremental dbt at scale is about bounding the scan, not just the write.
Idempotent backfill. The loader skips objects already in GCS, so re-running any month range is free and safe.
Filter dirty data early. Real TLC files carry out-of-range timestamps that would create thousands of junk partitions — I caught this by inspecting the actual Parquet (2.96 M rows/month, column names verified) and added a sane date guard.

Skills demonstrated

Lakehouse design · GCS lifecycle/tiering · BigQuery external & partitioned/clustered tables · partition pruning & query-cost optimization · dbt incremental models (MERGE) · batch processing on Cloud Run jobs · idempotent backfills · data-quality testing · Terraform with remote state · matrix CI/CD with Workload Identity Federation · observability.

Teardown

cd infra/terraform && terraform destroy

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.github		.github
dbt		dbt
infra/terraform		infra/terraform
loader		loader
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TripLake

A cost-efficient BigQuery lakehouse over public NYC taxi data

Architecture

Tech stack

Repository layout

Run locally

Deploy

Design notes

Key decisions & what I learned

Skills demonstrated

Teardown

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TripLake

A cost-efficient BigQuery lakehouse over public NYC taxi data

Architecture

Tech stack

Repository layout

Run locally

Deploy

Design notes

Key decisions & what I learned

Skills demonstrated

Teardown

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages