Skip to content

Rahul06x1/triplake

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TripLake

A cost-efficient BigQuery lakehouse over public NYC taxi data

CI License: MIT

TripLake ingests large public NYC TLC trip-record Parquet files into a Cloud Storage landing zone, exposes them as BigQuery external tables (schema-on-read), and builds incremental, partitioned native tables with dbt — producing marts like trips & revenue by zone/day and tipping behaviour by hour. A self-contained portfolio project focused on batch ingestion at scale and BigQuery cost/performance engineering.

Standalone learning project. The only source is the public TLC dataset (no auth).


Architecture

NYC TLC Parquet (monthly, public)
        │   Cloud Scheduler (monthly) ─► Cloud Workflow
        ▼
  LOADER (Cloud Run job, argparse CLI)
   • idempotent download → GCS landing (Hive layout: year=/month=)
        │
        ▼
  GCS landing  gs://…-triplake-landing/yellow/year=YYYY/month=MM/*.parquet
        │
        ▼
  BigQuery EXTERNAL table ext.yellow_trips   (schema-on-read, partition pruning)
        │   dbt (Cloud Run job): incremental MERGE
        ▼
  stage.trips_stage  (native, partitioned by pickup_date, clustered by payment_type)
        │
        ▼
  marts.trips_daily   ·   marts.tips_by_hour
Observability: dbt data-quality tests (tag:trips) + alert on job error logs

Two batch jobs, one workflow:

  • loader--month YYYY-MM or --start/--end backfill; skips existing objects (idempotent); streams Parquet straight to GCS.
  • dbt — external table → typed/deduped incremental trips_stage → marts; runs and tests tag:trips.

Tech stack

Layer Technology
Ingestion Python 3.12 Cloud Run job (batch, argparse CLI)
Lake Cloud Storage (Hive-partitioned Parquet, lifecycle tiering)
Warehouse BigQuery external + native partitioned/clustered tables
Transformation dbt (incremental MERGE), Cloud Run job
Orchestration Cloud Workflows + Cloud Scheduler
IaC Terraform (GCS remote state)
CI/CD GitHub Actions (matrix build) + Workload Identity Federation
Tooling ruff, pytest

Repository layout

triplake/
├── loader/   # Cloud Run job: TLC Parquet -> GCS landing (idempotent, backfill)
├── dbt/      # external source -> incremental stage -> marts, runner.sh, Dockerfile
├── infra/terraform/  # bucket+lifecycle, external table, run jobs, workflow, monitoring
└── .github/  # CI (lint/test + dbt parse + tf validate), deploy (WIF), dependabot

Run locally

# loader
cd loader && pip install -e . pytest ruff && ruff check . && pytest -q
export LANDING_BUCKET=your-bucket
python main.py --dataset yellow --month 2024-01           # one month
python main.py --dataset yellow --start 2023-01 --end 2023-12   # backfill

# dbt (needs: gcloud auth application-default login)
cd dbt && export GCP_PROJECT=your-gcp-project
dbt deps --profiles-dir .
dbt build --profiles-dir . --select tag:trips             # run + test

Deploy

  1. Create a GCP project; enable Cloud Run, BigQuery, Cloud Storage, Workflows, Scheduler, Artifact Registry. Create an Artifact Registry repo triplake, a Terraform-state bucket, and a Workload Identity Federation pool bound to this repo.
  2. Set repo Actions variables: GCP_PROJECT_ID, WIF_PROVIDER, DEPLOYER_SA, TFSTATE_BUCKET.
  3. Push to maindeploy.yaml builds both images (matrix) and runs terraform plan/apply. Run a backfill via the loader job with --start/--end overrides.

Design notes

  • Schema-on-read first — land raw Parquet and query it via an external table; promote to native tables only where performance/cost warrants it.
  • Hive partitioningyear=/month= layout lets BigQuery prune external scans.
  • Incremental MERGEtrips_stage only scans partitions newer than its current max and upserts on a surrogate trip_key; safe to re-run.
  • Partition + cluster — native tables are partitioned by pickup_date and clustered, so mart queries scan minimal bytes.
  • Idempotent backfill — the loader skips objects already present; re-running a range is free and safe.
  • Cost-aware — GCS lifecycle tiers data to Nearline/Coldline; dirty out-of-range timestamps are filtered to avoid junk partitions; terraform destroy when idle.
  • Quality gates — dbt tests (uniqueness, ranges, combinations) fail the job on bad data.

Key decisions & what I learned

  • Schema-on-read first. Land raw Parquet and query it via an external table; promote to native partitioned tables only where performance/cost justifies it. Lesson: don't pay to materialize data you might not query.
  • Hive-partitioned layout (year=/month=) so BigQuery prunes external scans by date — partitioning isn't only for native tables.
  • Incremental MERGE on a surrogate key with a "scan only newer partitions" predicate, so re-runs over millions of rows stay cheap. Lesson: incremental dbt at scale is about bounding the scan, not just the write.
  • Idempotent backfill. The loader skips objects already in GCS, so re-running any month range is free and safe.
  • Filter dirty data early. Real TLC files carry out-of-range timestamps that would create thousands of junk partitions — I caught this by inspecting the actual Parquet (2.96 M rows/month, column names verified) and added a sane date guard.

Skills demonstrated

Lakehouse design · GCS lifecycle/tiering · BigQuery external & partitioned/clustered tables · partition pruning & query-cost optimization · dbt incremental models (MERGE) · batch processing on Cloud Run jobs · idempotent backfills · data-quality testing · Terraform with remote state · matrix CI/CD with Workload Identity Federation · observability.


Teardown

cd infra/terraform && terraform destroy

About

Cost-efficient BigQuery lakehouse over public NYC taxi data — GCS, external tables, incremental dbt, Cloud Run, Terraform & GitHub Actions

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors