An Airflow DAG that orchestrates the flattener pipeline for the Connect for Cancer Prevention study.
The DAG runs daily at 9:30 AM Eastern and processes Connect study tables through a multi-step flattening pipeline. It calls the flattener Cloud Run service via authenticated HTTP requests.
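Each pipeline step is an HTTP call to that service. A minimal sketch of what an ID-token-authenticated Cloud Run call could look like, assuming the `google-auth` and `requests` libraries; `call_flattener` is a hypothetical helper, not the DAG's own code:

```python
import google.auth.transport.requests
import google.oauth2.id_token
import requests


def call_flattener(base_url: str, path: str) -> requests.Response:
    # Mint an ID token with the service URL as audience -- the standard
    # pattern for invoking an authenticated Cloud Run service.
    auth_req = google.auth.transport.requests.Request()
    token = google.oauth2.id_token.fetch_id_token(auth_req, base_url)
    resp = requests.get(
        f"{base_url}{path}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=300,
    )
    resp.raise_for_status()
    return resp


# e.g. the health check (step 1): call_flattener(endpoint, "/heartbeat")
```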
1. Health check -- verify the flattener service is up (`GET /heartbeat`)
2. Firestore backup -- (optional, parameter-driven) trigger a Firestore data refresh
3. Export to Parquet -- export each selected BigQuery table to Parquet in GCS
4. Convert -- decode BLOB columns to STRUCT (`boxes` table only)
5. Flatten -- flatten nested Parquet into wide Parquet files via DuckDB
6. Load to BigQuery -- load flattened Parquet files back into BigQuery
7. Profile history -- generate the user profile history table directly in BigQuery
Steps 3-6 run in parallel across all selected tables using Airflow dynamic task mapping. Step 7 runs once after all tables are loaded.
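A minimal sketch of that fan-out with TaskFlow-style dynamic task mapping; the task bodies, bucket path, and table names are illustrative placeholders, not the DAG's actual code:

```python
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def mapping_sketch():
    @task
    def export_to_parquet(table: str) -> str:
        # Step 3: export one BigQuery table to Parquet; return its GCS prefix.
        return f"gs://flattened-bucket/exports/{table}/"

    @task
    def flatten(gcs_prefix: str) -> str:
        # Step 5: flatten the nested Parquet under this prefix.
        return gcs_prefix.replace("/exports/", "/flattened/")

    # .expand() creates one mapped task instance per selected table;
    # Airflow runs the instances in parallel and chains them one-to-one.
    exports = export_to_parquet.expand(table=["participants", "boxes"])
    flatten.expand(gcs_prefix=exports)


mapping_sketch()
```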
The DAG accepts two runtime parameters:
- `trigger_firestore_backup` (boolean, default: `false`) -- trigger a Firestore backup before processing
- `tables_to_process` (array of strings, default: all tables) -- select which tables to include
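A minimal sketch of how those parameters and the 9:30 AM Eastern schedule could be declared, assuming Airflow 2.x `Param` objects; the `dag_id` and the placeholder table list are hypothetical:

```python
import pendulum
from airflow.models.dag import DAG
from airflow.models.param import Param

with DAG(
    dag_id="connect_flattener",  # hypothetical name for this sketch
    schedule="30 9 * * *",  # daily at 9:30 AM in the start_date's timezone
    start_date=pendulum.datetime(2024, 1, 1, tz="America/New_York"),
    catchup=False,
    params={
        # Skip or run the Firestore backup (step 2) per DAG run.
        "trigger_firestore_backup": Param(False, type="boolean"),
        # Defaults to all tables; this short list is only a placeholder.
        "tables_to_process": Param(["participants", "boxes"], type="array"),
    },
):
    pass
```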
The DAG depends on the following configuration variables:

| Variable | Description |
|---|---|
| `FLATTENER_PROCESSOR_ENDPOINT` | Base URL of the flattener Cloud Run service |
| `GCP_PROJECT_ID` | Google Cloud project ID |
| `GCS_FLATTENED_BUCKET` | GCS bucket for intermediate Parquet files |
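Assuming these are Airflow Variables (the source does not say how they are stored), a task might read them like this:

```python
from airflow.models import Variable

# Deployment-specific settings, looked up at run time.
endpoint = Variable.get("FLATTENER_PROCESSOR_ENDPOINT")
project_id = Variable.get("GCP_PROJECT_ID")
bucket = Variable.get("GCS_FLATTENED_BUCKET")

# e.g. an intermediate location for step 3's exports (illustrative path)
export_prefix = f"gs://{bucket}/exports/"
```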