Analyticsphere/flattener-orchestrator
# flattener-orchestrator

An Airflow DAG that orchestrates the flattener pipeline for the Connect for Cancer Prevention study.

## Overview

The DAG runs daily at 9:30 AM Eastern and processes Connect study tables through a multi-step flattening pipeline. It calls the flattener Cloud Run service via authenticated HTTP requests.

## Pipeline Steps

1. **Health check** -- verify the flattener service is up (`GET /heartbeat`)
2. **Firestore backup** (optional, parameter-driven) -- trigger a Firestore data refresh
3. **Export to Parquet** -- export each selected BigQuery table to Parquet in GCS
4. **Convert** -- decode BLOB columns to STRUCT (`boxes` table only)
5. **Flatten** -- flatten nested Parquet into wide Parquet files via DuckDB
6. **Load to BigQuery** -- load flattened Parquet files back into BigQuery
7. **Profile history** -- generate the user profile history table directly in BigQuery

Steps 3-6 run in parallel across all selected tables using Airflow dynamic task mapping. Step 7 runs once after all tables are loaded.
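The fan-out and fan-in described above can be sketched in plain Python. This is an illustration only: the table names and step labels are assumed placeholders, and in the real DAG steps 3-6 run in parallel via Airflow dynamic task mapping rather than a sequential loop.

```python
# Sketch of the pipeline's per-table fan-out (steps 3-6) and the
# fan-in to the profile history step (step 7). Table names are
# illustrative, not taken from the DAG source.
TABLES = ["participants", "module1", "boxes"]  # assumed table list

def run_table_pipeline(table: str) -> list[str]:
    """Run steps 3-6 for a single table and return the step log."""
    log = [f"export:{table}"]           # 3. Export to Parquet
    if table == "boxes":
        log.append(f"convert:{table}")  # 4. BLOB -> STRUCT (boxes only)
    log.append(f"flatten:{table}")      # 5. Flatten via DuckDB
    log.append(f"load:{table}")         # 6. Load to BigQuery
    return log

def run_pipeline(tables: list[str]) -> list[str]:
    log = ["heartbeat"]            # 1. Health check runs first
    for table in tables:           # fan-out: one branch per table
        log.extend(run_table_pipeline(table))
    log.append("profile_history")  # 7. Runs once, after all loads
    return log
```

In Airflow itself this shape is typically expressed by calling `.expand()` on the per-table tasks and making the profile-history task downstream of the mapped load task.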

## Configuration

The DAG accepts two runtime parameters:

- `trigger_firestore_backup` (boolean, default: `false`) -- trigger a Firestore backup before processing
- `tables_to_process` (array of strings, default: all tables) -- select which tables to include

### Environment Variables

| Variable | Description |
| --- | --- |
| `FLATTENER_PROCESSOR_ENDPOINT` | Base URL of the flattener Cloud Run service |
| `GCP_PROJECT_ID` | Google Cloud project ID |
| `GCS_FLATTENED_BUCKET` | GCS bucket for intermediate Parquet files |
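Since all three variables are required, one common pattern is to validate them up front and fail fast with a clear error. A minimal sketch, assuming a hypothetical `load_config` helper that is not part of the repository:

```python
import os

# The three variables documented in the table above.
REQUIRED_VARS = (
    "FLATTENER_PROCESSOR_ENDPOINT",
    "GCP_PROJECT_ID",
    "GCS_FLATTENED_BUCKET",
)

def load_config(env=os.environ) -> dict:
    """Return the required settings, raising if any are unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Accepting the environment as a parameter (defaulting to `os.environ`) keeps the helper easy to test without mutating the process environment.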
