AWS Deployment Guide

This guide details the AWS infrastructure deployed by this project and provides instructions on how to set it up and run the data pipelines.

🏛️ AWS Architecture Overview

The Terraform configuration deploys a complete data platform on AWS. The architecture is designed to ingest data from our three local generators:

  1. Clickstream (Real-time): The mcdp-clickstream generator sends JSON events to an API Gateway endpoint. These are streamed through Kinesis Firehose, transformed by a Lambda function, and landed in S3 as an Apache Iceberg table.
  2. Transactional (CDC): The local Debezium/Kafka stack captures changes from MySQL. A Kafka S3 Sink connector writes these CDC events to the raw/cdc/ prefix in S3.
  3. Ad Spend (Batch): The mcdp-ad-spend generator writes daily CSV reports directly to the raw/ad_spend/ prefix in S3.

Once in S3, Glue Crawlers catalog the CDC and Ad Spend data. Lake Formation governs access for different user personas. Athena is used for querying, and dbt (orchestrated by Dagster) transforms the raw data into staging and mart layers.

Full AWS Architecture Diagram
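As a concrete illustration of the real-time path, the sketch below shows roughly what a single clickstream event could look like when POSTed to the API Gateway endpoint. It is illustrative only: the field names and the CLICKSTREAM_API_URL variable are assumptions, not the exact schema used by mcdp-clickstream.

```python
# Sketch only: POST one clickstream-style event to the ingestion endpoint.
# CLICKSTREAM_API_URL and the event fields below are illustrative assumptions.
import json
import os
import urllib.request
import uuid
from datetime import datetime, timezone

endpoint = os.environ["CLICKSTREAM_API_URL"]  # e.g. https://<api-id>.execute-api.<region>.amazonaws.com/prod/events

event = {
    "event_id": str(uuid.uuid4()),
    "event_type": "page_view",
    "user_id": "user-123",
    "event_timestamp": datetime.now(timezone.utc).isoformat(),
}

request = urllib.request.Request(
    endpoint,
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.status)  # a 2xx status means the endpoint accepted the event
```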


💸 Managing AWS Costs

Important: This is a playground, not a 24/7 service. To avoid unexpected charges, always run make infra-destroy as soon as you are finished with your experiments, especially at the end of the day.

This platform is designed to be very low-cost when used for typical development and learning (i.e., running for a few hours at a time).

  • Realistic Cost: In practice, running experiments intermittently for a month has resulted in charges as low as ~$2/month, even without the AWS Free Tier.
  • High-Volume Estimate: A continuous, high-load estimate was created with the AWS Pricing Calculator. It assumed the platform ran 24/7 for 30 days with a heavy clickstream load (20 events/second) and came to ~$160/month.
  • Main Cost Driver: The vast majority of the high-end estimate (~$140) comes from the API Gateway.

You can control costs by:

  1. Running make infra-destroy when you're done.
  2. Reducing the event frequency of the mcdp-clickstream generator or running it for shorter periods.

🚀 Deployment & Usage

Follow these steps to configure your environment, deploy the cloud infrastructure, and run the data pipelines.

Prerequisites

Before you begin, ensure you have the following tools installed and configured (all of them are used in the steps below):

  • An AWS account and the AWS CLI, configured with a named profile (for example an SSO profile).
  • Terraform (invoked by make infra-apply and make infra-destroy).
  • Docker and Docker Compose (for the local MySQL/Kafka/Debezium stack).
  • make and Python (for the data generators, dbt, and Dagster).

1. Configure Your Environment

The Makefile relies on a .env file for all configuration.

  1. Copy the example file:
    cp .env.example .env
  2. Edit the new .env file and provide your AWS profile name and region:
    AWS_PROFILE=my-sso-profile
    AWS_REGION=eu-central-1
    Note: These are the only values you need to provide manually.

2. Deploy the AWS Infrastructure

This step uses Terraform to create all the AWS resources (S3, API Gateway, Firehose, Glue, etc.).

make infra-apply

This command does three key things:

  • Applies Terraform: Deploys all cloud resources.
  • Populates .env: After a successful run, it automatically queries the Terraform outputs and adds the live S3 bucket name, API endpoint, etc., to your .env file.
  • Generates dbt Profile: It also uses the dbt/.dbt/profiles.yml.j2 template to create your final ~/.dbt/profiles.yml file, pointing dbt and dagster to your new Athena instance.
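To make the .env population step concrete, the sketch below shows one way the Terraform outputs could be read and appended to .env. It is illustrative only; the Makefile already does this for you, and the exact output names are project-specific.

```python
# Sketch only: read Terraform outputs and append them to .env.
# The real automation lives in the Makefile; output names are assumptions.
import json
import subprocess

result = subprocess.run(
    ["terraform", "-chdir=infra/aws", "output", "-json"],
    check=True, capture_output=True, text=True,
)
outputs = json.loads(result.stdout)

with open(".env", "a") as env_file:
    for name, output in outputs.items():
        env_file.write(f"{name.upper()}={output['value']}\n")
```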

3. Start the Local CDC Stack

With the cloud infrastructure ready, start the local Docker containers for MySQL, Kafka, and Debezium.

# Starts the docker-compose stack
make cdc-stack-up

# On your first run, you MUST register the connectors
make register-connectors
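Under the hood, registering a connector is an HTTP call to the Kafka Connect REST API. The sketch below is illustrative only; the connector name and configuration values are assumptions, and the full configuration ships with the repository.

```python
# Sketch only: register a Debezium MySQL source connector with Kafka Connect.
# Connector name and config values are illustrative; see the repo's connector JSON.
import json
import urllib.request

connector = {
    "name": "mysql-cdc-source",  # hypothetical name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        # remaining Debezium settings omitted here
    },
}

request = urllib.request.Request(
    "http://localhost:8083/connectors",  # default Kafka Connect REST port
    data=json.dumps(connector).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(response.status, response.read().decode())
```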

4. Run the Data Generators

Now you can start one or all of the data generators to send data to AWS:

# Run the Transactional (CDC) Stream
# (Streams live order changes from MySQL to Kafka/S3)
make transation-stream

# Run the Clickstream (Real-time)
# (Starts sending events to the live API Gateway endpoint)
make clickstream

# Run the Ad Spend (Batch)
# (Generates a CSV for today's date and uploads it to S3)
make ad-spend

Note: The S3 sink connector is configured to flush events from Kafka to S3 every 5 minutes, so you may need to wait a few minutes for the first batch of CDC data to appear in S3.
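If you want to confirm that the first CDC batch has landed, a quick object listing works well. The snippet below is only a sketch; DATA_LAKE_BUCKET is an assumed variable name, so check your .env for the real one written by make infra-apply.

```python
# Sketch only: list objects under the raw/cdc/ prefix to confirm data has landed.
# DATA_LAKE_BUCKET is an assumed variable name; check your .env for the real one.
import os
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=os.environ["DATA_LAKE_BUCKET"], Prefix="raw/cdc/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```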

5. Catalog the Raw Data (Important!)

The raw CDC and Ad Spend data is now in S3, but it is not yet visible in Athena.

The Glue Crawlers are configured to run on a 6-hour schedule. To see your data immediately, you must run them manually:

  1. Go to the AWS Glue Console in your region.
  2. Navigate to Crawlers.
  3. Select the mcdp-cdc-crawler and mcdp-ad-spend-crawler (or similar names).
  4. Click Run.

After a minute or two, the crawlers will finish, and the prod_raw database will be populated with tables, making them queryable in Athena.
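If you prefer not to click through the console, the crawlers can also be started from code. This is only a sketch; confirm the crawler names in the Glue console before running it.

```python
# Sketch only: start both Glue crawlers and wait for them to finish.
# Crawler names are taken from the guide above and may differ in your deployment.
import time
import boto3

crawlers = ["mcdp-cdc-crawler", "mcdp-ad-spend-crawler"]
glue = boto3.client("glue")

for name in crawlers:
    glue.start_crawler(Name=name)

time.sleep(10)  # give the crawlers a moment to enter the RUNNING state

# Poll until both crawlers return to the READY state
while any(glue.get_crawler(Name=name)["Crawler"]["State"] != "READY" for name in crawlers):
    time.sleep(15)

print("Crawlers finished; prod_raw tables should now be queryable in Athena.")
```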

6. View the Data Pipeline in Dagster

Now that the pipeline is running, you can explore the dbt models and jobs in the Dagster UI.

# Starts the Dagster UI
make dagster-dev

Open http://localhost:3000 in your browser. You can explore the lineage of all the dbt models and see the defined jobs.

7. Clean Up

To tear down the resources:

# Destroy all AWS resources managed by Terraform
make infra-destroy

# Stop and remove the local Docker stack
make cdc-stack-down

# [optional] To stop, remove containers, AND clear all Docker volumes:
make cdc-stack-down-clean

🔐 Security & Governance: IAM & Lake Formation

A key feature of this playground is its pre-configured data governance model using IAM and Lake Formation. This demonstrates how to secure a data lake for different user personas.

We define three primary personas:

  • Data Engineer Role: This role has full access. It can manage S3, run Glue crawlers, and has write access to create staging and mart tables. It is also empowered to manage Lake Formation tags (e.g., tag data as sensitive).
  • Data Analyst Role: This role is for consumption only. It has read-only access and is confined to its own Athena workgroup. Crucially, its permissions are restricted by Lake Formation tags.
  • Lake Formation Admin: This is the high-privilege role used by Terraform itself to register the S3 bucket, create databases, and set up the initial permission policies.

How it Works: LF-Tags

  1. Tag Definition: Terraform creates a Lake Formation tag called si (for "Sensitive Information") with a default value of false.
  2. Permission Policy: The Data Analyst role is granted SELECT access only on tables and columns that do not have the si tag (or where si is false).
  3. dbt Integration: The dbt models (like dim_customer) are configured to apply these tags. The model will tag sensitive columns (e.g., PII) as si=true.
  4. Enforcement: When an analyst queries dim_customer from Athena, Lake Formation automatically enforces this policy, returning NULL for (or blocking access to) the sensitive columns they are not authorized to see.
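The same mechanism, expressed imperatively, looks roughly like the sketch below. The project configures this through Terraform rather than boto3, and the analyst role ARN is a placeholder.

```python
# Sketch only: define the "si" LF-tag and grant tag-based SELECT to an analyst role.
# The repo does this in Terraform; the principal ARN below is a placeholder.
import boto3

lakeformation = boto3.client("lakeformation")

# 1. Define the "si" (sensitive information) tag and its allowed values
lakeformation.create_lf_tag(TagKey="si", TagValues=["true", "false"])

# 2. Grant the analyst role SELECT only on tables/columns where si = false
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/data-analyst"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "si", "TagValues": ["false"]}],
        }
    },
    Permissions=["SELECT"],
)
```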

⚙️ AWS Infrastructure Details (Terraform)

The infra/aws/ directory contains all the Terraform modules for this stack.

  • S3 Data Lake: A single S3 bucket is provisioned to serve as the landing zone for all raw data (raw/cdc/, raw/ad_spend/) and the home for the Iceberg-based lakehouse tables.
  • Clickstream Ingestion Pipeline: This is a serverless pipeline:
    • API Gateway: Provides a public HTTPS endpoint (/prod/events).
    • Kinesis Firehose: Buffers incoming records, compresses them (GZIP), and writes them to S3.
    • Lambda: A simple Python function attached to Firehose that validates the incoming JSON (see the sketch after this list).
    • Iceberg Table: Firehose is configured to write data in Apache Iceberg format, partitioned by event day.
  • Glue Crawlers: Two crawlers are defined: one for the raw/cdc/ prefix and one for raw/ad_spend/. They are scheduled to run every six hours (but can be run manually for immediate results).
  • Athena Workgroups: Three workgroups are created (engineers, analysts, engine-v3), each with its own S3 location for query results. The analysts workgroup is locked down so that analysts cannot use it to bypass Lake Formation policies.
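For reference, a Firehose transformation Lambda follows a fixed request/response contract: it receives base64-encoded records and must return each recordId with a result of Ok, Dropped, or ProcessingFailed. The sketch below illustrates that contract; the validation rule itself is an assumption, not this project's exact logic.

```python
# Sketch only: the general shape of a Kinesis Firehose transformation Lambda.
# The actual validation performed by this project's function may differ.
import base64
import json


def handler(event, context):
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Illustrative check: require an "event_type" field to keep the record
            result = "Ok" if "event_type" in payload else "Dropped"
            data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("utf-8")
        except (ValueError, KeyError):
            result, data = "ProcessingFailed", record["data"]
        output.append({"recordId": record["recordId"], "result": result, "data": data})
    return {"records": output}
```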

🔄 Data Transformation (dbt)

The dbt/aws/ project is configured to run against Athena. The ~/.dbt/profiles.yml is generated automatically by make infra-apply.

  • Sources: Reads from the prod_raw schema (populated by the Glue crawlers).
  • Staging: Creates simple views (staging schema) to clean up and cast types.
  • Marts: Materializes tables in the marts schema as Iceberg tables.
    • dim_customer and dim_product are SCD Type 2 (Slowly Changing Dimensions) tables. They use a custom cdc_to_scd2 macro to handle the Debezium CDC payload and maintain a history of changes.
    • fct_orders is a fact table.

📈 Orchestration (Dagster)

The orchestration/dg-aws/ project provides a Dagster workspace to schedule and run the dbt project.

  • It uses a DbtCliResource to execute dbt build commands.
  • It reads the manifest.json from the dbt project to automatically define Dagster assets for every dbt model (see the sketch below).
  • It creates a daily schedule (@daily) for each dbt job, allowing for automated, daily builds of the data marts.
  • You can launch the local UI with make dagster-dev.
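The wiring typically looks like the sketch below. It follows the standard dagster-dbt pattern; the paths, asset names, and job names are assumptions rather than the project's exact definitions.

```python
# Sketch only: load dbt models as Dagster assets and schedule a daily build.
# Paths, asset names, and job names are illustrative.
from pathlib import Path

from dagster import Definitions, ScheduleDefinition, define_asset_job
from dagster_dbt import DbtCliResource, dbt_assets

DBT_PROJECT_DIR = Path("dbt/aws")


@dbt_assets(manifest=DBT_PROJECT_DIR / "target" / "manifest.json")
def mcdp_dbt_assets(context, dbt: DbtCliResource):
    # Run `dbt build` and stream results back to Dagster as asset materializations
    yield from dbt.cli(["build"], context=context).stream()


daily_dbt_job = define_asset_job("daily_dbt_build")

defs = Definitions(
    assets=[mcdp_dbt_assets],
    resources={"dbt": DbtCliResource(project_dir=str(DBT_PROJECT_DIR))},
    jobs=[daily_dbt_job],
    schedules=[ScheduleDefinition(job=daily_dbt_job, cron_schedule="0 0 * * *")],
)
```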

🛠️ Local Development & Testing

The repository is set up for local development and quality checks, primarily via the Makefile.

  • Linting & Formatting: make lint and make fmt (uses Ruff and MyPy).
  • Unit Tests: make test runs pytest on the core Python code (e.g., data generators, Lambda transform).
  • Smoke Tests: make smoke runs tests tagged with smoke. These are longer-running tests that interact with live services, such as spinning up the Docker stack to test the CDC commands.
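Assuming the smoke tag is implemented as a pytest marker (the usual approach behind a target like make smoke), a smoke test might be tagged as in the sketch below; the test body is a placeholder, not the repository's actual test.

```python
# Sketch only: a pytest test tagged as "smoke" so it can be selected by `make smoke`
# (e.g. via `pytest -m smoke`). The assertion body is a placeholder.
import urllib.request

import pytest


@pytest.mark.smoke
def test_kafka_connect_is_reachable():
    # A real smoke test would start the Docker stack first (make cdc-stack-up)
    with urllib.request.urlopen("http://localhost:8083/connectors") as response:
        assert response.status == 200
```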