This guide details the AWS infrastructure deployed by this project and provides instructions on how to set it up and run the data pipelines.
The Terraform configuration deploys a complete data platform on AWS. The architecture is designed to ingest data from our three local generators:
- Clickstream (Real-time): The `mcdp-clickstream` generator sends JSON events to an API Gateway endpoint. These are streamed through Kinesis Firehose, transformed by a Lambda function, and landed in S3 as an Apache Iceberg table.
- Transactional (CDC): The local Debezium/Kafka stack captures changes from MySQL. A Kafka S3 Sink connector writes these CDC events to the `raw/cdc/` prefix in S3.
- Ad Spend (Batch): The `mcdp-ad-spend` generator writes daily CSV reports directly to the `raw/ad_spend/` prefix in S3.
Once in S3, Glue Crawlers catalog the CDC and Ad Spend data. Lake Formation governs access for different user personas. Athena is used for querying, and dbt (orchestrated by Dagster) transforms the raw data into staging and mart layers.
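For orientation, each clickstream record is just a small JSON document POSTed over HTTPS to the API Gateway endpoint. A hand-rolled test could look like the following sketch (the environment variable name and event fields are illustrative assumptions, not the generator's actual schema; the real endpoint URL is written to `.env` by `make infra-apply`):

```bash
# Illustrative only: $CLICKSTREAM_API_URL and the JSON fields are assumptions,
# not the project's actual variable name or event schema.
curl -X POST "$CLICKSTREAM_API_URL" \
  -H "Content-Type: application/json" \
  -d '{"event_type": "page_view", "user_id": "u-123", "ts": "2024-01-01T12:00:00Z"}'
```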
Important: This is a playground, not a 24/7 service. To avoid unexpected charges, always run `make infra-destroy` as soon as you are finished with your experiments, especially at the end of the day.
This platform is designed to be very low-cost when used for typical development and learning (i.e., running for a few hours at a time).
- Realistic Cost: In practice, running experiments intermittently for a month has resulted in charges as low as ~$2/month, even without the AWS Free Tier.
- High-Volume Estimate: A continuous, high-load estimate was created using the AWS Calculator. This assumed the platform ran 24/7 for 30 days with a heavy clickstream load (20 events/second) and came to ~$160/month.
- Main Cost Driver: The vast majority of the high-end estimate (~$140) comes from the API Gateway.
You can control costs by:
- Running `make infra-destroy` when you're done.
- Reducing the event frequency of the `mcdp-clickstream` generator or running it for shorter periods.
Follow these steps to configure your environment, deploy the cloud infrastructure, and run the data pipelines.
Before you begin, ensure you have the following tools installed and configured:
- AWS CLI: Configured with credentials.
- Terraform: 1.5+
- Docker & Docker Compose
- Python: 3.9+ (with `uv` installed)
The Makefile relies on a .env file for all configuration.
- Copy the example file:

  ```bash
  cp .env.example .env
  ```

- Edit the new `.env` file and provide your AWS profile name and region:

  ```
  AWS_PROFILE=my-sso-profile
  AWS_REGION=eu-central-1
  ```

  Note: These are the only values you need to provide manually.
This step uses Terraform to create all the AWS resources (S3, API Gateway, Firehose, Glue, etc.).
```bash
make infra-apply
```

This command does three key things:

- Applies Terraform: Deploys all cloud resources.
- Populates `.env`: After a successful run, it automatically queries the Terraform outputs and adds the live S3 bucket name, API endpoint, etc., to your `.env` file (a rough sketch of this step follows the list).
- Generates dbt Profile: It also uses the `dbt/.dbt/profiles.yml.j2` template to create your final `~/.dbt/profiles.yml` file, pointing `dbt` and `dagster` to your new Athena instance.
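Under the hood, populating `.env` is roughly equivalent to reading Terraform outputs and appending them to the file, along these lines (a sketch only; the output names below are assumptions, not the project's actual names):

```bash
# Sketch only: the output names are assumptions; see the Makefile for the real logic.
echo "S3_BUCKET_NAME=$(terraform -chdir=infra/aws output -raw s3_bucket_name)" >> .env
echo "CLICKSTREAM_API_URL=$(terraform -chdir=infra/aws output -raw clickstream_api_url)" >> .env
```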
With the cloud infrastructure ready, start the local Docker containers for MySQL, Kafka, and Debezium.
```bash
# Starts the docker-compose stack
make cdc-stack-up

# On your first run, you MUST register the connectors
make register-connectors
```

Now you can start one or all of the data generators to send data to AWS:
```bash
# Run the Transactional (CDC) Stream
# (Streams live order changes from MySQL to Kafka/S3)
make transation-stream

# Run the Clickstream (Real-time)
# (Starts sending events to the live API Gateway endpoint)
make clickstream

# Run the Ad Spend (Batch)
# (Generates a CSV for today's date and uploads it to S3)
make ad-spend
```

Note: The S3 sink connector is configured to flush events from Kafka to S3 every 5 minutes, so you may need to wait a few minutes for the first batch of CDC data to appear in S3.
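After the generators have been running for a few minutes, you can spot-check that data is landing in the raw prefixes (the bucket name below is a placeholder; use the value `make infra-apply` wrote to your `.env`):

```bash
# Replace <your-bucket> with the bucket name from .env
# (add --profile <your-profile> if you are not using a default AWS profile)
aws s3 ls s3://<your-bucket>/raw/cdc/ --recursive
aws s3 ls s3://<your-bucket>/raw/ad_spend/ --recursive
```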
The raw CDC and Ad Spend data is now in S3, but it is not yet visible in Athena.
The Glue Crawlers are configured to run on a 6-hour schedule. To see your data immediately, you must run them manually:
- Go to the AWS Glue Console in your region.
- Navigate to Crawlers.
- Select the `mcdp-cdc-crawler` and `mcdp-ad-spend-crawler` (or similar names).
- Click Run.
After a minute or two, the crawlers will finish, and the `prod_raw` database will be populated with tables, making them queryable in Athena.
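If you prefer the CLI over the console, the same crawlers can be started directly (crawler names as listed above; adjust if your deployment uses different names):

```bash
aws glue start-crawler --name mcdp-cdc-crawler
aws glue start-crawler --name mcdp-ad-spend-crawler
```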
Now that the pipeline is running, you can explore the dbt models and jobs in the Dagster UI.
```bash
# Starts the Dagster UI
make dagster-dev
```

Open http://localhost:3000 in your browser. You can explore the lineage of all the dbt models and see the defined jobs.
To tear down the resources:
```bash
# Destroy all AWS resources managed by Terraform
make infra-destroy

# Stop and remove the local Docker stack
make cdc-stack-down

# [optional] To stop, remove containers, AND clear all Docker volumes:
make cdc-stack-down-clean
```

A key feature of this playground is its pre-configured data governance model using IAM and Lake Formation. This demonstrates how to secure a data lake for different user personas.
We define three primary personas:
- Data Engineer Role: This role has full access. It can manage S3, run Glue crawlers, and has write access to create staging and mart tables. It is also empowered to manage Lake Formation tags (e.g., tag data as sensitive).
- Data Analyst Role: This role is for consumption only. It has read-only access and is confined to its own Athena workgroup. Crucially, its permissions are restricted by Lake Formation tags.
- Lake Formation Admin: This is the high-privilege role used by Terraform itself to register the S3 bucket, create databases, and set up the initial permission policies.
- Tag Definition: Terraform creates a Lake Formation tag called `si` (for "Sensitive Information") with a default value of `false`.
- Permission Policy: The Data Analyst role is granted `SELECT` access only on tables and columns that do not have the `si` tag (or where `si` is `false`).
- dbt Integration: The `dbt` models (like `dim_customer`) are configured to apply these tags. The model will tag sensitive columns (e.g., PII) as `si=true`.
- Enforcement: When an analyst queries `dim_customer` from Athena, Lake Formation automatically enforces this policy, returning `NULL` for (or blocking access to) the sensitive columns they are not authorized to see.
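To verify the governance setup after `make infra-apply`, you can inspect the `si` tag with the Lake Formation CLI (a quick check, assuming the tag key described above):

```bash
# Show the "si" LF-Tag and its allowed values
aws lakeformation get-lf-tag --tag-key si
```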
The `infra/aws/` directory contains all the Terraform modules for this stack.
- S3 Data Lake: A single S3 bucket is provisioned to serve as the landing zone for all raw data (`raw/cdc/`, `raw/ad_spend/`) and the home for the Iceberg-based lakehouse tables.
- Clickstream Ingestion Pipeline: This is a serverless pipeline:
  - API Gateway: Provides a public HTTPS endpoint (`/prod/events`).
  - Kinesis Firehose: Buffers incoming records, compresses them (GZIP), and writes them to S3.
  - Lambda: A simple Python function attached to Firehose that validates the incoming JSON.
  - Iceberg Table: Firehose is configured to write data in Apache Iceberg format, partitioned by event day.
- Glue Crawlers: Two crawlers are defined: one for the `raw/cdc/` prefix and one for `raw/ad_spend/`. They are scheduled to run every six hours (but can be run manually for immediate results).
- Athena Workgroups: Three workgroups are created (`engineers`, `analysts`, `engine-v3`) with distinct S3 locations for query results. The `analysts` workgroup is locked down to prevent them from bypassing Lake Formation policies.
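Once the crawlers have populated `prod_raw`, the raw tables can also be queried from the CLI in the `engineers` workgroup; for example (a sketch; the table name is illustrative, so check the Glue Data Catalog for the actual names the crawlers created):

```bash
# Table name is illustrative; see the Glue Data Catalog for actual names.
aws athena start-query-execution \
  --work-group engineers \
  --query-string "SELECT * FROM prod_raw.ad_spend LIMIT 10"
```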
The `dbt/aws/` project is configured to run against Athena. The `~/.dbt/profiles.yml` is generated automatically by `make infra-apply`.
- Sources: Reads from the `prod_raw` schema (populated by the Glue crawlers).
- Staging: Creates simple views (`staging` schema) to clean up and cast types.
- Marts: Materializes tables in the `marts` schema as Iceberg tables. `dim_customer` and `dim_product` are SCD Type 2 (Slowly Changing Dimensions) tables. They use a custom `cdc_to_scd2` macro to handle the Debezium CDC payload and maintain a history of changes. `fct_orders` is a fact table.
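If you want to build a mart outside of Dagster, you can run dbt directly against the generated profile; for example, to rebuild the SCD2 customer dimension and its upstream models (assuming the `dbt/aws` project path described above):

```bash
# Build dim_customer plus all of its upstream models
dbt build --select +dim_customer --project-dir dbt/aws
```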
The orchestration/dg-aws/ project provides a Dagster workspace to schedule and run the dbt project.
- It uses a `DbtCliResource` to execute `dbt build` commands.
- It reads the `manifest.json` from the `dbt` project to automatically define Dagster assets for every `dbt` model.
- It creates a daily schedule (`@daily`) for each `dbt` job, allowing for automated, daily builds of the data marts.
- You can launch the local UI with `make dagster-dev`.
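If Dagster complains about a missing or stale `manifest.json`, you can regenerate it with dbt before starting the UI (a sketch, assuming the `dbt/aws` project path and that your setup expects a pre-built manifest):

```bash
# Re-parse the dbt project to refresh target/manifest.json for Dagster
dbt parse --project-dir dbt/aws
```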
The repository is set up for local development and quality checks, primarily via the Makefile.
- Linting & Formatting: `make lint` and `make fmt` (uses Ruff and MyPy).
- Unit Tests: `make test` runs `pytest` on the core Python code (e.g., data generators, Lambda transform).
- Smoke Tests: `make smoke` runs tests tagged with `smoke`. These are longer-running tests that interact with live services, such as spinning up the Docker stack to test the CDC commands.
