HSLU — Course - Data Warehouse and Data Lake
Cloud-based data lake on AWS that ingests satellite deforestation alerts, fire detections, weather data, and soybean prices from Brazil's Cerrado biome. Exposes an interactive map via a serverless API and a QuickSight dashboard with aggregated risk scores per farm, updated daily.
External APIs                AWS (Lambda + S3 + Athena)                                              Web
─────────────────            ──────────────────────────                                              ────
INPE / TerraBrasilis ──────► ingest_deter_cerrado ──► S3 raw ──► transform_deter_cerrado ──► S3 curated
NASA FIRMS           ──────► ingest_firms         ──► S3 raw ──► transform_firms
Open-Meteo           ──────► ingest_openmeteo     ──► S3 raw ──► transform_openmeteo
IPEADATA (soybean)   ──────► ingest_soybean       ──► S3 raw ──► transform_soybean
MapBiomas (manual)   ──────►                          S3 raw ──► transform_mapbiomas
Farms CSV (manual)   ──────►                          S3 raw ──► transform_farms
│
Athena (setup_athena + setup_gold)
│
┌───────────────────────────────┤
│ │
query_map Lambda ──► API Gateway ──► Browser (interactive map)
QuickSight ──────────────────────────────► Dashboard (farm risk scores, daily)
Each ingest Lambda runs on a schedule (daily / weekly). Transforms run 30 min later. Athena partitions are registered at 20:00 UTC daily; gold views refresh at 20:30 UTC. The QuickSight dashboard reads from the gold layer and shows per-farm aggregated scores (sustainability risk, climate risk, etc.) updated daily.
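
The schedule described above can be written as EventBridge cron expressions. The following is a minimal sketch, not the project's actual configuration: the helper names `daily_cron` and `offset_cron` and the 06:00 UTC ingest time are hypothetical, and only the 20:00 and 20:30 UTC times come from this README.

```python
from datetime import datetime, timedelta


def daily_cron(hour: int, minute: int = 0) -> str:
    """Return an EventBridge cron expression that fires once a day (UTC)."""
    # EventBridge field order: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour} * * ? *)"


def offset_cron(hour: int, minute: int, delay_min: int) -> str:
    """Daily cron expression shifted by delay_min minutes, wrapping past midnight."""
    shifted = datetime(2000, 1, 1, hour, minute) + timedelta(minutes=delay_min)
    return daily_cron(shifted.hour, shifted.minute)


# Hypothetical ingest time; the transform follows 30 minutes later.
ingest_schedule = daily_cron(6, 0)
transform_schedule = offset_cron(6, 0, 30)

# Times stated in this README.
register_partitions = daily_cron(20, 0)
refresh_gold_views = offset_cron(20, 0, 30)

if __name__ == "__main__":
    print(register_partitions, refresh_gold_views)
```

Deriving the transform schedule from the ingest schedule keeps the 30-minute gap in one place instead of in two hand-written cron strings.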
| Dataset | Source | Refresh |
|---|---|---|
| Deforestation alerts | INPE / TerraBrasilis DETER Cerrado | Daily |
| Fire detections | NASA FIRMS VIIRS NOAA-20 | Daily |
| Weather archive | Open-Meteo | Weekly |
| Soybean prices | IPEADATA | Daily |
| Soybean land use | MapBiomas | Manual (yearly release) |
| Farm boundaries | Static CSV | Manual |
Tested on Python 3.12.3.
pip install -r requirements.txt
Create a .env file in the project root with the following keys:
# --- External API Keys ---
# Open-Meteo weather archive (free tier works; premium recommended for large historical ingests)
# Get key: https://open-meteo.com/en/pricing
OPEN_METEO_API_KEY=your_key_here
# NASA FIRMS fire detection API
# Get key: https://firms.modaps.eosdis.nasa.gov/api/
FIRMS_MAP_KEY=your_key_here
# Nasdaq Data Link (soybean prices)
# Get key: https://data.nasdaq.com/
NASDAQ_API_KEY=your_key_here
# --- AWS Credentials ---
# Get these from your AWS console / IAM / session token service
AWS_ACCOUNT_ID=123456789012
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
AWS_SESSION_TOKEN=... # required for temporary credentials (e.g. AWS Academy)
AWS_DEFAULT_REGION=us-east-1
# --- Web App Protection ---
# Choose any strong secret — this becomes the API Gateway key required to call /map
DETER_CERRADO_API_KEY=your_secret_here
Long historical ingests (back to 2018) take up to ~6 hours. AWS temporary credentials (e.g. from AWS Academy Learner Labs) expire after 4 hours. Use AWS SSO with an extended session duration to avoid mid-run credential expiry:
aws sso login --profile your-profile
Set AWS_PROFILE=your-profile in .env instead of the individual key vars.
Open-Meteo: Free tier is rate-limited and may time out on large historical backfills. A premium subscription is recommended if ingesting multiple years of weather data.
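
Before starting a multi-hour run it can help to confirm that the .env file is complete. The sketch below uses only the standard library and is not part of the project. The key names come from the template above, but the split into `REQUIRED_KEYS` and `STATIC_AWS_KEYS` is an assumption, in particular that AWS_ACCOUNT_ID and AWS_DEFAULT_REGION are still needed when AWS_PROFILE is set. AWS_SESSION_TOKEN is treated as optional.

```python
from pathlib import Path

# Assumed grouping of the keys from the .env template above.
REQUIRED_KEYS = [
    "OPEN_METEO_API_KEY", "FIRMS_MAP_KEY", "NASDAQ_API_KEY",
    "AWS_ACCOUNT_ID", "AWS_DEFAULT_REGION", "DETER_CERRADO_API_KEY",
]
STATIC_AWS_KEYS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and full-line comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment such as "... # required for ..."
        values[key.strip()] = value.split(" #", 1)[0].strip()
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """List keys that are absent or empty."""
    missing = [k for k in REQUIRED_KEYS if not env.get(k)]
    # Either an SSO profile or static access keys must be present.
    if not env.get("AWS_PROFILE"):
        missing += [k for k in STATIC_AWS_KEYS if not env.get(k)]
    return missing


if __name__ == "__main__":
    env_file = Path(".env")
    env = parse_env(env_file.read_text()) if env_file.exists() else {}
    print("Missing keys:", missing_keys(env) or "none")
```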
MapBiomas data is a yearly static dataset and is not fetched automatically. Download it once:
python -m aws.download_mapbiomas_soy
Follow the prompts. Then move the downloaded .geojson files into aws/upload_mapbiomas/. Name each file with a year tag, e.g.:
aws/upload_mapbiomas/
mapbiomas_soy_2018.geojson
mapbiomas_soy_2019.geojson
...
mapbiomas_soy_2024.geojson
setup.py will automatically upload all files found in this directory to S3.
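
A quick pre-flight check can catch misnamed files before setup.py runs. This is a sketch under the assumption that files follow the `mapbiomas_soy_<year>.geojson` pattern shown in the example above; the real uploader may accept other names, and `years_found` and `missing_years` are hypothetical helpers.

```python
import re
from pathlib import Path

# Pattern mirrors the example filenames above (an assumption about the uploader).
NAME_RE = re.compile(r"^mapbiomas_soy_(\d{4})\.geojson$")


def years_found(filenames: list[str]) -> set[int]:
    """Extract the year tag from every filename that matches the pattern."""
    years = set()
    for name in filenames:
        match = NAME_RE.match(name)
        if match:
            years.add(int(match.group(1)))
    return years


def missing_years(filenames: list[str], expected: list[int]) -> list[int]:
    """Return the expected years that have no matching file, sorted."""
    return sorted(set(expected) - years_found(filenames))


if __name__ == "__main__":
    folder = Path("aws/upload_mapbiomas")
    names = [p.name for p in folder.glob("*")] if folder.exists() else []
    print("Missing years:", missing_years(names, list(range(2018, 2025))))
```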
Create an IAM role named LambdaExecutionRole in your AWS account with the following trust policy (allows Lambda to assume it):
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": { "Service": "lambda.amazonaws.com" },
"Action": "sts:AssumeRole"
}]
}
Attach at minimum these managed policies:
- AmazonAthenaFullAccess
- AmazonS3FullAccess
- AWSLambda_FullAccess
- AmazonSSMFullAccess
- CloudWatchLogsFullAccess

AWS Academy Learner Labs: You cannot create custom roles. Use the pre-provisioned LabRole instead. Set roles.lambda_functions: "LabRole" in aws_config.yaml.
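
Outside Learner Labs, the role can also be created from a script instead of the console. The following is a minimal sketch using boto3's `create_role` and `attach_role_policy`. It assumes the caller has IAM permissions and that each listed policy is an AWS managed policy under `arn:aws:iam::aws:policy/`. The helper names are hypothetical and this is not what setup.py does.

```python
import json

ROLE_NAME = "LambdaExecutionRole"
MANAGED_POLICIES = [
    "AmazonAthenaFullAccess",
    "AmazonS3FullAccess",
    "AWSLambda_FullAccess",
    "AmazonSSMFullAccess",
    "CloudWatchLogsFullAccess",
]


def trust_policy() -> str:
    """Trust policy that lets the Lambda service assume the role."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    })


def policy_arns() -> list[str]:
    """ARNs of the AWS managed policies listed above."""
    return [f"arn:aws:iam::aws:policy/{name}" for name in MANAGED_POLICIES]


def create_lambda_role() -> None:
    """Create the role and attach the managed policies (makes real AWS calls)."""
    import boto3  # imported lazily so the helpers above work without boto3

    iam = boto3.client("iam")
    iam.create_role(RoleName=ROLE_NAME, AssumeRolePolicyDocument=trust_policy())
    for arn in policy_arns():
        iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn=arn)
```

Calling `create_lambda_role()` a second time raises an error because the role already exists, so run it once per account.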
Open aws/aws_config.yaml and adjust:
S3 bucket names — bucket names must be globally unique across all AWS accounts. Rename them:
s3_buckets:
  raw: my-project-raw-123456
  curated: my-project-curated-123456
  refined: my-project-refined-123456
MapBiomas years to ingest:
mapbiomas_years: [2018, 2019, 2020, 2021, 2022, 2023, 2024]
Historical ingest range:
historical_ingest_start_year: 2018 # 2018 → ~6h total runtime
historical_ingest_batch_months: 3 # process in 3-month batches
Starting from 2018 ingests ~8 years of data in 3-month batches. Set historical_ingest_start_year to a later year (e.g. 2024) for a faster first run.
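
To see how the two settings interact, the date range can be split into batch windows. This is an illustrative sketch, not the project's batching code: `batch_ranges` and `add_months` are hypothetical helpers, and the windows are assumed to be half-open and to start on 1 January of the start year.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date to the first day of the month `months` months later."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def batch_ranges(start_year: int, batch_months: int, end: date) -> list[tuple[date, date]]:
    """Split [1 Jan start_year, end) into consecutive half-open windows."""
    ranges = []
    cursor = date(start_year, 1, 1)
    while cursor < end:
        # The last window is cut short so it never runs past `end`.
        nxt = min(add_months(cursor, batch_months), end)
        ranges.append((cursor, nxt))
        cursor = nxt
    return ranges


if __name__ == "__main__":
    for start, stop in batch_ranges(2024, 3, date.today()):
        print(start, "to", stop)
```

With an assumed cutoff of 2026-01-01, a 2018 start yields 32 three-month batches while a 2024 start yields 8, which is why a later start year shortens the first run.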
The config also exposes per-step settings (memory, timeout, schedule cron expressions, API parameters). Additional pipeline steps can be added by following the existing step structure under pipeline_steps.
An example file is in aws/upload_farms/farms_1.csv.
Add your own farm locations and attributes to this CSV file.
QuickSight then reports daily aggregated scores such as "sustainability risk" or "climate risk" for every farm listed in the CSV files in this directory:
aws/upload_farms/
farms_1.csv
setup.py uploads these automatically.
python -m aws.setup
This script uses boto3 to provision the entire AWS infrastructure:
- Deletes any existing buckets, Lambdas, and EventBridge schedules from a previous run
- Creates S3 buckets
- Uploads MapBiomas GeoJSON and farms CSV to S3
- Deploys all Lambda functions (packages dependencies, zips, uploads)
- Runs historical ingestion in batches for each pipeline step
- Schedules daily / weekly EventBridge triggers
- Creates API Gateway with /map endpoint (API key protected)
- Deploys the UI Lambda and wires /ui route
When complete, the script prints the web app URL:
=== Setup Complete ===
UI: https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/ui
Open that URL in a browser. Use your DETER_CERRADO_API_KEY when prompted to authenticate against the /map endpoint.
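
The /map endpoint can also be called from a script. API Gateway reads API keys from the `x-api-key` request header. The sketch below only builds the request. It assumes /map sits under the same /prod stage as /ui and takes no required query parameters, neither of which this README confirms, and `build_map_request` is a hypothetical helper.

```python
import os
import urllib.request


def build_map_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for /map carrying the API Gateway key header."""
    url = base_url.rstrip("/") + "/map"
    # API Gateway expects the key in the x-api-key header.
    return urllib.request.Request(url, headers={"x-api-key": api_key}, method="GET")


if __name__ == "__main__":
    # Replace <api-id> with the value printed by the setup script.
    base = "https://<api-id>.execute-api.us-east-1.amazonaws.com/prod"
    request = build_map_request(base, os.environ.get("DETER_CERRADO_API_KEY", ""))
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request, timeout=30).read()
```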
| Error | Cause | Fix |
|---|---|---|
| AccessDeniedException: iam:PassRole | Role doesn't exist or no PassRole permission | Create LambdaExecutionRole (Step 3) or use LabRole in config |
| ExpiredTokenException | AWS temporary credentials expired mid-run | Use AWS SSO with extended session (Step 1) |
| Lambda timeout on historical ingest | Large date range + short timeout | Reduce historical_ingest_batch_months or increase timeout_s in config |
| No GeoJSON files to upload | aws/upload_mapbiomas/ empty or wrong filenames | Follow Step 2; files must end in .geojson |