paACode/dwl_project_queryosity

CerradoSignal

HSLU course: Data Warehouse and Data Lake

Cloud-based data lake on AWS that ingests satellite deforestation alerts, fire detections, weather data, and soybean prices from Brazil's Cerrado biome. Exposes an interactive map via a serverless API and a QuickSight dashboard with aggregated risk scores per farm, updated daily.

Architecture

External APIs                  AWS (Lambda + S3 + Athena)             Web
─────────────────              ──────────────────────────             ────
INPE / TerraBrasilis  ──────►  ingest_deter_cerrado  ──► S3 raw  ──►  transform_deter_cerrado  ──► S3 curated
NASA FIRMS            ──────►  ingest_firms          ──► S3 raw  ──►  transform_firms
Open-Meteo            ──────►  ingest_openmeteo      ──► S3 raw  ──►  transform_openmeteo
IPEADATA (soybean)    ──────►  ingest_soybean        ──► S3 raw  ──►  transform_soybean
MapBiomas (manual)    ──────►  S3 raw                             ──►  transform_mapbiomas
Farms CSV (manual)    ──────►  S3 raw                             ──►  transform_farms
                                                                        │
                                                              Athena (setup_athena + setup_gold)
                                                                        │
                                        ┌───────────────────────────────┤
                                        │                               │
                              query_map Lambda ──► API Gateway ──► Browser (interactive map)
                              QuickSight ──────────────────────────────► Dashboard (farm risk scores, daily)

Each ingest Lambda runs on a schedule (daily / weekly). Transforms run 30 min later. Athena partitions are registered at 20:00 UTC daily; gold views refresh at 20:30 UTC. The QuickSight dashboard reads from the gold layer and shows per-farm aggregated scores (sustainability risk, climate risk, etc.) updated daily.
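
The daily timings above can be written as EventBridge cron expressions. The following is a minimal sketch, assuming the schedules are stored as EventBridge `cron(...)` strings; the helper name `daily_cron` and the two constants are hypothetical and not taken from the repository.

```python
def daily_cron(hour: int, minute: int = 0) -> str:
    """Build an EventBridge cron expression that fires once a day (UTC).

    EventBridge uses six fields: minute hour day-of-month month day-of-week year.
    One of day-of-month / day-of-week must be '?'.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return f"cron({minute} {hour} * * ? *)"


# Timings taken from the paragraph above (names are illustrative only).
PARTITION_SCHEDULE = daily_cron(20, 0)   # register Athena partitions
GOLD_SCHEDULE = daily_cron(20, 30)       # refresh gold views
```

Keeping the gold refresh 30 minutes after partition registration gives the first job time to finish before the views are rebuilt.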


Data Sources

Dataset               Source                              Refresh
────────────────────  ──────────────────────────────────  ───────────────────────
Deforestation alerts  INPE / TerraBrasilis DETER Cerrado  Daily
Fire detections       NASA FIRMS VIIRS NOAA-20            Daily
Weather archive       Open-Meteo                          Weekly
Soybean prices        IPEADATA                            Daily
Soybean land use      MapBiomas                           Manual (yearly release)
Farm boundaries       Static CSV                          Manual

Setup Instructions

Step 0 — Install dependencies

Tested on Python 3.12.3.

pip install -r requirements.txt

Step 1 — Create a .env file

Create a .env file in the project root with the following keys:

# --- External API Keys ---

# Open-Meteo weather archive (free tier works; premium recommended for large historical ingests)
# Get key: https://open-meteo.com/en/pricing
OPEN_METEO_API_KEY=your_key_here

# NASA FIRMS fire detection API
# Get key: https://firms.modaps.eosdis.nasa.gov/api/
FIRMS_MAP_KEY=your_key_here

# Nasdaq Data Link (soybean prices)
# Get key: https://data.nasdaq.com/
NASDAQ_API_KEY=your_key_here

# --- AWS Credentials ---
# Get these from your AWS console / IAM / session token service
AWS_ACCOUNT_ID=123456789012
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=...
AWS_SESSION_TOKEN=...            # required for temporary credentials (e.g. AWS Academy)
AWS_DEFAULT_REGION=us-east-1

# --- Web App Protection ---
# Choose any strong secret — this becomes the API Gateway key required to call /map
DETER_CERRADO_API_KEY=your_secret_here

Long historical ingests (back to 2018) take about 6 hours, while AWS temporary credentials (e.g. from AWS Academy Learner Labs) expire after 4 hours. To avoid credentials expiring mid-run, use AWS SSO with an extended session duration:

aws sso login --profile your-profile

Set AWS_PROFILE=your-profile in .env instead of the individual key vars.

Open-Meteo: Free tier is rate-limited and may time out on large historical backfills. A premium subscription is recommended if ingesting multiple years of weather data.
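
A quick pre-flight check can catch a missing key before a multi-hour run starts. The sketch below assumes the .env values have already been loaded into a dictionary (for example from `os.environ`), and that `AWS_PROFILE` replaces only the access key variables; the function name `missing_env_keys` is hypothetical and not part of the project.

```python
# Keys listed in the .env template above.
API_KEYS = [
    "OPEN_METEO_API_KEY",
    "FIRMS_MAP_KEY",
    "NASDAQ_API_KEY",
    "DETER_CERRADO_API_KEY",
]
GENERAL_AWS_KEYS = ["AWS_ACCOUNT_ID", "AWS_DEFAULT_REGION"]
STATIC_AWS_KEYS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def missing_env_keys(env: dict) -> list:
    """Return the names of required settings that are absent or empty."""
    missing = [k for k in API_KEYS + GENERAL_AWS_KEYS if not env.get(k)]
    # Assumption: either a named profile (SSO) or static access keys suffice.
    if not env.get("AWS_PROFILE"):
        missing += [k for k in STATIC_AWS_KEYS if not env.get(k)]
    return missing
```

Calling `missing_env_keys(dict(os.environ))` returns an empty list when everything is set. `AWS_SESSION_TOKEN` is left out of the check because it is only needed for temporary credentials.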


Step 2 — Download MapBiomas soybean data

MapBiomas data is a yearly static dataset and is not fetched automatically. Download it once:

python -m aws.download_mapbiomas_soy

Follow the prompts. Then move the downloaded .geojson files into aws/upload_mapbiomas/. Name each file with a year tag, e.g.:

aws/upload_mapbiomas/
  mapbiomas_soy_2018.geojson
  mapbiomas_soy_2019.geojson
  ...
  mapbiomas_soy_2024.geojson

setup.py will automatically upload all files found in this directory to S3.
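
Before running setup, it can help to confirm that every configured year has a file. This is a minimal sketch, assuming the naming pattern `mapbiomas_soy_<year>.geojson` shown in the example listing; the pattern and the function names are illustrative and may not match what setup.py actually checks.

```python
import re

# Assumed pattern, derived from the example filenames above.
PATTERN = re.compile(r"^mapbiomas_soy_(\d{4})\.geojson$")


def years_found(filenames) -> set:
    """Extract the year tag from every correctly named file."""
    years = set()
    for name in filenames:
        match = PATTERN.match(name)
        if match:
            years.add(int(match.group(1)))
    return years


def missing_years(filenames, expected_years) -> list:
    """Return the expected years that have no matching GeoJSON file."""
    return sorted(set(expected_years) - years_found(filenames))
```

For example, pass the names from `pathlib.Path("aws/upload_mapbiomas").glob("*.geojson")` together with the `mapbiomas_years` list from the config; an empty result means no year is missing.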


Step 3 — Create an IAM role on AWS

Create an IAM role named LambdaExecutionRole in your AWS account with the following trust policy (allows Lambda to assume it):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "lambda.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}

Attach at minimum these managed policies:

  • AmazonAthenaFullAccess
  • AmazonS3FullAccess
  • AWSLambda_FullAccess
  • AmazonSSMFullAccess
  • CloudWatchLogsFullAccess

AWS Academy Learner Labs: You cannot create custom roles. Use the pre-provisioned LabRole instead. Set roles.lambda_functions: "LabRole" in aws_config.yaml.
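
The role can also be created from Python instead of the console. The sketch below assumes a boto3 IAM client is passed in and that the caller has permission to create roles (which is not the case in AWS Academy Learner Labs); the function `create_lambda_role` is hypothetical and not part of the project's setup script.

```python
import json

ROLE_NAME = "LambdaExecutionRole"

# Trust policy from the step above: lets the Lambda service assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

MANAGED_POLICIES = [
    "AmazonAthenaFullAccess",
    "AmazonS3FullAccess",
    "AWSLambda_FullAccess",
    "AmazonSSMFullAccess",
    "CloudWatchLogsFullAccess",
]


def create_lambda_role(iam_client, role_name: str = ROLE_NAME) -> None:
    """Create the role, then attach the managed policies listed above."""
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
    )
    for policy in MANAGED_POLICIES:
        iam_client.attach_role_policy(
            RoleName=role_name,
            PolicyArn=f"arn:aws:iam::aws:policy/{policy}",
        )
```

With boto3 installed, `create_lambda_role(boto3.client("iam"))` would perform the calls. Passing the client in keeps the function easy to exercise with a stub.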


Step 4 — Configure aws/aws_config.yaml

Open aws/aws_config.yaml and adjust:

S3 bucket names — bucket names must be globally unique across all AWS accounts. Rename them:

s3_buckets:
  raw:     my-project-raw-123456
  curated: my-project-curated-123456
  refined: my-project-refined-123456
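
Renamed buckets must still follow the S3 naming rules. A minimal sketch of a local check, covering only the core rules (length, allowed characters, first and last character); the function name `is_valid_bucket_name` is hypothetical, and global uniqueness cannot be verified locally.

```python
import re

# Core S3 naming rules: 3-63 characters; lowercase letters, digits, dots and
# hyphens only; must start and end with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def is_valid_bucket_name(name: str) -> bool:
    """Check a name against the core S3 rules (not the full rule set)."""
    return bool(_BUCKET_RE.match(name)) and ".." not in name
```

A name that passes this check can still be rejected by AWS if another account already owns it, so adding an account-specific suffix as in the example above remains a sensible habit.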

MapBiomas years to ingest:

mapbiomas_years: [2018, 2019, 2020, 2021, 2022, 2023, 2024]

Historical ingest range:

historical_ingest_start_year: 2018   # 2018 → ~6h total runtime
historical_ingest_batch_months: 3    # process in 3-month batches

Starting from 2018 ingests ~8 years of data in 3-month batches. Set historical_ingest_start_year to a later year (e.g. 2024) for a faster first run.
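
To illustrate how the two settings interact, the sketch below splits a date range into consecutive batches. It is an assumption about how batching could work, not the project's actual implementation; `month_batches` and its parameters are hypothetical names.

```python
from datetime import date


def month_batches(start_year: int, end: date, batch_months: int = 3):
    """Split [Jan 1 of start_year, end) into consecutive batches.

    Returns a list of (batch_start, batch_end) pairs; batch_end is exclusive.
    """
    batches = []
    current = date(start_year, 1, 1)
    while current < end:
        # Advance by batch_months, carrying any overflow into the year.
        months = current.month - 1 + batch_months
        nxt = date(current.year + months // 12, months % 12 + 1, 1)
        batches.append((current, min(nxt, end)))
        current = nxt
    return batches
```

Under these assumptions, eight full years in 3-month batches come to 32 batches, while starting in 2024 cuts the count to a handful, which is why a later start year shortens the first run.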

The config also exposes per-step settings (memory, timeout, schedule cron expressions, API parameters). Additional pipeline steps can be added by following the existing step structure under pipeline_steps.


Step 5 — Upload farms data for the aggregated QuickSight dashboard

An example file is provided at aws/upload_farms/farms_1.csv. Add your own farm locations and attributes to this CSV file. QuickSight then shows aggregated scores such as "sustainability risk" or "climate risk" for each farm, updated daily. The upload directory should look like this:

aws/upload_farms/
  farms_1.csv

setup.py uploads these automatically.


Step 6 — Run setup

python -m aws.setup

This script uses boto3 to provision the entire AWS infrastructure:

  1. Deletes any existing buckets, Lambdas, and EventBridge schedules from a previous run
  2. Creates S3 buckets
  3. Uploads MapBiomas GeoJSON and farms CSV to S3
  4. Deploys all Lambda functions (packages dependencies, zips, uploads)
  5. Runs historical ingestion in batches for each pipeline step
  6. Schedules daily / weekly EventBridge triggers
  7. Creates API Gateway with /map endpoint (API key protected)
  8. Deploys the UI Lambda and wires /ui route

When complete, the script prints the web app URL:

=== Setup Complete ===
UI: https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/ui

Open that URL in a browser. Use your DETER_CERRADO_API_KEY when prompted to authenticate against the /map endpoint.
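
The /map endpoint can also be called outside the browser. The sketch below only builds the request; it assumes the key is sent in API Gateway's standard `x-api-key` header and that /map lives under the same `/prod` stage as the UI URL. Both are assumptions, and `build_map_request` is a hypothetical helper. Query parameters are not documented here, so none are added.

```python
from urllib.request import Request


def build_map_request(base_url: str, api_key: str) -> Request:
    """Prepare a GET request for the /map endpoint with the API key attached."""
    url = base_url.rstrip("/") + "/map"
    # Assumption: the key travels in API Gateway's x-api-key header.
    return Request(url, headers={"x-api-key": api_key}, method="GET")
```

Pass the stage URL printed by setup (without the trailing /ui) as `base_url` and your DETER_CERRADO_API_KEY as `api_key`, then send the result with `urllib.request.urlopen`.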


Troubleshooting

Error                                Cause                                           Fix
───────────────────────────────────  ──────────────────────────────────────────────  ─────────────────────────────────────────────────────────────────────
AccessDeniedException: iam:PassRole  Role doesn't exist or no PassRole permission    Create LambdaExecutionRole (Step 3) or use LabRole in config
ExpiredTokenException                AWS temporary credentials expired mid-run       Use AWS SSO with extended session (Step 1)
Lambda timeout on historical ingest  Large date range + short timeout                Reduce historical_ingest_batch_months or increase timeout_s in config
No GeoJSON files to upload           aws/upload_mapbiomas/ empty or wrong filenames  Follow Step 2; files must end in .geojson
