Feature/clear disk space#9

Open
Caden-Helbling wants to merge 697 commits into main from feature/clear_disk_space

Conversation

@Caden-Helbling
Collaborator

Implements disk space clearing logic.

Caden-Helbling and others added 30 commits March 30, 2026 11:21
…suggested changes like the use of build over buildx and for loops. It also adds parallel builds, docker layer caching, old image cleanup, a disk space check, local image cleanup, stale tag removal, and prunes danglers
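
The disk space check mentioned above could look roughly like the following. This is a minimal sketch, not the logic in the build script: the function name `has_free_space`, the checked path, and the threshold are all assumptions made for illustration.

```python
import shutil


def has_free_space(path: str, required_gib: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gib` GiB free.

    Hypothetical helper: the name and the GiB unit are illustrative assumptions.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gib * 1024**3


# Example gate before a build; the 20 GiB threshold is an assumed value.
if not has_free_space(".", required_gib=20):
    print("Low disk space: prune old images before building")
```

A check like this is cheap enough to run before every image build, so cleanup only has to happen when the runner is actually close to full.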
* merge heads

* add status, error stage and error message in inference

* add temporal worker, activities and workflow

* add inference model in pipeline

* add temporalio

* make download as background job

* add database url

* add deployment for floods worker

* add timeout = 10 mins

* add entrypoint for worker and temporal env variables

* add temporal server url to predictor app

* add temporal env variables

* add space in between key and value in configmap yaml

* fixes of namespace and address url

* add max_workers=1

* add sync wrapper to run async background job

* add address and namespace with default value

* add retrypolicy and heartbeat

* add activity heartbeat

* remove heartbeat

* improve inference activities along with single db session

* move annotations at the top

* add activity executor

* add print for debug

* add time for invocations

* removing background job

* add worker=5 for upload

* add earthaccess login

* fix earthaccess login

* fix the time measure of the process

* comment uvicorn api run

* fix entrypoint for flood

* add flood uvicorn back

* move to httpx call

* fix entrypoint.sh

* add async activity

* async fix

* increase db pool size to 50

* Modify concurrency settings in inference_worker.py

Comment out max_concurrent_activities and add max_concurrent_workflow_tasks.

* Add threading

* Add threading to task workflow

* Revert to limit workflows

* Adding debug print statement for infer time

* add activity executor with task polls

* add 2 worker for flood

* add filename in debug

* add async call for infer function

* add async call of infer in activities

* add merge result as async call

* Changes

* Optimizations for performance.

* add temporal env variable to pipelines

* add entrypoint for burn scar and crop classification

* Add loading the models once

* Change gdal order to fix the version

* Add ensure load models

* Add ensure load models

* Add model loading from the worker

* add status, error stage and error message in inference

* Add login for test purposes.

* Add empty response if nothing exists.

* Remove unwanted cuda cache removal.

* Add debug for better understanding.

* Pad batches to keep batch size fixed.

* Use numpy bin count.

* Optimize postprocessing.

* Remove unwanted print.

* Remove use of streams.

* Use dataloader for prefetching tiles for better GPU utilization.

* Update preprocess to reduce intermediate I/O.

* Remove unused code and use dataloader.

* Use dataloader.

* Update logic.

* Simplify logic.

* Update shm value to 2gb for better use of prefetching.

* Compile model for surya.

* Add jitter for gpu assignment.

* Use default mode.

* Update shm to 5gb for testing purposes.

* Add activity_executor.

* Revert use of default mode.

* switch to httpx

* Add httpx

* Change replica to one

* fix port

* Make two replicas

* Only read the band that is needed.

* Reduce startup time for process spawn.

* Optimize postprocess.

* Use threadpool for upload and get qa flags.

* Add cleanups.

* Use buildx as temp fix.

* Remove imports from predictor module.

* Fix typo.

* Add basemodel.

* Add missing import.

* Remove set -e.

* Fix typo.

* Add MPS.

* Add pod affinity.

* Remove MPS related changes for now.

* Fix issue with qa flag checks.

* Add warmup for better usage of models and model compilation.

* Reuse open connections.

* Parallelize merge and crop.

* Fix issues with multiple connection initialization.

* Fix issues with get_db

* Use proper imports.

* Load model in the beginning.

* Add app startup.

* Remove startup load of model.

* Feat/add statefulset (#60)

* Adding statefulset

* Add headless service

* Add load balancer service

* Use POD_INDEX for better GPU assignment.

* Use LoadBalancer for service.

* Use proper GPU assignment.

* Remove warmup since this will be manual.

* Remove unwanted key.

* Use proper GPU IDs.

* Add print statement for better debugging.

* Use pretrained_backbone as false.

* Use lower batch size for testing.

* Revert back to 120.

* Use dynamic replica count from values file

Add env for replica counts

Add count for replicas from helm template

* Update limits for predictor app.

* Add missing --.

* Implement Docker image cleanup in build script

Added a cleanup process for old Docker image tags based on defined image digests.

* Fix/tentative changes (#61)

* Make new changes

* Make new changes

* checkpoint

* Managing connectivity

* Avoid 503s

* Fix added clients

* Update backlog and keepalive for better handling.

* Add logging to better understand issues.

* Tests

* Add orchestrator and worker.

* Add activities.

* DRY process.

* Start download background worker.

* Use entrypoint for worker start rather than lifetime.

* Fix permissions.

* Update Dockerfile to include entrypoint.sh.

* Fix path for entrypoint.

* Add HPA for prediction app

* Reduce the resources allocation

* Add resources limit to helm

---------

Co-authored-by: amarouane-ABDELHAK <am0089@uah.edu>

* Add ruff linter (#63)

* Replace docker buildx with docker build commands

Buildx is not supported in DGX

* Adjust HPA

* Fix colormap issue.

* Add load balancer to surya

* Make all inferences a statefulsets instead of deployments

* Add queue service

* Add database models

* Add database pooling

* Add database pooling

* Adding queue service

* Add queuing system

* Support multiple models

* Reuse temporalio connection

* add queuing service

* Add queueing service

* Fix queue ingress

* Fix queue ingress

* [Test] Use GPU for postprocessing.

* Fix issues with race conditions.

* Use persistent worker.

* Fix preloaded events

* Fix orchestrator

* Finetune db engine and add heartbeat

* Add timeout to 2 days

* Add replicas

* Add log to check the server

* Update batch_size to 150.

* Update shm.

* Use asyncio’s functionality for threading.

* Add additional logic for batch size for h200.

* Fix preloaded event response

* Fix time on DB

* Fix event details type

* Add build and automated push

* Refactor go-services

* Fix docker build

* Fix docker build path

* Fix the go-services app name

* Change batch size

* Moving everything besides health and welcome to under v1.

* Use proper prefix.

* Fixing the orchestrator

* Fix batch size

* Fix batch size

* Fix batch size

* Make orchestrator fail if the child failed

* Increase close to timeout

* New Change

* initial interval to be less than max interval

* Fixing ruff linter

---------

Co-authored-by: Deepak <shahkrdeepak@gmail.com>
Co-authored-by: xhagrg <grg.iksha@hotmail.com>
Converts build_and_push.sh into a portable GitHub Actions workflow that
builds all 7 Docker images (inference, base, floods, burn_scars,
crop_classification, surya_rollout, tiler) and pushes them to AWS ECR.

Supports workflow_call for cross-repo invocation, parallel builds for
independent images, and optional ECR cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In dry_run mode, AWS credential configuration, ECR login, and cache
pulls are all skipped. The base image is passed to service jobs via
GitHub Actions artifacts (docker save/load) instead of ECR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Static base (Dockerfile.base-static): CUDA 12.1.1, PROJ 9.3.0,
GDAL 3.11.3 compilation — rebuilt only when these versions change.

Dynamic base (Dockerfile.base): just installs shared Python
requirements on top of the static base (~2-3 min vs ~42 min).

Workflow changes:
- Remove all free-disk-space and cleanup steps
- Replace skip_cleanup with build_static_base input
- Add build-static-base job (runs only on demand)
- build-base now pulls pre-built static base from ECR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the static base image and its artifact tar before docker save
to free enough space for the base image export.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port

The base image includes all static base layers (CUDA/PROJ/GDAL) so
docker save produces a ~5GB tar. Only needed for dry_run — production
runs push/pull via ECR and skip this step entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Move shared/lib and shared/predictor.py into Dockerfile.base so
   floods, burn_scars, crop_classification don't each copy them.

2. Create pipelines/surya/Dockerfile.base with pre-downloaded
   HuggingFace model (~2GB), Python 3.12, and GDAL bindings.
   Surya service Dockerfile now only installs deps and code (~4 min
   vs ~14 min). Model base rebuilt only when model version changes.

3. Remove --no-cache flag from surya git install to allow layer caching.

4. Add build_surya_base workflow input and dedicated build-surya-base
   + build-surya jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub Actions masks job outputs that contain secret values. The
base_image_ref output included the ECR URL (a secret), so it was
blanked for downstream jobs. Fix: output only repo:tag and have
consumers prepend the ECR URL secret themselves.

Also includes optimizations from prior commits:
- Shared code (shared/lib, predictor.py) baked into Dockerfile.base
- Surya model pre-build via Dockerfile.base and build_surya_base input
- Removed --no-cache from surya git install

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
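
The masked-output fix described above (emit only repo:tag, let each consumer prepend the registry URL) can be sketched as two small helpers. Both function names and the example registry host are hypothetical; only the split between a non-secret `repo:tag` output and a secret registry URL comes from the commit message.

```python
def base_image_output(repo: str, tag: str) -> str:
    """Producer side: value exposed as the job output.

    It contains no registry URL, so it holds no secret value for
    GitHub Actions to mask.
    """
    return f"{repo}:{tag}"


def full_image_ref(registry_url: str, repo_tag: str) -> str:
    """Consumer side: prepend the registry URL (read from the secret) to repo:tag."""
    return f"{registry_url.rstrip('/')}/{repo_tag}"


# Illustrative values only; the registry host here is a placeholder.
ref = full_image_ref("registry.example.com", base_image_output("base", "abc123"))
print(ref)
```

Keeping the secret out of the output means downstream jobs receive a usable value and rebuild the full reference locally.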
GitHub Actions skips downstream jobs when any ancestor in the needs
chain was skipped, even if the direct dependency succeeded. Adding
explicit success checks to build-services and build-surya-base
prevents the skip from cascading through build-base.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pyproject.toml only lists dependencies with no package source code.
Editable install (-e) requires a discoverable package; non-editable
just installs the dependencies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pulling both the cache image and base image exhausts runner disk.
The base image already provides the layer cache that matters —
service Dockerfiles only add a thin layer on top.

Also fixes surya editable install (uv pip install . instead of -e .).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Setuptools finds /app/lib from the base image and errors on
"multiple top-level packages in flat-layout." Setting packages=[]
tells it this is deps-only, no package to install.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The --cache-from registry pattern requires pulling the full previous
image, which combined with the base image exhausts runner disk space.
On ephemeral runners without persistent Docker cache, these pulls
provide minimal benefit anyway.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The surya-base image (CUDA + PROJ/GDAL + 2GB model + Python 3.12)
leaves insufficient room on ubuntu-latest runners for the Surya
library installation. This will not be needed on custom runners
with more disk space.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- .dockerignore at root and pipelines/ excludes .git, __pycache__,
  .venv, docs, helm charts, etc. from Docker build context
- Remove COPY + uv pip install of empty requirements.txt from
  floods, burn_scars, crop_classification Dockerfiles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- actions/checkout: v4 → v6
- aws-actions/configure-aws-credentials: v4 → v6
- actions/upload-artifact: v4 → v7
- actions/download-artifact: v4 → v8
- jlumbroso/free-disk-space: @main → v1.3.1

Node.js 20 is deprecated and will be removed from runners on
September 16, 2026. Node.js 24 becomes default June 2, 2026.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New detect-changes job checks which files changed and gates each
build accordingly:
- base: runs if base files or any service files changed
- inference: only if Dockerfile/requirements/src changed
- services (matrix): each checks its own change_key
- surya: only if pipelines/surya/ changed
- force_all input overrides detection and builds everything

Changes to docs, workflows, or unrelated files skip all builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
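
The gating rules in the detect-changes commit above can be modeled as a small function from changed paths to build decisions. This is a sketch under assumptions: the path prefixes, the `plan_builds` name, and the choice to count only the three matrix services as "service files" for the base rule are illustrative guesses, not the workflow's actual filters.

```python
from typing import Dict, Iterable, Tuple

# Assumed path layout; the real workflow's filters may differ.
BASE_PREFIXES: Tuple[str, ...] = ("pipelines/Dockerfile.base", "pipelines/shared/")
INFERENCE_PREFIXES: Tuple[str, ...] = ("Dockerfile", "requirements", "src/")
SERVICE_PREFIXES: Dict[str, str] = {
    "floods": "pipelines/floods/",
    "burn_scars": "pipelines/burn_scars/",
    "crop_classification": "pipelines/crop_classification/",
}
SURYA_PREFIX = "pipelines/surya/"


def _touched(files: Iterable[str], prefixes: Tuple[str, ...]) -> bool:
    """True if any changed file starts with one of the given prefixes."""
    return any(f.startswith(prefixes) for f in files)


def plan_builds(changed_files: Iterable[str], force_all: bool = False) -> Dict[str, bool]:
    """Map each build job to whether it should run for this change set."""
    files = list(changed_files)
    # Each matrix service checks only its own prefix (its change key).
    services = {
        name: force_all or _touched(files, (prefix,))
        for name, prefix in SERVICE_PREFIXES.items()
    }
    plan = {
        # Base runs if base files or any service files changed.
        "base": force_all or _touched(files, BASE_PREFIXES) or any(services.values()),
        "inference": force_all or _touched(files, INFERENCE_PREFIXES),
        "surya": force_all or _touched(files, (SURYA_PREFIX,)),
    }
    plan.update(services)
    return plan
```

With these assumed prefixes, a change under `docs/` or `.github/workflows/` yields a plan where every entry is False, which matches the "skip all builds" behavior described in the commit.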
Services now fall back to base:latest from ECR when build-base is
skipped. This avoids a 6-9 min base rebuild when only service-specific
files changed (e.g., floods/lib/).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
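
The fallback described above amounts to a small decision on the result of the build-base job. The sketch below is illustrative: `resolve_base_image`, its parameters, and the choice to raise on a failed base build are assumptions, while the `base:latest` fallback comes from the commit message.

```python
from typing import Optional


def resolve_base_image(build_base_result: str, built_ref: Optional[str], repo: str = "base") -> str:
    """Pick the base image reference (repo:tag) a service build should use.

    build_base_result mirrors a job result string such as "success" or "skipped".
    """
    if build_base_result == "success" and built_ref:
        # A fresh base was built in this run, so use it.
        return built_ref
    if build_base_result == "skipped":
        # No base rebuild was needed; reuse the last published image.
        return f"{repo}:latest"
    # Assumed behavior: a failed or cancelled base build should stop service builds.
    raise RuntimeError(f"base build did not complete: {build_base_result}")
```

For example, `resolve_base_image("skipped", None)` selects `base:latest`, so a change limited to one service's files does not have to wait for a base rebuild.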
- Switch all jobs from ubuntu-latest to self-hosted runner
- Re-enable push trigger on dev branch
- Remove free-disk-space workarounds (not needed on custom runner)
- Revert test-only changes to floods, README, predictor.py

The runner label is set to 'self-hosted' as a placeholder — update
to your specific runner label when known.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolved conflicts:
- Dockerfile: took dev's entrypoint.sh approach
- build_and_push.sh: took dev's version (superseded by workflow)
- pipelines/{floods,burn_scars,crop_classification}/Dockerfile: kept
  our optimization (shared code in base) + added dev's new
  inference_worker.py and per-service entrypoint.sh
- src/api/v1/inferences.py: took dev's formatting + EMPTY_RESPONSE
- pipelines/Dockerfile.base: added inference_worker.py to shared code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5 participants