Feature/clear disk space by Caden-Helbling · Pull Request #9 · NASA-IMPACT/fm-inference-sagemaker

Caden-Helbling · 2026-01-22T22:58:38Z

Implements disk space clearing logic.

…suggested changes like the use of build over buildx and for loops. It also adds parallel builds, Docker layer caching, old image cleanup, a disk space check, local image cleanup, stale tag removal, and pruning of dangling images.
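
The disk space check mentioned above can be sketched roughly as follows. This is only an illustration: the build script's actual logic is not shown in this PR text, and the function name `has_enough_disk` and the 20 GB default threshold are hypothetical.

```python
import shutil


def has_enough_disk(path: str = ".", min_free_gb: float = 20.0) -> bool:
    """Return True when the filesystem holding `path` has at least
    `min_free_gb` gibibytes free.

    Both the name and the default threshold are assumptions made for
    this sketch; they do not come from the build script.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_gb = usage.free / (1024 ** 3)
    return free_gb >= min_free_gb


if __name__ == "__main__":
    if has_enough_disk("."):
        print("enough disk space, continuing with the build")
    else:
        print("low disk space, cleanup should run before building")
```

A build script could call a check like this before each image build and trigger the cleanup steps only when it returns False.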

* merge heads
* add status, error stage and error message in inference
* add temporal worker, activities and workflow
* add inference model in pipeline
* add temporalio
* make download as background job
* add database url
* add deployment for floods worker
* add timeout=10 mins
* add entrypoint for worker and temporal env variables
* add temporal server url to predictor app
* add temporal env variables
* add space in between key and value in configmap yaml
* fixes of namespace and address url
* add max_workers=1
* add sync wrapper to run async background job
* add address and namespace with default value
* add retrypolicy and heartbeat
* add activity heartbeat
* remove heartbeat
* improve inference activities along with single db session
* move annotations at the top
* add activity executor
* add print for debug
* add time for invocations
* removing background job
* add worker=5 for upload
* add earthaccess login
* fix earthaccess login
* fix the time measure of the process
* comment uvicorn api run
* fix entrypoint for flood
* add flood uvicorn back
* move to httpx call
* fix entrypoint.sh
* add async activity
* async fix
* increase db pool size to 50
* Modify concurrency settings in inference_worker.py: comment out max_concurrent_activities and add max_concurrent_workflow_tasks.
* Add threading
* Add threading to task workflow
* Revert to limit workflows
* Adding debug print statement for infer time
* add activity executor with task polls
* add 2 worker for flood
* add filename in debug
* add async call for infer function
* add async call of infer in activities
* add merge result as async call
* Changes
* Optimizations for performance.
* add temporal env variable to pipelines
* add entrypoint for burn scar and crop classification
* Add loading the models once
* Change gdal order to fix the version
* Add ensure load models
* Add ensure load models
* Add model loading from the worker
* add status, error stage and error message in inference
* Add login for test purposes.
* Add empty response if nothing exists.
* Remove unwanted cuda cache removal.
* Add debug for better understanding.
* Pad batches to keep batch size fixed.
* Use numpy bin count.
* Optimize postprocessing.
* Remove unwanted print.
* Remove use of streams.
* Use dataloader for prefetching tiles for better GPU utilization.
* Update preprocess to reduce intermediate I/O.
* Remove unused code and use dataloader.
* Use dataloader.
* Update logic.
* Simplify logic.
* Update shm value to 2gb for better use of prefetching.
* Compile model for surya.
* Add jitter for gpu assignment.
* Use default mode.
* Update shm to 5gb for testing purposes.
* Add activity_executor.
* Revert use of default mode.
* switch to httpx
* Add httpx
* Change replica to one
* fix port
* Make two replicas
* Only read the band that is needed.
* Reduce startup time for process spawn.
* Optimize postprocess.
* Use threadpool for upload and get qa flags.
* Add cleanups.
* Use buildx as temp fix.
* Remove imports from predictor module.
* Fix typo.
* Add basemodel.
* Add missing import.
* Remove set -e.
* Fix typo.
* Add MPS.
* Add pod affinity.
* Remove MPS related changes for now.
* Fix issue with qa flag checks.
* Add warmup for better usage of models and model compilation.
* Reuse open connections.
* Parallelize merge and crop.
* Fix issues with multiple connection initialization.
* Fix issues with get_db
* Use proper imports.
* Load model in the beginning.
* Add app startup.
* Remove startup load of model.
* Feat/add statefulset (#60)
* Adding statefulset
* Add headless service
* Add load balancer service
* Use POD_INDEX for better GPU assignment.
* Use LoadBalancer for service.
* Use proper GPU assignment.
* Remove warmup since this will be manual.
* Remove unwanted key.
* Use proper GPU IDs.
* Add print statement for better debugging.
* Use pretrained_backbone as false.
* Use lower batch size for testing.
* Revert back to 120.
* Use dynamic replica count from values file. Add env for replica counts. Add count for replicas from helm template.
* Update limits for predictor app.
* Add missing --.
* Implement Docker image cleanup in build script: added a cleanup process for old Docker image tags based on defined image digests.
* Fix/tentative changes (#61)
* Make new changes
* Make new changes
* checkpoint
* Managing connectivity
* Avoid 503s
* Fix added clients
* Update backlog and keepalive for better handling.
* Add logging to better understand issues.
* Tests
* Add orchestrator and worker.
* Add activities.
* DRY process.
* Start download background worker.
* Use entrypoint for worker start rather than lifetime.
* Fix permissions.
* Update Dockerfile to include entrypoint.sh.
* Fix path for entrypoint.
* Add HPA for prediction app
* Reduce the resources allocation
* Add resources limit to helm
---------
Co-authored-by: amarouane-ABDELHAK <am0089@uah.edu>
* Add ruff linter (#63)
* Replace docker buildx with docker build commands: Buildx is not supported in DGX
* Adjust HPA
* Fix colormap issue.
* Add load balancer to surya
* Make all inferences statefulsets instead of deployments
* Add queue service
* Add database models
* Add database pooling
* Add database pooling
* Adding queue service
* Add queuing system
* Support multiple models
* Reuse temporalio connection
* add queuing service
* Add queueing service
* Fix queue ingress
* Fix queue ingress
* [Test] Use GPU for postprocessing.
* Fix issues with race conditions.
* Use persistent worker.
* Fix preloaded events
* Fix orchestrator
* Finetune db engine and add heartbeat
* Add timeout to 2 days
* Add replicas
* Add log to check the server
* Update batch_size to 150.
* Update shm.
* Use asyncio’s functionality for threading.
* Add additional logic for batch size for h200.
* Fix preloaded event response
* Fix time on DB
* Fix event details type
* Add build and automated push
* Refactor go-services
* Fix docker build
* Fix docker build path
* Fix the go-services app name
* Change batch size
* Moving everything besides health and welcome to under v1.
* Use proper prefix.
* Fixing the orchestrator
* Fix batch size
* Fix batch size
* Fix batch size
* Make orchestrator fail if the child failed
* Increase close to timeout
* New Change
* initial interval to be less than max interval
* Fixing ruff linter
---------
Co-authored-by: Deepak <shahkrdeepak@gmail.com>
Co-authored-by: xhagrg <grg.iksha@hotmail.com>

Converts build_and_push.sh into a portable GitHub Actions workflow that builds all 7 Docker images (inference, base, floods, burn_scars, crop_classification, surya_rollout, tiler) and pushes them to AWS ECR. Supports workflow_call for cross-repo invocation, parallel builds for independent images, and optional ECR cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

In dry_run mode, AWS credential configuration, ECR login, and cache pulls are all skipped. The base image is passed to service jobs via GitHub Actions artifacts (docker save/load) instead of ECR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Static base (Dockerfile.base-static): CUDA 12.1.1, PROJ 9.3.0, GDAL 3.11.3 compilation, rebuilt only when these versions change.
Dynamic base (Dockerfile.base): just installs shared Python requirements on top of the static base (~2-3 min vs ~42 min).

Workflow changes:
- Remove all free-disk-space and cleanup steps
- Replace skip_cleanup with build_static_base input
- Add build-static-base job (runs only on demand)
- build-base now pulls pre-built static base from ECR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove the static base image and its artifact tar before docker save to free enough space for the base image export. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…port

The base image includes all static base layers (CUDA/PROJ/GDAL), so docker save produces a ~5GB tar. Only needed for dry_run; production runs push/pull via ECR and skip this step entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. Move shared/lib and shared/predictor.py into Dockerfile.base so floods, burn_scars, crop_classification don't each copy them.
2. Create pipelines/surya/Dockerfile.base with pre-downloaded HuggingFace model (~2GB), Python 3.12, and GDAL bindings. Surya service Dockerfile now only installs deps and code (~4 min vs ~14 min). Model base rebuilt only when model version changes.
3. Remove --no-cache flag from surya git install to allow layer caching.
4. Add build_surya_base workflow input and dedicated build-surya-base + build-surya jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

GitHub Actions masks job outputs that contain secret values. The base_image_ref output included the ECR URL (a secret), so it was blanked for downstream jobs. Fix: output only repo:tag and have consumers prepend the ECR URL secret themselves.

Also includes optimizations from prior commits:
- Shared code (shared/lib, predictor.py) baked into Dockerfile.base
- Surya model pre-build via Dockerfile.base and build_surya_base input
- Removed --no-cache from surya git install

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
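
The consumer side of that fix (each downstream job prepending the registry to a bare repo:tag value) might look something like the sketch below. The helper name `full_image_ref` and the example registry host are hypothetical; in the workflow itself the registry would come from a secret rather than a literal.

```python
def full_image_ref(registry: str, repo_tag: str) -> str:
    """Join a registry host and a 'repo:tag' string into a pullable reference.

    Hypothetical helper: the producing job emits only 'repo:tag' (no secret
    content, so nothing gets masked) and each consumer adds the registry.
    """
    registry = registry.strip().rstrip("/")
    repo_tag = repo_tag.strip().lstrip("/")
    if not registry:
        raise ValueError("registry must not be empty")
    if not repo_tag:
        raise ValueError("repo_tag must not be empty")
    return f"{registry}/{repo_tag}"


if __name__ == "__main__":
    # Example values only; the account ID and region are made up.
    print(full_image_ref("123456789012.dkr.ecr.us-east-1.amazonaws.com", "base:abc123"))
```

Keeping the secret out of the job output means the value passed between jobs is never blanked, while the full reference is still assembled wherever the secret is available.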

GitHub Actions skips downstream jobs when any ancestor in the needs chain was skipped, even if the direct dependency succeeded. Adding explicit success checks to build-services and build-surya-base prevents the skip from cascading through build-base. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The pyproject.toml only lists dependencies with no package source code. Editable install (-e) requires a discoverable package; non-editable just installs the dependencies. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Pulling both the cache image and base image exhausts runner disk. The base image already provides the layer cache that matters — service Dockerfiles only add a thin layer on top. Also fixes surya editable install (uv pip install . instead of -e .). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Setuptools finds /app/lib from the base image and errors on "multiple top-level packages in flat-layout." Setting packages=[] tells it this is deps-only, no package to install. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The --cache-from registry pattern requires pulling the full previous image, which combined with the base image exhausts runner disk space. On ephemeral runners without persistent Docker cache, these pulls provide minimal benefit anyway. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The surya-base image (CUDA + PROJ/GDAL + 2GB model + Python 3.12) leaves insufficient room on ubuntu-latest runners for the Surya library installation. This will not be needed on custom runners with more disk space. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- .dockerignore at root and pipelines/ excludes .git, __pycache__, .venv, docs, helm charts, etc. from Docker build context
- Remove COPY + uv pip install of empty requirements.txt from floods, burn_scars, crop_classification Dockerfiles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@main

- actions/checkout: v4 → v6
- aws-actions/configure-aws-credentials: v4 → v6
- actions/upload-artifact: v4 → v7
- actions/download-artifact: v4 → v8
- jlumbroso/free-disk-space: @main → v1.3.1

Node.js 20 is deprecated and will be removed from runners on September 16, 2026. Node.js 24 becomes default June 2, 2026.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

New detect-changes job checks which files changed and gates each build accordingly:
- base: runs if base files or any service files changed
- inference: only if Dockerfile/requirements/src changed
- services (matrix): each checks its own change_key
- surya: only if pipelines/surya/ changed
- force_all input overrides detection and builds everything

Changes to docs, workflows, or unrelated files skip all builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
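
The gating rules above can be illustrated with a small sketch. It is not the workflow's implementation: the path prefixes, the function name `detect_changes`, and the choice to treat only the three matrix services as "service files" for the base rule are all assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple

# Assumed path prefixes; the real workflow's patterns are not shown in this PR text.
SERVICE_PREFIXES: Dict[str, str] = {
    "floods": "pipelines/floods/",
    "burn_scars": "pipelines/burn_scars/",
    "crop_classification": "pipelines/crop_classification/",
}
BASE_PREFIXES: Tuple[str, ...] = ("pipelines/Dockerfile.base", "pipelines/shared/")
INFERENCE_PREFIXES: Tuple[str, ...] = ("Dockerfile", "requirements", "src/")
SURYA_PREFIX = "pipelines/surya/"


def detect_changes(changed_files: Iterable[str], force_all: bool = False) -> Dict[str, bool]:
    """Map a list of changed paths to one build/skip flag per image."""
    files = list(changed_files)

    def touched(prefixes: Tuple[str, ...]) -> bool:
        # str.startswith accepts a tuple of candidate prefixes.
        return any(f.startswith(prefixes) for f in files)

    # Each matrix service checks only its own key.
    flags = {name: touched((prefix,)) for name, prefix in SERVICE_PREFIXES.items()}
    flags["surya"] = touched((SURYA_PREFIX,))
    flags["inference"] = touched(INFERENCE_PREFIXES)
    # base runs if base files or any (matrix) service files changed.
    flags["base"] = touched(BASE_PREFIXES) or any(flags[n] for n in SERVICE_PREFIXES)

    if force_all:
        # force_all overrides detection and builds everything.
        flags = {name: True for name in flags}
    return flags


if __name__ == "__main__":
    print(detect_changes(["pipelines/floods/lib/flood_infer.py"]))
    print(detect_changes(["README.md"]))
```

With these assumed prefixes, a docs-only change sets every flag to False, which corresponds to the "skip all builds" case described above.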

Services now fall back to base:latest from ECR when build-base is skipped. This avoids a 6-9 min base rebuild when only service-specific files changed (e.g., floods/lib/). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
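
The fallback described above could be expressed along these lines. The function name `resolve_base_tag` and the example tag are hypothetical; the result strings mirror the job result values GitHub Actions reports (`success`, `failure`, `cancelled`, `skipped`).

```python
def resolve_base_tag(build_base_result: str, built_tag: str,
                     fallback_tag: str = "base:latest") -> str:
    """Pick the base image a service build should start from.

    Hypothetical helper: use the freshly built tag when build-base ran and
    succeeded, fall back to base:latest when it was skipped, and refuse to
    continue when it failed or was cancelled.
    """
    if build_base_result == "success":
        return built_tag
    if build_base_result == "skipped":
        return fallback_tag
    raise RuntimeError(f"build-base ended with result {build_base_result!r}")


if __name__ == "__main__":
    print(resolve_base_tag("success", "base:abc123"))
    print(resolve_base_tag("skipped", "base:abc123"))
```

Treating a failed or cancelled base build as an error, rather than silently falling back, keeps a service image from being built on a stale base when the new base was expected to exist.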

- Switch all jobs from ubuntu-latest to self-hosted runner
- Re-enable push trigger on dev branch
- Remove free-disk-space workarounds (not needed on custom runner)
- Revert test-only changes to floods, README, predictor.py

The runner label is set to 'self-hosted' as a placeholder; update it to your specific runner label when known.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Resolved conflicts:
- Dockerfile: took dev's entrypoint.sh approach
- build_and_push.sh: took dev's version (superseded by workflow)
- pipelines/{floods,burn_scars,crop_classification}/Dockerfile: kept our optimization (shared code in base) + added dev's new inference_worker.py and per-service entrypoint.sh
- src/api/v1/inferences.py: took dev's formatting + EMPTY_RESPONSE
- pipelines/Dockerfile.base: added inference_worker.py to shared code

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

xhagrg and others added 30 commits September 16, 2025 12:03

Make sure proper values are updated.

dc4be55

Merge branch 'feature/add-authorization' into feature/add-data-pipelines

079d08d

Merge pull request #6 from NASA-IMPACT/feature/add-data-pipelines

e9acebe

We are adding auth

Add authorize

6a05a07

Merge branch 'feature/add-authorization' of https://github.com/NASA-I…

26673bd

…MPACT/fm-inference-service into feature/add-authorization

Search for a range if timeseries is true.

d198a84

Merge branch 'feature/add-data-pipelines' of github.com:nasa-impact/f…

aad9561

…m-inference-service into feature/add-data-pipelines

Add authz

12c8e83

Fix a typo

6296f20

Allow login

0e636b0

Fix the update model

37b6fcf

Fix the update model

f778dc5

Place holder for db migration

2cc7d1a

Place holder for db migration

019b3c4

Add auth

538a047

Adding alembic.ini

4249c1f

Add build

0c03e80

Ignore alembic

d5f977c

Add auth for delete

b9aba6d

Add auth for delete

2cf31b6

Add available GPU before adding models.

18ff6a6

Handle trailing slash.

fe5193a

Move CONSTs to proper place.

ba0db52

Make sure all results are captured properly.

950909b

Remove unwanted import.

c9e1bf6

Use proper function.

62744c3

Use no slash by default.

8257868

Skip trailing slash management for now.

b90d676

Retry trailing slash.

fca9a84

Comment out middleware.

0fd02f4

Caden-Helbling and others added 30 commits March 30, 2026 11:21

Use cognito:groups from token instead of an admin cognito group call

ac4a179

Move go queuing service over to /api/predict/v2

691d91e

Disable push trigger during testing

0d3c36b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: use cognito:groups from token instead of an admin cognito grou…

4910cdc

…p call

Fix disk space exhaustion in dry_run: clean up static base before save

e0b3f41

chore: move go queuing service over to /api/predict/v2

ce8ae71

test: add comment to flood_infer.py (scenario 3 - single service change)

0c11ead

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

test: update README title (scenario 4 - docs only change)

fde4b6d

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

test: add comment to predictor.py (scenario 5 - shared base change)

74fad8c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Skip base job when only service code changes

8d25c6b

test: tweak floods comment for scenario 6

7566f61

Caden-Helbling wants to merge 697 commits into main from feature/clear_disk_space

5 participants
