Feature/clear disk space #9
Open
Caden-Helbling wants to merge 697 commits into
We are adding auth
…MPACT/fm-inference-service into feature/add-authorization
…m-inference-service into feature/add-data-pipelines
…suggested changes like the use of build over buildx and for loops. It also adds parallel builds, Docker layer caching, old image cleanup, a disk space check, local image cleanup, stale tag removal, and pruning of dangling images
* merge heads
* add status, error stage and error message in inference
* add temporal worker, activities and workflow
* add inference model in pipeline
* add temporalio
* make download a background job
* add database url
* add deployment for floods worker
* add timeout = 10 mins
* add entrypoint for worker and temporal env variables
* add temporal server url to predictor app
* add temporal env variables
* add space in between key and value in configmap yaml
* fixes of namespace and address url
* add max_workers=1
* add sync wrapper to run async background job
* add address and namespace with default value
* add retry policy and heartbeat
* add activity heartbeat
* remove heartbeat
* improve inference activities along with single db session
* move annotations to the top
* add activity executor
* add print for debug
* add time for invocations
* removing background job
* add worker=5 for upload
* add earthaccess login
* fix earthaccess login
* fix the time measure of the process
* comment uvicorn api run
* fix entrypoint for flood
* add flood uvicorn back
* move to httpx call
* fix entrypoint.sh
* add async activity
* async fix
* increase db pool size to 50
* Modify concurrency settings in inference_worker.py: comment out max_concurrent_activities and add max_concurrent_workflow_tasks.
* Add threading
* Add threading to task workflow
* Revert to limit workflows
* Adding debug print statement for infer time
* add activity executor with task polls
* add 2 workers for flood
* add filename in debug
* add async call for infer function
* add async call of infer in activities
* add merge result as async call
* Changes
* Optimizations for performance.
* add temporal env variable to pipelines
* add entrypoint for burn scar and crop classification
* Add loading the models once
* Change gdal order to fix the version
* Add ensure load models
* Add ensure load models
* Add model loading from the worker
* add status, error stage and error message in inference
* Add login for test purposes.
* Add empty response if nothing exists.
* Remove unwanted cuda cache removal.
* Add debug for better understanding.
* Pad batches to keep batch size fixed.
* Use numpy bin count.
* Optimize postprocessing.
* Remove unwanted print.
* Remove use of streams.
* Use dataloader for prefetching tiles for better GPU utilization.
* Update preprocess to reduce intermediate I/O.
* Remove unused code and use dataloader.
* Use dataloader.
* Update logic.
* Simplify logic.
* Update shm value to 2gb for better use of prefetching.
* Compile model for surya.
* Add jitter for gpu assignment.
* Use default mode.
* Update shm to 5gb for testing purposes.
* Add activity_executor.
* Revert use of default mode.
* switch to httpx
* Add httpx
* Change replica to one
* fix port
* Make two replicas
* Only read the band that is needed.
* Reduce startup time for process spawn.
* Optimize postprocess.
* Use threadpool for upload and get qa flags.
* Add cleanups.
* Use buildx as temp fix.
* Remove imports from predictor module.
* Fix typo.
* Add basemodel.
* Add missing import.
* Remove set -e.
* Fix typo.
* Add MPS.
* Add pod affinity.
* Remove MPS related changes for now.
* Fix issue with qa flag checks.
* Add warmup for better usage of models and model compilation.
* Reuse open connections.
* Parallelize merge and crop.
* Fix issues with multiple connection initialization.
* Fix issues with get_db
* Use proper imports.
* Load model in the beginning.
* Add app startup.
* Remove startup load of model.
* Feat/add statefulset (#60)
* Adding statefulset
* Add headless service
* Add load balancer service
* Use POD_INDEX for better GPU assignment.
* Use LoadBalancer for service.
* Use proper GPU assignment.
* Remove warmup since this will be manual.
* Remove unwanted key.
* Use proper GPU IDs.
* Add print statement for better debugging.
* Use pretrained_backbone as false.
* Use lower batch size for testing.
* Revert back to 120.
* Use dynamic replica count from values file. Add env for replica counts. Add count for replicas from helm template.
* Update limits for predictor app.
* Add missing --.
* Implement Docker image cleanup in build script. Added a cleanup process for old Docker image tags based on defined image digests.
* Fix/tentative changes (#61)
* Make new changes
* Make new changes
* checkpoint
* Managing connectivity
* Avoid 503s
* Fix added clients
* Update backlog and keepalive for better handling.
* Add logging to better understand issues.
* Tests
* Add orchestrator and worker.
* Add activities.
* DRY process.
* Start download background worker.
* Use entrypoint for worker start rather than lifetime.
* Fix permissions.
* Update Dockerfile to include entrypoint.sh.
* Fix path for entrypoint.
* Add HPA for prediction app
* Reduce the resource allocation
* Add resources limit to helm

---------

Co-authored-by: amarouane-ABDELHAK <am0089@uah.edu>

* Add ruff linter (#63)
* Replace docker buildx with docker build commands. Buildx is not supported in DGX
* Adjust HPA
* Fix colormap issue.
* Add load balancer to surya
* Make all inferences statefulsets instead of deployments
* Add queue service
* Add database models
* Add database pooling
* Add database pooling
* Adding queue service
* Add queuing system
* Support multiple models
* Reuse temporalio connection
* add queuing service
* Add queueing service
* Fix queue ingress
* Fix queue ingress
* [Test] Use GPU for postprocessing.
* Fix issues with race conditions.
* Use persistent worker.
* Fix preloaded events
* Fix orchestrator
* Fine-tune db engine and add heartbeat
* Add timeout to 2 days
* Add replicas
* Add log to check the server
* Update batch_size to 150.
* Update shm.
* Use asyncio’s functionality for threading.
* Add additional logic for batch size for h200.
* Fix preloaded event response
* Fix time on DB
* Fix event details type
* Add build and automated push
* Refactor go-services
* Fix docker build
* Fix docker build path
* Fix the go-services app name
* Change batch size
* Moving everything besides health and welcome under v1.
* Use proper prefix.
* Fixing the orchestrator
* Fix batch size
* Fix batch size
* Fix batch size
* Make orchestrator fail if the child failed
* Increase close to timeout
* New Change
* initial interval to be less than max interval
* Fixing ruff linter

---------

Co-authored-by: Deepak <shahkrdeepak@gmail.com>
Co-authored-by: xhagrg <grg.iksha@hotmail.com>
Converts build_and_push.sh into a portable GitHub Actions workflow that builds all 7 Docker images (inference, base, floods, burn_scars, crop_classification, surya_rollout, tiler) and pushes them to AWS ECR. Supports workflow_call for cross-repo invocation, parallel builds for independent images, and optional ECR cleanup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In dry_run mode, AWS credential configuration, ECR login, and cache pulls are all skipped. The base image is passed to service jobs via GitHub Actions artifacts (docker save/load) instead of ECR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Static base (Dockerfile.base-static): CUDA 12.1.1, PROJ 9.3.0, GDAL 3.11.3 compilation, rebuilt only when these versions change.

Dynamic base (Dockerfile.base): just installs shared Python requirements on top of the static base (~2-3 min vs ~42 min).

Workflow changes:
- Remove all free-disk-space and cleanup steps
- Replace skip_cleanup with build_static_base input
- Add build-static-base job (runs only on demand)
- build-base now pulls pre-built static base from ECR

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the static base image and its artifact tar before docker save to free enough space for the base image export. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…port

The base image includes all static base layers (CUDA/PROJ/GDAL) so docker save produces a ~5GB tar. Only needed for dry_run; production runs push/pull via ECR and skip this step entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. Move shared/lib and shared/predictor.py into Dockerfile.base so floods, burn_scars, crop_classification don't each copy them.
2. Create pipelines/surya/Dockerfile.base with pre-downloaded HuggingFace model (~2GB), Python 3.12, and GDAL bindings. Surya service Dockerfile now only installs deps and code (~4 min vs ~14 min). Model base rebuilt only when model version changes.
3. Remove --no-cache flag from surya git install to allow layer caching.
4. Add build_surya_base workflow input and dedicated build-surya-base + build-surya jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub Actions masks job outputs that contain secret values. The base_image_ref output included the ECR URL (a secret), so it was blanked for downstream jobs. Fix: output only repo:tag and have consumers prepend the ECR URL secret themselves.

Also includes optimizations from prior commits:
- Shared code (shared/lib, predictor.py) baked into Dockerfile.base
- Surya model pre-build via Dockerfile.base and build_surya_base input
- Removed --no-cache from surya git install

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub Actions skips downstream jobs when any ancestor in the needs chain was skipped, even if the direct dependency succeeded. Adding explicit success checks to build-services and build-surya-base prevents the skip from cascading through build-base. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pyproject.toml only lists dependencies with no package source code. Editable install (-e) requires a discoverable package; non-editable just installs the dependencies. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pulling both the cache image and base image exhausts runner disk. The base image already provides the layer cache that matters — service Dockerfiles only add a thin layer on top. Also fixes surya editable install (uv pip install . instead of -e .). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Setuptools finds /app/lib from the base image and errors on "multiple top-level packages in flat-layout." Setting packages=[] tells it this is deps-only, no package to install. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The --cache-from registry pattern requires pulling the full previous image, which combined with the base image exhausts runner disk space. On ephemeral runners without persistent Docker cache, these pulls provide minimal benefit anyway. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The surya-base image (CUDA + PROJ/GDAL + 2GB model + Python 3.12) leaves insufficient room on ubuntu-latest runners for the Surya library installation. This will not be needed on custom runners with more disk space. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- .dockerignore at root and pipelines/ excludes .git, __pycache__, .venv, docs, helm charts, etc. from Docker build context
- Remove COPY + uv pip install of empty requirements.txt from floods, burn_scars, crop_classification Dockerfiles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- actions/checkout: v4 → v6
- aws-actions/configure-aws-credentials: v4 → v6
- actions/upload-artifact: v4 → v7
- actions/download-artifact: v4 → v8
- jlumbroso/free-disk-space: @main → v1.3.1

Node.js 20 is deprecated and will be removed from runners on September 16, 2026. Node.js 24 becomes default June 2, 2026.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New detect-changes job checks which files changed and gates each build accordingly:
- base: runs if base files or any service files changed
- inference: only if Dockerfile/requirements/src changed
- services (matrix): each checks its own change_key
- surya: only if pipelines/surya/ changed
- force_all input overrides detection and builds everything

Changes to docs, workflows, or unrelated files skip all builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
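The gating rules above can be sketched in Python. This is an illustration only, not the workflow's actual implementation: the path patterns, the `CHANGE_PATTERNS` table, and the `detect_changes` function name are hypothetical, and the sketch assumes "any service files" means the three matrix services.

```python
from fnmatch import fnmatch

# Hypothetical path patterns per build target; the real workflow's
# filters are not shown in this PR text.
CHANGE_PATTERNS = {
    "base": ["pipelines/Dockerfile.base", "pipelines/shared/*"],
    "inference": ["Dockerfile", "requirements*.txt", "src/*"],
    "floods": ["pipelines/floods/*"],
    "burn_scars": ["pipelines/burn_scars/*"],
    "crop_classification": ["pipelines/crop_classification/*"],
    "surya": ["pipelines/surya/*"],
}

# Assumption: these are the matrix services whose changes also trigger base.
SERVICES = ("floods", "burn_scars", "crop_classification")


def detect_changes(changed_files, force_all=False):
    """Map a list of changed file paths to per-target build flags."""
    if force_all:
        # force_all overrides detection and builds everything.
        return {target: True for target in CHANGE_PATTERNS}

    flags = {
        target: any(
            fnmatch(path, pattern)
            for path in changed_files
            for pattern in patterns
        )
        for target, patterns in CHANGE_PATTERNS.items()
    }
    # base runs if base files or any service files changed.
    flags["base"] = flags["base"] or any(flags[s] for s in SERVICES)
    return flags
```

Note that `fnmatch` lets `*` match across `/`, so `pipelines/floods/*` also covers nested paths such as `pipelines/floods/lib/`. A docs-only change matches no pattern, so every flag stays false and all builds are skipped.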
Services now fall back to base:latest from ECR when build-base is skipped. This avoids a 6-9 min base rebuild when only service-specific files changed (e.g., floods/lib/). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
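A minimal Python sketch of the fallback, combined with the earlier fix of emitting only repo:tag and prepending the ECR URL in the consumer. The function name `resolve_base_image`, its parameters, and the result strings are assumptions made for illustration; the workflow itself expresses this in GitHub Actions syntax.

```python
def resolve_base_image(ecr_url, build_base_result, built_ref="",
                       fallback_ref="base:latest"):
    """Return the full base image reference a service build should use.

    ecr_url           -- registry URL, kept out of job outputs because it is a secret
    build_base_result -- result of the build-base job ("success", "skipped", ...)
    built_ref         -- repo:tag emitted by build-base when it ran
    fallback_ref      -- repo:tag to pull from ECR when build-base was skipped
    """
    if build_base_result == "success" and built_ref:
        ref = built_ref
    elif build_base_result == "skipped":
        # Only service-specific files changed: reuse the published base image.
        ref = fallback_ref
    else:
        raise RuntimeError(
            f"build-base ended with result {build_base_result!r}; "
            "refusing to build services"
        )
    # The consumer prepends the registry URL itself.
    return f"{ecr_url.rstrip('/')}/{ref}"
```

Raising on any other result keeps a failed base build from silently falling back to a stale image.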
- Switch all jobs from ubuntu-latest to self-hosted runner
- Re-enable push trigger on dev branch
- Remove free-disk-space workarounds (not needed on custom runner)
- Revert test-only changes to floods, README, predictor.py

The runner label is set to 'self-hosted' as a placeholder; update to your specific runner label when known.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolved conflicts:
- Dockerfile: took dev's entrypoint.sh approach
- build_and_push.sh: took dev's version (superseded by workflow)
- pipelines/{floods,burn_scars,crop_classification}/Dockerfile: kept
our optimization (shared code in base) + added dev's new
inference_worker.py and per-service entrypoint.sh
- src/api/v1/inferences.py: took dev's formatting + EMPTY_RESPONSE
- pipelines/Dockerfile.base: added inference_worker.py to shared code
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements disk space clearing logic.
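As a rough illustration of what a disk space check before a build might look like, here is a small Python sketch using the standard library. The function name `has_free_space` and the 20 GB default threshold are assumptions; the PR's actual logic lives in the build script and workflow and is not reproduced here.

```python
import shutil


def has_free_space(path="/", min_free_gb=20.0):
    """Return True if the filesystem holding `path` has at least min_free_gb free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    free_gb = usage.free / 1024 ** 3
    return free_gb >= min_free_gb
```

A build script could call this before pulling large images and run its cleanup steps (old image removal, pruning dangling images) only when the check fails.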