From 3d4ebd67906637ec43bfec9524461c315abd1151 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Mon, 16 Mar 2026 18:52:54 -0700 Subject: [PATCH 01/15] refactor(build): unify image build graph for cache reuse Signed-off-by: Drew Newberry --- .../skills/debug-openshell-cluster/SKILL.md | 2 +- AGENTS.md | 2 +- architecture/build-containers.md | 8 +- architecture/gateway-single-node.md | 8 +- deploy/docker/Dockerfile.cluster | 272 ------------------ deploy/docker/Dockerfile.gateway | 107 ------- deploy/docker/Dockerfile.images | 232 +++++++++++++++ tasks/scripts/cluster-deploy-fast.sh | 35 +-- tasks/scripts/docker-build-cluster.sh | 79 +---- tasks/scripts/docker-build-component.sh | 180 ++---------- tasks/scripts/docker-build-image.sh | 160 +++++++++++ tasks/scripts/docker-publish-multiarch.sh | 210 +++----------- 12 files changed, 482 insertions(+), 813 deletions(-) delete mode 100644 deploy/docker/Dockerfile.cluster delete mode 100644 deploy/docker/Dockerfile.gateway create mode 100644 deploy/docker/Dockerfile.images create mode 100755 tasks/scripts/docker-build-image.sh diff --git a/.agents/skills/debug-openshell-cluster/SKILL.md b/.agents/skills/debug-openshell-cluster/SKILL.md index 115a2aa5..ceb8ae84 100644 --- a/.agents/skills/debug-openshell-cluster/SKILL.md +++ b/.agents/skills/debug-openshell-cluster/SKILL.md @@ -312,7 +312,7 @@ If DNS is broken, all image pulls from the distribution registry will fail, as w | `metrics-server` errors in logs | Normal k3s noise, not the root cause | These errors are benign — look for the actual failing health check component | | Stale NotReady nodes from previous deploys | Volume reused across container recreations | The deploy flow now auto-cleans stale nodes; if it still fails, manually delete NotReady nodes (see Step 2) or choose "Recreate" when prompted | | gRPC `UNIMPLEMENTED` for newer RPCs in push mode | Helm values still point at older pulled images instead of the pushed refs | Verify rendered 
`openshell-helmchart.yaml` uses the expected push refs (`server`, `sandbox`, `pki-job`) and not `:latest` | -| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` stage. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker | +| Sandbox pods crash with `/opt/openshell/bin/openshell-sandbox: no such file or directory` | Supervisor binary missing from cluster image | The cluster image was built/published without the `supervisor-builder` target in `deploy/docker/Dockerfile.images`. Rebuild with `mise run docker:build:cluster` and recreate gateway. Bootstrap auto-detects via `HEALTHCHECK_MISSING_SUPERVISOR` marker | | `HEALTHCHECK_MISSING_SUPERVISOR` in health check logs | `/opt/openshell/bin/openshell-sandbox` not found in gateway container | Rebuild cluster image: `mise run docker:build:cluster`, then `openshell gateway destroy && openshell gateway start` | ## Full Diagnostic Dump diff --git a/AGENTS.md b/AGENTS.md index 8e31c4ac..79dc29d1 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -99,7 +99,7 @@ These pipelines connect skills into end-to-end workflows. Individual skill files ## Cluster Infrastructure Changes -- If you change cluster bootstrap infrastructure (e.g., `openshell-bootstrap` crate, `Dockerfile.cluster`, `cluster-entrypoint.sh`, `cluster-healthcheck.sh`, deploy logic in `openshell-cli`), update the `debug-openshell-cluster` skill in `.agents/skills/debug-openshell-cluster/SKILL.md` to reflect those changes. 
+- If you change cluster bootstrap infrastructure (e.g., `openshell-bootstrap` crate, `deploy/docker/Dockerfile.images`, `cluster-entrypoint.sh`, `cluster-healthcheck.sh`, deploy logic in `openshell-cli`), update the `debug-openshell-cluster` skill in `.agents/skills/debug-openshell-cluster/SKILL.md` to reflect those changes. ## Documentation diff --git a/architecture/build-containers.md b/architecture/build-containers.md index 705b00d6..2e0d664d 100644 --- a/architecture/build-containers.md +++ b/architecture/build-containers.md @@ -6,7 +6,7 @@ OpenShell produces two container images, both published for `linux/amd64` and `l The gateway runs the control plane API server. It is deployed as a StatefulSet inside the cluster container via a bundled Helm chart. -- **Dockerfile**: `deploy/docker/Dockerfile.gateway` +- **Docker target**: `gateway` in `deploy/docker/Dockerfile.images` - **Registry**: `ghcr.io/nvidia/openshell/gateway:latest` - **Pulled when**: Cluster startup (the Helm chart triggers the pull) - **Entrypoint**: `openshell-server --port 8080` (gRPC + HTTP, mTLS) @@ -15,11 +15,11 @@ The gateway runs the control plane API server. It is deployed as a StatefulSet i The cluster image is a single-container Kubernetes distribution that bundles the Helm charts, Kubernetes manifests, and the `openshell-sandbox` supervisor binary needed to bootstrap the control plane. -- **Dockerfile**: `deploy/docker/Dockerfile.cluster` +- **Docker target**: `cluster` in `deploy/docker/Dockerfile.images` - **Registry**: `ghcr.io/nvidia/openshell/cluster:latest` - **Pulled when**: `openshell gateway start` -The supervisor binary (`openshell-sandbox`) is cross-compiled in a build stage and placed at `/opt/openshell/bin/openshell-sandbox`. It is exposed to sandbox pods at runtime via a read-only `hostPath` volume mount — it is not baked into sandbox images. 
+The supervisor binary (`openshell-sandbox`) is built by the shared `supervisor-builder` stage in `deploy/docker/Dockerfile.images` and placed at `/opt/openshell/bin/openshell-sandbox`. It is exposed to sandbox pods at runtime via a read-only `hostPath` volume mount — it is not baked into sandbox images. ## Sandbox Images @@ -42,7 +42,7 @@ The incremental deploy (`cluster-deploy-fast.sh`) fingerprints local Git changes | Changed files | Rebuild triggered | |---|---| | Cargo manifests, proto definitions, cross-build script | Gateway + supervisor | -| `crates/openshell-server/*`, `Dockerfile.gateway` | Gateway | +| `crates/openshell-server/*`, `deploy/docker/Dockerfile.images` | Gateway | | `crates/openshell-sandbox/*`, `crates/openshell-policy/*` | Supervisor | | `deploy/helm/openshell/*` | Helm upgrade | diff --git a/architecture/gateway-single-node.md b/architecture/gateway-single-node.md index 8dc270ac..2b911d80 100644 --- a/architecture/gateway-single-node.md +++ b/architecture/gateway-single-node.md @@ -29,7 +29,7 @@ Out of scope: - `crates/openshell-bootstrap/src/push.rs`: Local development image push into k3s containerd. - `crates/openshell-bootstrap/src/paths.rs`: XDG path resolution. - `crates/openshell-bootstrap/src/constants.rs`: Shared constants (image name, container/volume/network naming). -- `deploy/docker/Dockerfile.cluster`: Container image definition (k3s base + Helm charts + manifests + entrypoint). +- `deploy/docker/Dockerfile.images` (target `cluster`): Container image definition (k3s base + Helm charts + manifests + entrypoint). - `deploy/docker/cluster-entrypoint.sh`: Container entrypoint (DNS proxy, registry config, manifest injection). - `deploy/docker/cluster-healthcheck.sh`: Docker HEALTHCHECK script. 
- Docker daemon(s): @@ -228,7 +228,7 @@ After deploy, the CLI calls `save_active_gateway(name)`, writing the gateway nam ## Container Image -The gateway image is defined in `deploy/docker/Dockerfile.cluster`: +The cluster image is defined by target `cluster` in `deploy/docker/Dockerfile.images`: ``` Base: rancher/k3s:v1.35.2-k3s1 @@ -298,7 +298,7 @@ GPU support is part of the single-node gateway bootstrap path rather than a sepa - `openshell gateway start --gpu` threads a boolean deploy option through `crates/openshell-cli`, `crates/openshell-bootstrap`, and `crates/openshell-bootstrap/src/docker.rs`. - When enabled, the cluster container is created with Docker `DeviceRequests`, which is the API equivalent of `docker run --gpus all`. -- `deploy/docker/Dockerfile.cluster` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image. +- `deploy/docker/Dockerfile.images` installs NVIDIA Container Toolkit packages in a dedicated Ubuntu stage and copies the runtime binaries, config, and `libnvidia-container` shared libraries into the final Ubuntu-based cluster image. - `deploy/docker/cluster-entrypoint.sh` checks `GPU_ENABLED=true` and copies GPU-only manifests from `/opt/openshell/gpu-manifests/` into k3s's manifests directory. - `deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml` installs the NVIDIA device plugin chart, currently pinned to `0.18.2`, along with GPU Feature Discovery and Node Feature Discovery. - k3s auto-detects `nvidia-container-runtime` on `PATH`, registers the `nvidia` containerd runtime, and creates the `nvidia` `RuntimeClass` automatically. 
@@ -454,7 +454,7 @@ openshell/ - `crates/openshell-cli/src/main.rs` -- CLI command definitions - `crates/openshell-cli/src/run.rs` -- CLI command implementations - `crates/openshell-cli/src/bootstrap.rs` -- auto-bootstrap from sandbox create -- `deploy/docker/Dockerfile.cluster` -- container image definition +- `deploy/docker/Dockerfile.images` -- shared image build definition (cluster target) - `deploy/docker/cluster-entrypoint.sh` -- container entrypoint script - `deploy/docker/cluster-healthcheck.sh` -- Docker HEALTHCHECK script - `deploy/kube/manifests/openshell-helmchart.yaml` -- OpenShell Helm chart manifest diff --git a/deploy/docker/Dockerfile.cluster b/deploy/docker/Dockerfile.cluster deleted file mode 100644 index 56084076..00000000 --- a/deploy/docker/Dockerfile.cluster +++ /dev/null @@ -1,272 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# k3s cluster image with OpenShell Helm charts and manifests -# -# Multi-stage build: extracts k3s artifacts from the upstream rancher/k3s -# Alpine image and layers them onto the NVIDIA Ubuntu base image. -# -# This image includes: -# - k3s binary and all supporting binaries (containerd-shim, runc, CNI, etc.) -# - k9s for interactive cluster debugging (via `openshell doctor exec -- k9s`) -# - openshell-sandbox supervisor binary (side-loaded into sandbox pods via hostPath) -# - Packaged OpenShell Helm chart -# - HelmChart CR for auto-deploying OpenShell -# - Custom entrypoint for DNS configuration in Docker environments -# -# The gateway image (openshell/gateway) is pulled at runtime from the -# distribution registry. Sandbox images are pulled from the community registry -# (ghcr.io/nvidia/openshell-community/sandboxes). The supervisor binary is -# embedded in this cluster image and exposed to sandbox pods via a hostPath -# volume mount. 
-# Registry credentials are generated by the entrypoint script at container start. -# -# The helm charts are built by the docker:build:cluster mise task -# and placed in deploy/docker/.build/ before this Dockerfile is built. - -# Tracked upstream vulns in rancher/k3s:v1.35.2-k3s1 (bundled Go dependencies): -# GHSA-pwhc-rpq9-4c8w containerd v2.1.5-k3s1 (local privesc via CRI dir perms; -# upstream patched in 2.1.5 -- may be scanner false positive -# from the -k3s1 suffix) -# GHSA-p436-gjf2-799p docker/cli v28.3.2 (Windows-only plugin path hijack; N/A) -# GHSA-9h8m-3fm2-qjrq otel/sdk v1.39.0 (macOS-only PATH hijack; N/A for Linux) -# CVE-2024-36623 docker/docker v25.0.8 (streamformatter race condition) -# Bump K3S_VERSION when a release with updated dependencies ships. - -ARG K3S_VERSION=v1.35.2-k3s1 -ARG K9S_VERSION=v0.50.18 -ARG HELM_VERSION=v3.17.3 -ARG NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.2-1 - -# --------------------------------------------------------------------------- -# Stage 1: Extract k3s artifacts from upstream rancher image (Alpine-based) -# --------------------------------------------------------------------------- -FROM rancher/k3s:${K3S_VERSION} AS k3s - -# --------------------------------------------------------------------------- -# Stage 1b: Download k9s binary for interactive cluster debugging -# --------------------------------------------------------------------------- -FROM ubuntu:24.04 AS k9s -ARG K9S_VERSION -ARG TARGETARCH -RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates && \ - curl -fsSL "https://github.com/derailed/k9s/releases/download/${K9S_VERSION}/k9s_Linux_${TARGETARCH}.tar.gz" \ - | tar xz -C /tmp k9s && \ - chmod +x /tmp/k9s && \ - rm -rf /var/lib/apt/lists/* - -# --------------------------------------------------------------------------- -# Stage 1c: Download helm binary for in-container chart upgrades -# --------------------------------------------------------------------------- -FROM 
ubuntu:24.04 AS helm -ARG HELM_VERSION -ARG TARGETARCH -RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates && \ - curl -fsSL "https://get.helm.sh/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz" \ - | tar xz --strip-components=1 -C /tmp "linux-${TARGETARCH}/helm" && \ - chmod +x /tmp/helm && \ - rm -rf /var/lib/apt/lists/* - -# --------------------------------------------------------------------------- -# Stage 1d: Build openshell-sandbox supervisor binary -# --------------------------------------------------------------------------- -# The supervisor binary runs inside every sandbox pod. It is built here and -# placed on the k3s node filesystem at /opt/openshell/bin/openshell-sandbox, -# then mounted into sandbox pods via a read-only hostPath volume. -FROM --platform=$BUILDPLATFORM rust:1.88-slim AS supervisor-builder -ARG TARGETARCH -ARG BUILDARCH -ARG CARGO_TARGET_CACHE_SCOPE=default -ARG SCCACHE_MEMCACHED_ENDPOINT - -# Install build dependencies -RUN apt-get update && apt-get install -y --no-install-recommends \ - cmake g++ make protobuf-compiler curl && rm -rf /var/lib/apt/lists/* - -# Install cross-compilation toolchain, sccache, + Rust target (no-ops for native builds) -COPY deploy/docker/cross-build.sh /usr/local/bin/ -RUN . 
cross-build.sh && install_cross_toolchain && install_sccache && add_rust_target - -WORKDIR /build - -# Copy dependency manifests first for better caching -COPY Cargo.toml Cargo.lock ./ -COPY crates/openshell-cli/Cargo.toml crates/openshell-cli/Cargo.toml -COPY crates/openshell-core/Cargo.toml crates/openshell-core/Cargo.toml -COPY crates/openshell-policy/Cargo.toml crates/openshell-policy/Cargo.toml -COPY crates/openshell-providers/Cargo.toml crates/openshell-providers/Cargo.toml -COPY crates/openshell-router/Cargo.toml crates/openshell-router/Cargo.toml -COPY crates/openshell-sandbox/Cargo.toml crates/openshell-sandbox/Cargo.toml -COPY crates/openshell-server/Cargo.toml crates/openshell-server/Cargo.toml -COPY crates/openshell-bootstrap/Cargo.toml crates/openshell-bootstrap/Cargo.toml - -# Create dummy source files to build dependencies -RUN mkdir -p crates/openshell-cli/src crates/openshell-core/src crates/openshell-policy/src \ - crates/openshell-providers/src crates/openshell-router/src crates/openshell-sandbox/src \ - crates/openshell-server/src crates/openshell-bootstrap/src && \ - echo "fn main() {}" > crates/openshell-cli/src/main.rs && \ - echo "fn main() {}" > crates/openshell-sandbox/src/main.rs && \ - echo "fn main() {}" > crates/openshell-server/src/main.rs && \ - touch crates/openshell-core/src/lib.rs && \ - touch crates/openshell-policy/src/lib.rs && \ - touch crates/openshell-providers/src/lib.rs && \ - touch crates/openshell-router/src/lib.rs && \ - touch crates/openshell-bootstrap/src/lib.rs - -# Copy proto files needed for build -COPY proto/ proto/ - -# Build dependencies only (cached unless Cargo.toml/lock changes) -RUN --mount=type=cache,id=cargo-registry-supervisor-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \ - --mount=type=cache,id=cargo-target-supervisor-${TARGETARCH}-${CARGO_TARGET_CACHE_SCOPE},sharing=locked,target=/build/target \ - --mount=type=cache,id=sccache-supervisor-${TARGETARCH},sharing=locked,target=/tmp/sccache 
\ - . cross-build.sh && cargo_cross_build -p openshell-sandbox 2>/dev/null || true - -# Copy actual source code -COPY crates/ crates/ - -# Touch source files to ensure they're rebuilt (not the cached dummy) -RUN touch crates/openshell-sandbox/src/main.rs \ - crates/openshell-core/build.rs \ - proto/*.proto - -# Build the supervisor binary -# Declare version ARG here (not earlier) so the git-hash-bearing value does not -# invalidate the expensive dependency-build layers above on every commit. -ARG OPENSHELL_CARGO_VERSION -RUN --mount=type=cache,id=cargo-registry-supervisor-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \ - --mount=type=cache,id=cargo-target-supervisor-${TARGETARCH}-${CARGO_TARGET_CACHE_SCOPE},sharing=locked,target=/build/target \ - --mount=type=cache,id=sccache-supervisor-${TARGETARCH},sharing=locked,target=/tmp/sccache \ - . cross-build.sh && \ - if [ -n "${OPENSHELL_CARGO_VERSION:-}" ]; then \ - sed -i -E '/^\[workspace\.package\]/,/^\[/{s/^version[[:space:]]*=[[:space:]]*".*"/version = "'"${OPENSHELL_CARGO_VERSION}"'"/}' Cargo.toml; \ - fi && \ - cargo_cross_build --release -p openshell-sandbox && \ - mkdir -p /build/out && \ - cp "$(cross_output_dir release)/openshell-sandbox" /build/out/ - -# --------------------------------------------------------------------------- -# Stage 2: Install NVIDIA container toolkit on Ubuntu -# --------------------------------------------------------------------------- -FROM ubuntu:24.04 AS nvidia-toolkit - -ARG NVIDIA_CONTAINER_TOOLKIT_VERSION - -RUN apt-get update && apt-get install -y --no-install-recommends \ - gpg curl ca-certificates && \ - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \ - | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \ - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \ - | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \ 
- | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && \ - apt-get update && \ - apt-get install -y --no-install-recommends \ - "nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION}" \ - "nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION}" \ - "libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION}" \ - "libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}" && \ - rm -rf /var/lib/apt/lists/* - -# --------------------------------------------------------------------------- -# Stage 3: Runtime on NVIDIA hardened Ubuntu base -# --------------------------------------------------------------------------- -FROM nvcr.io/nvidia/base/ubuntu:noble-20251013 - -# Install runtime dependencies that k3s expects from the host OS. -# - iptables: used by flannel/kube-proxy for network policy and NAT rules -# - mount/umount: needed by kubelet for volume mounts (provided by mount package) -# - ca-certificates: TLS verification for registry pulls -# - conntrack: k3s/kube-proxy uses conntrack for connection tracking -# - dnsutils: nslookup used by entrypoint/healthcheck for DNS probe -RUN apt-get update && apt-get install -y --no-install-recommends \ - ca-certificates \ - iptables \ - mount \ - dnsutils \ - && rm -rf /var/lib/apt/lists/* - -# Copy the full /bin directory from k3s (contains all statically-linked -# binaries and their symlinks: k3s, kubectl, crictl, ctr, containerd, -# containerd-shim-runc-v2, runc, cni plugins, busybox, coreutils, -# ip, ipset, conntrack, nsenter, pigz, etc.) -COPY --from=k3s /bin/ /bin/ - -# Copy k9s binary for interactive cluster debugging via `openshell doctor exec -- k9s` -COPY --from=k9s /tmp/k9s /usr/local/bin/k9s - -# Copy helm binary for in-container chart upgrades (used by cluster-deploy-fast.sh) -COPY --from=helm /tmp/helm /usr/local/bin/helm - -# Copy iptables/nftables tooling (xtables-nft-multi, iptables-detect.sh, etc.) -# These are in /bin/aux/ in the k3s image and must be on PATH. 
-# Note: the Ubuntu iptables package provides /usr/sbin/iptables, but k3s -# expects its own bundled version at /bin/aux/iptables. Both are on PATH; -# k3s finds its copy via /bin/aux in PATH. - -# Copy CA certificates from k3s (bundled Alpine CA bundle). -# The Ubuntu ca-certificates package also installs certs; having both is fine. -COPY --from=k3s /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/k3s-ca-certificates.crt - -# Copy timezone data used by k3s/Go for time.LoadLocation -COPY --from=k3s /usr/share/zoneinfo/ /usr/share/zoneinfo/ - -# Set environment variables matching the upstream k3s image. -# PATH includes /bin/aux for iptables tooling and /var/lib/rancher/k3s/data/cni -# for runtime-extracted CNI binaries. -ENV PATH="/var/lib/rancher/k3s/data/cni:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin/aux" \ - CRI_CONFIG_FILE="/var/lib/rancher/k3s/agent/etc/crictl.yaml" - -# Copy NVIDIA Container Toolkit files from the build stage. -# k3s auto-detects nvidia-container-runtime on PATH and registers it as a -# containerd runtime + creates the "nvidia" RuntimeClass automatically. -COPY --from=nvidia-toolkit /usr/bin/nvidia-cdi-hook /usr/bin/ -COPY --from=nvidia-toolkit /usr/bin/nvidia-container-runtime /usr/bin/ -COPY --from=nvidia-toolkit /usr/bin/nvidia-container-runtime-hook /usr/bin/ -COPY --from=nvidia-toolkit /usr/bin/nvidia-container-cli /usr/bin/ -COPY --from=nvidia-toolkit /usr/bin/nvidia-ctk /usr/bin/ -COPY --from=nvidia-toolkit /etc/nvidia-container-runtime /etc/nvidia-container-runtime -COPY --from=nvidia-toolkit /usr/lib/*-linux-gnu/libnvidia-container*.so* /usr/lib/ - -# Copy the openshell-sandbox supervisor binary to the node filesystem. -# Sandbox pods mount /opt/openshell/bin as a read-only hostPath volume -# to side-load the supervisor without baking it into every sandbox image. 
-COPY --from=supervisor-builder /build/out/openshell-sandbox /opt/openshell/bin/openshell-sandbox - -# Create directories for manifests, charts, and configuration -RUN mkdir -p /var/lib/rancher/k3s/server/manifests \ - /var/lib/rancher/k3s/server/static/charts \ - /etc/rancher/k3s \ - /opt/openshell/manifests \ - /opt/openshell/charts \ - /opt/openshell/gpu-manifests \ - /run/flannel - -# Copy entrypoint script that configures DNS for Docker environments -# This script detects the host gateway IP and configures CoreDNS to use it -COPY deploy/docker/cluster-entrypoint.sh /usr/local/bin/cluster-entrypoint.sh -RUN chmod +x /usr/local/bin/cluster-entrypoint.sh - -# Copy healthcheck script that verifies cluster readiness -COPY deploy/docker/cluster-healthcheck.sh /usr/local/bin/cluster-healthcheck.sh -RUN chmod +x /usr/local/bin/cluster-healthcheck.sh - -# Registry credentials for pulling component images at runtime are generated -# by the entrypoint script at /etc/rancher/k3s/registries.yaml. - -# Copy packaged helm charts to a staging directory that won't be -# overwritten by the /var/lib/rancher/k3s volume mount. The entrypoint -# script copies them into the k3s static charts directory at container start. -COPY deploy/docker/.build/charts/*.tgz /opt/openshell/charts/ - -# Copy Kubernetes manifests to a persistent location that won't be overwritten by the volume mount. -# The bootstrap code will copy these to /var/lib/rancher/k3s/server/manifests/ after cluster start. 
-COPY deploy/kube/manifests/*.yaml /opt/openshell/manifests/ - -# Copy GPU-specific manifests (deployed conditionally by entrypoint when GPU_ENABLED=true) -COPY deploy/kube/gpu-manifests/*.yaml /opt/openshell/gpu-manifests/ - -# Use custom entrypoint that configures DNS before starting k3s -ENTRYPOINT ["/usr/local/bin/cluster-entrypoint.sh"] - -HEALTHCHECK --interval=5s --timeout=5s --start-period=20s --retries=60 \ - CMD ["/usr/local/bin/cluster-healthcheck.sh"] diff --git a/deploy/docker/Dockerfile.gateway b/deploy/docker/Dockerfile.gateway deleted file mode 100644 index acf186c6..00000000 --- a/deploy/docker/Dockerfile.gateway +++ /dev/null @@ -1,107 +0,0 @@ -# syntax=docker/dockerfile:1.4 - -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -# OpenShell Gateway Docker image -# Multi-stage build with cross-compilation support for multi-arch - -# Stage 1: Rust builder (runs on build platform, cross-compiles for target) -FROM --platform=$BUILDPLATFORM rust:1.88-slim AS builder -ARG TARGETARCH -ARG BUILDARCH -ARG CARGO_TARGET_CACHE_SCOPE=default - -# Install build dependencies -RUN apt-get update && apt-get install -y --no-install-recommends \ - cmake g++ make protobuf-compiler curl && rm -rf /var/lib/apt/lists/* - -# Install cross-compilation toolchain, sccache, + Rust target (no-ops for native builds) -COPY deploy/docker/cross-build.sh /usr/local/bin/ -RUN . 
cross-build.sh && install_cross_toolchain && install_sccache && add_rust_target - -ARG SCCACHE_MEMCACHED_ENDPOINT - -WORKDIR /build - -# Copy dependency manifests first for better caching -COPY Cargo.toml Cargo.lock ./ -COPY crates/openshell-cli/Cargo.toml crates/openshell-cli/Cargo.toml -COPY crates/openshell-core/Cargo.toml crates/openshell-core/Cargo.toml -COPY crates/openshell-providers/Cargo.toml crates/openshell-providers/Cargo.toml -COPY crates/openshell-router/Cargo.toml crates/openshell-router/Cargo.toml -COPY crates/openshell-sandbox/Cargo.toml crates/openshell-sandbox/Cargo.toml -COPY crates/openshell-server/Cargo.toml crates/openshell-server/Cargo.toml -COPY crates/openshell-bootstrap/Cargo.toml crates/openshell-bootstrap/Cargo.toml - -# Create dummy source files to build dependencies -RUN mkdir -p crates/openshell-cli/src crates/openshell-core/src crates/openshell-providers/src crates/openshell-router/src crates/openshell-sandbox/src crates/openshell-server/src crates/openshell-bootstrap/src && \ - echo "fn main() {}" > crates/openshell-cli/src/main.rs && \ - echo "fn main() {}" > crates/openshell-sandbox/src/main.rs && \ - echo "fn main() {}" > crates/openshell-server/src/main.rs && \ - touch crates/openshell-core/src/lib.rs && \ - touch crates/openshell-providers/src/lib.rs && \ - touch crates/openshell-router/src/lib.rs && \ - touch crates/openshell-bootstrap/src/lib.rs - -# Copy proto files needed for build -COPY proto/ proto/ - -# Build dependencies only (cached unless Cargo.toml/lock changes). -# sccache uses memcached in CI (SCCACHE_MEMCACHED_ENDPOINT) or the local -# disk cache mount for local dev builds. The cargo-target mount gives cargo -# a persistent target/ dir for true incremental rebuilds on source changes. 
-RUN --mount=type=cache,id=cargo-registry-gateway-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \ - --mount=type=cache,id=cargo-target-gateway-${TARGETARCH}-${CARGO_TARGET_CACHE_SCOPE},sharing=locked,target=/build/target \ - --mount=type=cache,id=sccache-gateway-${TARGETARCH},sharing=locked,target=/tmp/sccache \ - . cross-build.sh && cargo_cross_build --release -p openshell-server 2>/dev/null || true - -# Copy actual source code -COPY crates/ crates/ - -# Touch source files to ensure they're rebuilt (not the cached dummy). -# Touch build.rs and proto files to force proto code regeneration when the -# cargo target cache mount retains stale OUT_DIR artifacts from prior builds. -RUN touch crates/openshell-server/src/main.rs \ - crates/openshell-core/build.rs \ - proto/*.proto - -# Build the actual application -# Declare version ARG here (not earlier) so the git-hash-bearing value does not -# invalidate the expensive dependency-build layers above on every commit. -ARG OPENSHELL_CARGO_VERSION -RUN --mount=type=cache,id=cargo-registry-gateway-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \ - --mount=type=cache,id=cargo-target-gateway-${TARGETARCH}-${CARGO_TARGET_CACHE_SCOPE},sharing=locked,target=/build/target \ - --mount=type=cache,id=sccache-gateway-${TARGETARCH},sharing=locked,target=/tmp/sccache \ - . cross-build.sh && \ - if [ -n "${OPENSHELL_CARGO_VERSION:-}" ]; then \ - sed -i -E '/^\[workspace\.package\]/,/^\[/{s/^version[[:space:]]*=[[:space:]]*".*"/version = "'"${OPENSHELL_CARGO_VERSION}"'"/}' Cargo.toml; \ - fi && \ - cargo_cross_build --release -p openshell-server && \ - cp "$(cross_output_dir release)/openshell-server" /build/openshell-server - -# Stage 2: Runtime (uses target platform) -# NVIDIA hardened Ubuntu base for supply chain consistency. 
-FROM nvcr.io/nvidia/base/ubuntu:noble-20251013 AS runtime - -RUN apt-get update && apt-get install -y --no-install-recommends \ - ca-certificates && rm -rf /var/lib/apt/lists/* - -RUN useradd --create-home --user-group openshell - -WORKDIR /app - -COPY --from=builder /build/openshell-server /usr/local/bin/ - -# Copy migrations to the build-time manifest directory expected by sqlx -RUN mkdir -p /build/crates/openshell-server -COPY crates/openshell-server/migrations /build/crates/openshell-server/migrations - -USER openshell -EXPOSE 8080 - -# Health checks are handled by Kubernetes liveness/readiness probes (tcpSocket). -# No Docker HEALTHCHECK is needed since this image runs inside a k3s cluster. - -ENTRYPOINT ["openshell-server"] -CMD ["--port", "8080"] diff --git a/deploy/docker/Dockerfile.images b/deploy/docker/Dockerfile.images new file mode 100644 index 00000000..84edf449 --- /dev/null +++ b/deploy/docker/Dockerfile.images @@ -0,0 +1,232 @@ +# syntax=docker/dockerfile:1.4 + +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +# Shared OpenShell image build graph. 
+# +# Targets: +# gateway Final gateway image +# cluster Final cluster image +# gateway-builder Release openshell-server binary +# supervisor-builder Release openshell-sandbox binary + +ARG K3S_VERSION=v1.35.2-k3s1 +ARG K9S_VERSION=v0.50.18 +ARG HELM_VERSION=v3.17.3 +ARG NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.2-1 + +# --------------------------------------------------------------------------- +# Shared Rust build stages +# --------------------------------------------------------------------------- +FROM --platform=$BUILDPLATFORM rust:1.88-slim AS rust-builder-base +ARG TARGETARCH +ARG BUILDARCH +ARG OPENSHELL_CARGO_VERSION +ARG CARGO_TARGET_CACHE_SCOPE=default +ARG SCCACHE_MEMCACHED_ENDPOINT + +RUN apt-get update && apt-get install -y --no-install-recommends \ + cmake g++ make protobuf-compiler curl && rm -rf /var/lib/apt/lists/* + +COPY deploy/docker/cross-build.sh /usr/local/bin/ +RUN . cross-build.sh && install_cross_toolchain && install_sccache && add_rust_target + +WORKDIR /build + +FROM rust-builder-base AS rust-builder-skeleton + +COPY Cargo.toml Cargo.lock ./ +COPY crates/openshell-bootstrap/Cargo.toml crates/openshell-bootstrap/Cargo.toml +COPY crates/openshell-cli/Cargo.toml crates/openshell-cli/Cargo.toml +COPY crates/openshell-core/Cargo.toml crates/openshell-core/Cargo.toml +COPY crates/openshell-policy/Cargo.toml crates/openshell-policy/Cargo.toml +COPY crates/openshell-providers/Cargo.toml crates/openshell-providers/Cargo.toml +COPY crates/openshell-router/Cargo.toml crates/openshell-router/Cargo.toml +COPY crates/openshell-sandbox/Cargo.toml crates/openshell-sandbox/Cargo.toml +COPY crates/openshell-server/Cargo.toml crates/openshell-server/Cargo.toml +COPY crates/openshell-tui/Cargo.toml crates/openshell-tui/Cargo.toml +COPY crates/openshell-core/build.rs crates/openshell-core/build.rs +COPY proto/ proto/ + +RUN mkdir -p \ + crates/openshell-bootstrap/src \ + crates/openshell-cli/src \ + crates/openshell-core/src \ + crates/openshell-policy/src \ + 
crates/openshell-providers/src \ + crates/openshell-router/src \ + crates/openshell-sandbox/src \ + crates/openshell-server/src \ + crates/openshell-tui/src && \ + touch crates/openshell-bootstrap/src/lib.rs && \ + printf 'fn main() {}\n' > crates/openshell-cli/src/main.rs && \ + touch crates/openshell-core/src/lib.rs && \ + touch crates/openshell-policy/src/lib.rs && \ + touch crates/openshell-providers/src/lib.rs && \ + touch crates/openshell-router/src/lib.rs && \ + touch crates/openshell-sandbox/src/lib.rs && \ + printf 'fn main() {}\n' > crates/openshell-sandbox/src/main.rs && \ + touch crates/openshell-server/src/lib.rs && \ + printf 'fn main() {}\n' > crates/openshell-server/src/main.rs && \ + touch crates/openshell-tui/src/lib.rs + +FROM rust-builder-skeleton AS rust-deps + +RUN --mount=type=cache,id=cargo-registry-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \ + --mount=type=cache,id=cargo-git-${TARGETARCH},sharing=locked,target=/usr/local/cargo/git \ + --mount=type=cache,id=cargo-target-${TARGETARCH}-${CARGO_TARGET_CACHE_SCOPE},sharing=locked,target=/build/target \ + --mount=type=cache,id=sccache-${TARGETARCH},sharing=locked,target=/tmp/sccache \ + . 
cross-build.sh && cargo_cross_build --release -p openshell-server -p openshell-sandbox + +FROM rust-deps AS rust-workspace + +COPY crates/ crates/ + +RUN touch \ + crates/openshell-core/build.rs \ + crates/openshell-sandbox/src/main.rs \ + crates/openshell-server/src/main.rs \ + proto/*.proto && \ + if [ -n "${OPENSHELL_CARGO_VERSION:-}" ]; then \ + sed -i -E '/^\[workspace\.package\]/,/^\[/{s/^version[[:space:]]*=[[:space:]]*".*"/version = "'"${OPENSHELL_CARGO_VERSION}"'"/}' Cargo.toml; \ + fi + +FROM rust-workspace AS gateway-builder + +RUN --mount=type=cache,id=cargo-registry-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \ + --mount=type=cache,id=cargo-git-${TARGETARCH},sharing=locked,target=/usr/local/cargo/git \ + --mount=type=cache,id=cargo-target-${TARGETARCH}-${CARGO_TARGET_CACHE_SCOPE},sharing=locked,target=/build/target \ + --mount=type=cache,id=sccache-${TARGETARCH},sharing=locked,target=/tmp/sccache \ + . cross-build.sh && \ + cargo_cross_build --release -p openshell-server && \ + mkdir -p /build/out && \ + cp "$(cross_output_dir release)/openshell-server" /build/out/ + +FROM rust-workspace AS supervisor-builder + +RUN --mount=type=cache,id=cargo-registry-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \ + --mount=type=cache,id=cargo-git-${TARGETARCH},sharing=locked,target=/usr/local/cargo/git \ + --mount=type=cache,id=cargo-target-${TARGETARCH}-${CARGO_TARGET_CACHE_SCOPE},sharing=locked,target=/build/target \ + --mount=type=cache,id=sccache-${TARGETARCH},sharing=locked,target=/tmp/sccache \ + . 
cross-build.sh && \ + cargo_cross_build --release -p openshell-sandbox && \ + mkdir -p /build/out && \ + cp "$(cross_output_dir release)/openshell-sandbox" /build/out/ + +# --------------------------------------------------------------------------- +# Final gateway image +# --------------------------------------------------------------------------- +FROM nvcr.io/nvidia/base/ubuntu:noble-20251013 AS gateway + +RUN apt-get update && apt-get install -y --no-install-recommends \ + ca-certificates && rm -rf /var/lib/apt/lists/* + +RUN useradd --create-home --user-group openshell + +WORKDIR /app + +COPY --from=gateway-builder /build/out/openshell-server /usr/local/bin/ + +RUN mkdir -p /build/crates/openshell-server +COPY crates/openshell-server/migrations /build/crates/openshell-server/migrations + +USER openshell +EXPOSE 8080 + +ENTRYPOINT ["openshell-server"] +CMD ["--port", "8080"] + +# --------------------------------------------------------------------------- +# Cluster asset stages +# --------------------------------------------------------------------------- +FROM rancher/k3s:${K3S_VERSION} AS k3s + +FROM ubuntu:24.04 AS k9s +ARG K9S_VERSION +ARG TARGETARCH +RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates && \ + curl -fsSL "https://github.com/derailed/k9s/releases/download/${K9S_VERSION}/k9s_Linux_${TARGETARCH}.tar.gz" \ + | tar xz -C /tmp k9s && \ + chmod +x /tmp/k9s && \ + rm -rf /var/lib/apt/lists/* + +FROM ubuntu:24.04 AS helm +ARG HELM_VERSION +ARG TARGETARCH +RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates && \ + curl -fsSL "https://get.helm.sh/helm-${HELM_VERSION}-linux-${TARGETARCH}.tar.gz" \ + | tar xz --strip-components=1 -C /tmp "linux-${TARGETARCH}/helm" && \ + chmod +x /tmp/helm && \ + rm -rf /var/lib/apt/lists/* + +FROM ubuntu:24.04 AS nvidia-toolkit +ARG NVIDIA_CONTAINER_TOOLKIT_VERSION + +RUN apt-get update && apt-get install -y --no-install-recommends \ + gpg curl 
ca-certificates && \ + curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \ + | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \ + curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \ + | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \ + | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && \ + apt-get update && \ + apt-get install -y --no-install-recommends \ + "nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION}" \ + "nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION}" \ + "libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION}" \ + "libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}" && \ + rm -rf /var/lib/apt/lists/* + +# --------------------------------------------------------------------------- +# Final cluster image +# --------------------------------------------------------------------------- +FROM nvcr.io/nvidia/base/ubuntu:noble-20251013 AS cluster + +RUN apt-get update && apt-get install -y --no-install-recommends \ + ca-certificates \ + iptables \ + mount \ + dnsutils \ + && rm -rf /var/lib/apt/lists/* + +COPY --from=k3s /bin/ /bin/ +COPY --from=k9s /tmp/k9s /usr/local/bin/k9s +COPY --from=helm /tmp/helm /usr/local/bin/helm +COPY --from=k3s /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/k3s-ca-certificates.crt +COPY --from=k3s /usr/share/zoneinfo/ /usr/share/zoneinfo/ + +ENV PATH="/var/lib/rancher/k3s/data/cni:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/bin/aux" \ + CRI_CONFIG_FILE="/var/lib/rancher/k3s/agent/etc/crictl.yaml" + +COPY --from=nvidia-toolkit /usr/bin/nvidia-cdi-hook /usr/bin/ +COPY --from=nvidia-toolkit /usr/bin/nvidia-container-runtime /usr/bin/ +COPY --from=nvidia-toolkit /usr/bin/nvidia-container-runtime-hook /usr/bin/ +COPY --from=nvidia-toolkit /usr/bin/nvidia-container-cli /usr/bin/ +COPY --from=nvidia-toolkit 
/usr/bin/nvidia-ctk /usr/bin/ +COPY --from=nvidia-toolkit /etc/nvidia-container-runtime /etc/nvidia-container-runtime +COPY --from=nvidia-toolkit /usr/lib/*-linux-gnu/libnvidia-container*.so* /usr/lib/ +COPY --from=supervisor-builder /build/out/openshell-sandbox /opt/openshell/bin/openshell-sandbox + +RUN mkdir -p /var/lib/rancher/k3s/server/manifests \ + /var/lib/rancher/k3s/server/static/charts \ + /etc/rancher/k3s \ + /opt/openshell/manifests \ + /opt/openshell/charts \ + /opt/openshell/gpu-manifests \ + /run/flannel + +COPY deploy/docker/cluster-entrypoint.sh /usr/local/bin/cluster-entrypoint.sh +RUN chmod +x /usr/local/bin/cluster-entrypoint.sh + +COPY deploy/docker/cluster-healthcheck.sh /usr/local/bin/cluster-healthcheck.sh +RUN chmod +x /usr/local/bin/cluster-healthcheck.sh + +COPY deploy/docker/.build/charts/*.tgz /opt/openshell/charts/ +COPY deploy/kube/manifests/*.yaml /opt/openshell/manifests/ +COPY deploy/kube/gpu-manifests/*.yaml /opt/openshell/gpu-manifests/ + +ENTRYPOINT ["/usr/local/bin/cluster-entrypoint.sh"] + +HEALTHCHECK --interval=5s --timeout=5s --start-period=20s --retries=60 \ + CMD ["/usr/local/bin/cluster-healthcheck.sh"] diff --git a/tasks/scripts/cluster-deploy-fast.sh b/tasks/scripts/cluster-deploy-fast.sh index 213c2e25..1b2991da 100755 --- a/tasks/scripts/cluster-deploy-fast.sh +++ b/tasks/scripts/cluster-deploy-fast.sh @@ -149,13 +149,13 @@ matches_gateway() { Cargo.toml|Cargo.lock|proto/*|deploy/docker/cross-build.sh) return 0 ;; - crates/openshell-core/*|crates/openshell-providers/*) + crates/openshell-core/*|crates/openshell-policy/*|crates/openshell-providers/*) return 0 ;; crates/openshell-router/*) return 0 ;; - crates/openshell-server/*|deploy/docker/Dockerfile.gateway) + crates/openshell-server/*|deploy/docker/Dockerfile.images) return 0 ;; *) @@ -173,7 +173,7 @@ matches_supervisor() { crates/openshell-core/*|crates/openshell-policy/*|crates/openshell-router/*) return 0 ;; - crates/openshell-sandbox/*) + 
crates/openshell-sandbox/*|deploy/docker/Dockerfile.images) return 0 ;; *) @@ -206,7 +206,7 @@ compute_fingerprint() { local committed_trees="" case "${component}" in gateway) - committed_trees=$(git ls-tree HEAD Cargo.toml Cargo.lock proto/ deploy/docker/cross-build.sh crates/openshell-core/ crates/openshell-providers/ crates/openshell-router/ crates/openshell-server/ deploy/docker/Dockerfile.gateway 2>/dev/null || true) + committed_trees=$(git ls-tree HEAD Cargo.toml Cargo.lock proto/ deploy/docker/cross-build.sh crates/openshell-core/ crates/openshell-policy/ crates/openshell-providers/ crates/openshell-router/ crates/openshell-server/ deploy/docker/Dockerfile.images 2>/dev/null || true) ;; supervisor) committed_trees=$(git ls-tree HEAD Cargo.toml Cargo.lock proto/ deploy/docker/cross-build.sh crates/openshell-core/ crates/openshell-policy/ crates/openshell-router/ crates/openshell-sandbox/ 2>/dev/null || true) @@ -315,32 +315,21 @@ if [[ "${build_supervisor}" == "1" ]]; then _cluster_image=$(docker inspect --format '{{.Config.Image}}' "${CONTAINER_NAME}" 2>/dev/null) CLUSTER_ARCH=$(docker image inspect --format '{{.Architecture}}' "${_cluster_image}" 2>/dev/null || echo "amd64") - # Build the supervisor binary using docker buildx with a lightweight build. - # We use the same cross-build.sh helpers as the full cluster image but only - # compile openshell-sandbox, then extract the binary via --output. + # Build the supervisor binary from the shared image build graph, then + # extract it via --output so fast deploys reuse the same Rust cache. SUPERVISOR_BUILD_DIR=$(mktemp -d) trap 'rm -rf "${SUPERVISOR_BUILD_DIR}"' EXIT # Compute cargo version from git tags for the supervisor binary. 
- SUPERVISOR_VERSION_ARGS=() - if [[ -n "${OPENSHELL_CARGO_VERSION:-}" ]]; then - SUPERVISOR_VERSION_ARGS=(--build-arg "OPENSHELL_CARGO_VERSION=${OPENSHELL_CARGO_VERSION}") - else + _cargo_version=${OPENSHELL_CARGO_VERSION:-} + if [[ -z "${_cargo_version}" ]]; then _cargo_version=$(uv run python tasks/scripts/release.py get-version --cargo 2>/dev/null || true) - if [[ -n "${_cargo_version}" ]]; then - SUPERVISOR_VERSION_ARGS=(--build-arg "OPENSHELL_CARGO_VERSION=${_cargo_version}") - fi fi - docker buildx build \ - --file deploy/docker/Dockerfile.cluster \ - --target supervisor-builder \ - --build-arg "BUILDARCH=$(docker version --format '{{.Server.Arch}}')" \ - --build-arg "TARGETARCH=${CLUSTER_ARCH}" \ - ${SUPERVISOR_VERSION_ARGS[@]+"${SUPERVISOR_VERSION_ARGS[@]}"} \ - --output "type=local,dest=${SUPERVISOR_BUILD_DIR}" \ - --platform "linux/${CLUSTER_ARCH}" \ - . + DOCKER_PLATFORM="linux/${CLUSTER_ARCH}" \ + DOCKER_OUTPUT="type=local,dest=${SUPERVISOR_BUILD_DIR}" \ + OPENSHELL_CARGO_VERSION="${_cargo_version}" \ + tasks/scripts/docker-build-image.sh supervisor-builder # Copy the built binary into the running k3s container docker exec "${CONTAINER_NAME}" mkdir -p /opt/openshell/bin diff --git a/tasks/scripts/docker-build-cluster.sh b/tasks/scripts/docker-build-cluster.sh index 80dc2a48..425d8e75 100755 --- a/tasks/scripts/docker-build-cluster.sh +++ b/tasks/scripts/docker-build-cluster.sh @@ -3,89 +3,12 @@ # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 -# Build the k3s cluster image with bundled helm charts. 
-# -# Environment: -# IMAGE_TAG - Image tag (default: dev) -# K3S_VERSION - k3s version override (optional; default in Dockerfile.cluster) - -# DOCKER_PLATFORM - Target platform (optional) -# DOCKER_BUILDER - Buildx builder name (default: auto-select) -# DOCKER_PUSH - When set to "1", push instead of loading into local daemon -# IMAGE_REGISTRY - Registry prefix for image name (e.g. ghcr.io/org/repo) set -euo pipefail -IMAGE_TAG=${IMAGE_TAG:-dev} -IMAGE_NAME="openshell/cluster" -if [[ -n "${IMAGE_REGISTRY:-}" ]]; then - IMAGE_NAME="${IMAGE_REGISTRY}/cluster" -fi -DOCKER_BUILD_CACHE_DIR=${DOCKER_BUILD_CACHE_DIR:-.cache/buildkit} -CACHE_PATH="${DOCKER_BUILD_CACHE_DIR}/cluster" - -mkdir -p "${CACHE_PATH}" - -# Select builder — prefer native "docker" driver for local single-arch builds -# to avoid slow tarball export from the docker-container driver. -BUILDER_ARGS=() -if [[ -n "${DOCKER_BUILDER:-}" ]]; then - BUILDER_ARGS=(--builder "${DOCKER_BUILDER}") -elif [[ -z "${DOCKER_PLATFORM:-}" && -z "${CI:-}" ]]; then - _ctx=$(docker context inspect --format '{{.Name}}' 2>/dev/null || echo default) - BUILDER_ARGS=(--builder "${_ctx}") -fi - -CACHE_ARGS=() -if [[ -z "${CI:-}" ]]; then - # Local development: use filesystem cache with docker-container driver. - if docker buildx inspect ${BUILDER_ARGS[@]+"${BUILDER_ARGS[@]}"} 2>/dev/null | grep -q "Driver: docker-container"; then - CACHE_ARGS=( - --cache-from "type=local,src=${CACHE_PATH}" - --cache-to "type=local,dest=${CACHE_PATH},mode=max" - ) - fi -fi - -# Create build directory for charts mkdir -p deploy/docker/.build/charts -# Package helm chart echo "Packaging helm chart..." helm package deploy/helm/openshell -d deploy/docker/.build/charts/ -# Build cluster image (no bundled component images — they are pulled at runtime -# from the distribution registry; credentials are injected at deploy time) echo "Building cluster image..." 
- -OUTPUT_FLAG="--load" -if [[ "${DOCKER_PUSH:-}" == "1" ]]; then - OUTPUT_FLAG="--push" -elif [[ "${DOCKER_PLATFORM:-}" == *","* ]]; then - # Multi-platform builds cannot use --load; push is required. - OUTPUT_FLAG="--push" -fi - -# Compute cargo version from git tags (same scheme as docker-build-component.sh). -VERSION_ARGS=() -if [[ -n "${OPENSHELL_CARGO_VERSION:-}" ]]; then - VERSION_ARGS=(--build-arg "OPENSHELL_CARGO_VERSION=${OPENSHELL_CARGO_VERSION}") -else - CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo 2>/dev/null || true) - if [[ -n "${CARGO_VERSION}" ]]; then - VERSION_ARGS=(--build-arg "OPENSHELL_CARGO_VERSION=${CARGO_VERSION}") - fi -fi - -docker buildx build \ - ${BUILDER_ARGS[@]+"${BUILDER_ARGS[@]}"} \ - ${DOCKER_PLATFORM:+--platform ${DOCKER_PLATFORM}} \ - ${CACHE_ARGS[@]+"${CACHE_ARGS[@]}"} \ - ${VERSION_ARGS[@]+"${VERSION_ARGS[@]}"} \ - -f deploy/docker/Dockerfile.cluster \ - -t ${IMAGE_NAME}:${IMAGE_TAG} \ - ${K3S_VERSION:+--build-arg K3S_VERSION=${K3S_VERSION}} \ - --provenance=false \ - ${OUTPUT_FLAG} \ - . - -echo "Done! Cluster image: ${IMAGE_NAME}:${IMAGE_TAG}" +exec tasks/scripts/docker-build-image.sh cluster "$@" diff --git a/tasks/scripts/docker-build-component.sh b/tasks/scripts/docker-build-component.sh index f20d7295..312e5ac0 100755 --- a/tasks/scripts/docker-build-component.sh +++ b/tasks/scripts/docker-build-component.sh @@ -3,159 +3,35 @@ # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 -# Generic Docker image builder for OpenShell components. -# Usage: docker-build-component.sh [extra docker build args...] -# -# docker-build-component.sh gateway -> Dockerfile.gateway -> openshell/gateway:dev -# docker-build-component.sh cluster -> Dockerfile.cluster -> openshell/cluster:dev -# -# Environment: -# IMAGE_TAG - Image tag (default: dev) -# DOCKER_PLATFORM - Target platform (optional, e.g. 
linux/amd64) -# DOCKER_BUILDER - Buildx builder name (default: auto-select) -# DOCKER_PUSH - When set to "1", push instead of loading into local daemon -# IMAGE_REGISTRY - Registry prefix for image name (e.g. ghcr.io/org/repo) set -euo pipefail -sha256_16() { - if command -v sha256sum >/dev/null 2>&1; then - sha256sum "$1" | awk '{print substr($1, 1, 16)}' - else - shasum -a 256 "$1" | awk '{print substr($1, 1, 16)}' - fi -} - -sha256_16_stdin() { - if command -v sha256sum >/dev/null 2>&1; then - sha256sum | awk '{print substr($1, 1, 16)}' - else - shasum -a 256 | awk '{print substr($1, 1, 16)}' - fi -} - -detect_rust_scope() { - local dockerfile="$1" - local rust_from - rust_from=$(grep -E '^FROM --platform=\$BUILDPLATFORM rust:[^ ]+' "$dockerfile" | head -n1 | sed -E 's/^FROM --platform=\$BUILDPLATFORM rust:([^ ]+).*/\1/' || true) - if [[ -n "${rust_from}" ]]; then - echo "rust-${rust_from}" - return - fi - - if grep -q "rustup.rs" "$dockerfile"; then - echo "rustup-stable" - return - fi - - echo "no-rust" -} - -COMPONENT=${1:?"Usage: docker-build-component.sh [variant] [extra-args...]"} +COMPONENT=${1:?"Usage: docker-build-component.sh [extra-args...]"} shift -# Resolve Dockerfile path and image name. -# If the component has a subdirectory layout, consume the next positional arg -# as a variant name (default: base). -VARIANT="" -COMPONENT_DIR="deploy/docker/${COMPONENT}" -if [[ -d "${COMPONENT_DIR}" ]]; then - # Subdirectory layout — check for a variant argument. - if [[ $# -gt 0 && ! "$1" == --* ]]; then - VARIANT="$1" - shift - fi - VARIANT=${VARIANT:-base} - DOCKERFILE="${COMPONENT_DIR}/Dockerfile.${VARIANT}" - if [[ "${VARIANT}" == "base" ]]; then - IMAGE_NAME="openshell/${COMPONENT}" - else - IMAGE_NAME="openshell/${COMPONENT}-${VARIANT}" - fi -else - # Flat layout: deploy/docker/Dockerfile. - DOCKERFILE="deploy/docker/Dockerfile.${COMPONENT}" - IMAGE_NAME="openshell/${COMPONENT}" -fi - -if [[ ! 
-f "${DOCKERFILE}" ]]; then - echo "Error: Dockerfile not found: ${DOCKERFILE}" >&2 - exit 1 -fi - -# Prefix with registry when set (e.g. ghcr.io/org/repo/gateway:tag). -# Replaces the default "openshell/" prefix with the registry path. -if [[ -n "${IMAGE_REGISTRY:-}" ]]; then - _suffix="${IMAGE_NAME#openshell/}" - IMAGE_NAME="${IMAGE_REGISTRY}/${_suffix}" -fi - -IMAGE_TAG=${IMAGE_TAG:-dev} -DOCKER_BUILD_CACHE_DIR=${DOCKER_BUILD_CACHE_DIR:-.cache/buildkit} -CACHE_PATH="${DOCKER_BUILD_CACHE_DIR}/${COMPONENT}${VARIANT:+-${VARIANT}}" - -mkdir -p "${CACHE_PATH}" - -# Select the builder. For local (single-arch) builds use a builder with the -# native "docker" driver so images land directly in the Docker image store — -# no slow tarball export via the docker-container driver. -# Multi-platform builds (DOCKER_PLATFORM set) keep the current builder which -# is typically docker-container. -BUILDER_ARGS=() -if [[ -n "${DOCKER_BUILDER:-}" ]]; then - BUILDER_ARGS=(--builder "${DOCKER_BUILDER}") -elif [[ -z "${DOCKER_PLATFORM:-}" && -z "${CI:-}" ]]; then - # Pick the builder matching the active docker context (uses docker driver). - _ctx=$(docker context inspect --format '{{.Name}}' 2>/dev/null || echo default) - BUILDER_ARGS=(--builder "${_ctx}") -fi - -CACHE_ARGS=() -if [[ -z "${CI:-}" ]]; then - # Local development: use filesystem cache with docker-container driver. - if docker buildx inspect ${BUILDER_ARGS[@]+"${BUILDER_ARGS[@]}"} 2>/dev/null | grep -q "Driver: docker-container"; then - CACHE_ARGS=( - --cache-from "type=local,src=${CACHE_PATH}" - --cache-to "type=local,dest=${CACHE_PATH},mode=max" - ) - fi -fi - -OUTPUT_FLAG="--load" -if [[ "${DOCKER_PUSH:-}" == "1" ]]; then - OUTPUT_FLAG="--push" -elif [[ "${DOCKER_PLATFORM:-}" == *","* ]]; then - # Multi-platform builds cannot use --load; push is required. 
- OUTPUT_FLAG="--push" -fi - -SCCACHE_ARGS=() -if [[ -n "${SCCACHE_MEMCACHED_ENDPOINT:-}" ]]; then - SCCACHE_ARGS=(--build-arg "SCCACHE_MEMCACHED_ENDPOINT=${SCCACHE_MEMCACHED_ENDPOINT}") -fi - -VERSION_ARGS=() -if [[ -n "${OPENSHELL_CARGO_VERSION:-}" ]]; then - VERSION_ARGS=(--build-arg "OPENSHELL_CARGO_VERSION=${OPENSHELL_CARGO_VERSION}") -elif [[ "${COMPONENT}" == "gateway" ]]; then - CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo) - VERSION_ARGS=(--build-arg "OPENSHELL_CARGO_VERSION=${CARGO_VERSION}") -fi - -LOCK_HASH=$(sha256_16 Cargo.lock) -RUST_SCOPE=${RUST_TOOLCHAIN_SCOPE:-$(detect_rust_scope "${DOCKERFILE}")} -CACHE_SCOPE_INPUT="v1|${COMPONENT}|${VARIANT:-base}|${LOCK_HASH}|${RUST_SCOPE}" -CARGO_TARGET_CACHE_SCOPE=$(printf '%s' "${CACHE_SCOPE_INPUT}" | sha256_16_stdin) - -docker buildx build \ - ${BUILDER_ARGS[@]+"${BUILDER_ARGS[@]}"} \ - ${DOCKER_PLATFORM:+--platform ${DOCKER_PLATFORM}} \ - ${CACHE_ARGS[@]+"${CACHE_ARGS[@]}"} \ - ${SCCACHE_ARGS[@]+"${SCCACHE_ARGS[@]}"} \ - ${VERSION_ARGS[@]+"${VERSION_ARGS[@]}"} \ - --build-arg "CARGO_TARGET_CACHE_SCOPE=${CARGO_TARGET_CACHE_SCOPE}" \ - -f "${DOCKERFILE}" \ - -t "${IMAGE_NAME}:${IMAGE_TAG}" \ - --provenance=false \ - "$@" \ - ${OUTPUT_FLAG} \ - . +case "${COMPONENT}" in + gateway) + exec tasks/scripts/docker-build-image.sh gateway "$@" + ;; + ci) + OUTPUT_ARGS=(--load) + if [[ "${DOCKER_PUSH:-}" == "1" ]]; then + OUTPUT_ARGS=(--push) + elif [[ "${DOCKER_PLATFORM:-}" == *","* ]]; then + OUTPUT_ARGS=(--push) + fi + + exec docker buildx build \ + ${DOCKER_BUILDER:+--builder ${DOCKER_BUILDER}} \ + ${DOCKER_PLATFORM:+--platform ${DOCKER_PLATFORM}} \ + -f deploy/docker/Dockerfile.ci \ + -t "openshell/ci:${IMAGE_TAG:-dev}" \ + --provenance=false \ + "$@" \ + ${OUTPUT_ARGS[@]+"${OUTPUT_ARGS[@]}"} \ + . 
+ ;; + *) + echo "Error: unsupported component '${COMPONENT}'" >&2 + exit 1 + ;; +esac diff --git a/tasks/scripts/docker-build-image.sh b/tasks/scripts/docker-build-image.sh new file mode 100755 index 00000000..80f36786 --- /dev/null +++ b/tasks/scripts/docker-build-image.sh @@ -0,0 +1,160 @@ +#!/usr/bin/env bash + +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +set -euo pipefail + +sha256_16() { + if command -v sha256sum >/dev/null 2>&1; then + sha256sum "$1" | awk '{print substr($1, 1, 16)}' + else + shasum -a 256 "$1" | awk '{print substr($1, 1, 16)}' + fi +} + +sha256_16_stdin() { + if command -v sha256sum >/dev/null 2>&1; then + sha256sum | awk '{print substr($1, 1, 16)}' + else + shasum -a 256 | awk '{print substr($1, 1, 16)}' + fi +} + +detect_rust_scope() { + local dockerfile="$1" + local rust_from + rust_from=$(grep -E '^FROM --platform=\$BUILDPLATFORM rust:[^ ]+' "$dockerfile" | head -n1 | sed -E 's/^FROM --platform=\$BUILDPLATFORM rust:([^ ]+).*/\1/' || true) + if [[ -n "${rust_from}" ]]; then + echo "rust-${rust_from}" + return + fi + + if grep -q "rustup.rs" "$dockerfile"; then + echo "rustup-stable" + return + fi + + echo "no-rust" +} + +TARGET=${1:?"Usage: docker-build-image.sh [extra-args...]"} +shift + +DOCKERFILE="deploy/docker/Dockerfile.images" +if [[ ! 
-f "${DOCKERFILE}" ]]; then + echo "Error: Dockerfile not found: ${DOCKERFILE}" >&2 + exit 1 +fi + +IS_FINAL_IMAGE=0 +IMAGE_NAME="" +DOCKER_TARGET="" +case "${TARGET}" in + gateway) + IS_FINAL_IMAGE=1 + IMAGE_NAME="openshell/gateway" + DOCKER_TARGET="gateway" + ;; + cluster) + IS_FINAL_IMAGE=1 + IMAGE_NAME="openshell/cluster" + DOCKER_TARGET="cluster" + ;; + supervisor-builder) + DOCKER_TARGET="supervisor-builder" + ;; + *) + echo "Error: unsupported target '${TARGET}'" >&2 + exit 1 + ;; +esac + +if [[ -n "${IMAGE_REGISTRY:-}" && "${IS_FINAL_IMAGE}" == "1" ]]; then + IMAGE_NAME="${IMAGE_REGISTRY}/${IMAGE_NAME#openshell/}" +fi + +IMAGE_TAG=${IMAGE_TAG:-dev} +DOCKER_BUILD_CACHE_DIR=${DOCKER_BUILD_CACHE_DIR:-.cache/buildkit} +CACHE_PATH="${DOCKER_BUILD_CACHE_DIR}/images" +mkdir -p "${CACHE_PATH}" + +BUILDER_ARGS=() +if [[ -n "${DOCKER_BUILDER:-}" ]]; then + BUILDER_ARGS=(--builder "${DOCKER_BUILDER}") +elif [[ -z "${DOCKER_PLATFORM:-}" && -z "${CI:-}" ]]; then + _ctx=$(docker context inspect --format '{{.Name}}' 2>/dev/null || echo default) + BUILDER_ARGS=(--builder "${_ctx}") +fi + +CACHE_ARGS=() +if [[ -z "${CI:-}" ]]; then + if docker buildx inspect ${BUILDER_ARGS[@]+"${BUILDER_ARGS[@]}"} 2>/dev/null | grep -q "Driver: docker-container"; then + CACHE_ARGS=( + --cache-from "type=local,src=${CACHE_PATH}" + --cache-to "type=local,dest=${CACHE_PATH},mode=max" + ) + fi +fi + +SCCACHE_ARGS=() +if [[ -n "${SCCACHE_MEMCACHED_ENDPOINT:-}" ]]; then + SCCACHE_ARGS=(--build-arg "SCCACHE_MEMCACHED_ENDPOINT=${SCCACHE_MEMCACHED_ENDPOINT}") +fi + +VERSION_ARGS=() +if [[ -n "${OPENSHELL_CARGO_VERSION:-}" ]]; then + VERSION_ARGS=(--build-arg "OPENSHELL_CARGO_VERSION=${OPENSHELL_CARGO_VERSION}") +else + CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo 2>/dev/null || true) + if [[ -n "${CARGO_VERSION}" ]]; then + VERSION_ARGS=(--build-arg "OPENSHELL_CARGO_VERSION=${CARGO_VERSION}") + fi +fi + +LOCK_HASH=$(sha256_16 Cargo.lock) 
+RUST_SCOPE=${RUST_TOOLCHAIN_SCOPE:-$(detect_rust_scope "${DOCKERFILE}")} +CACHE_SCOPE_INPUT="v2|shared|release|${LOCK_HASH}|${RUST_SCOPE}" +CARGO_TARGET_CACHE_SCOPE=$(printf '%s' "${CACHE_SCOPE_INPUT}" | sha256_16_stdin) + +K3S_ARGS=() +if [[ "${TARGET}" == "cluster" && -n "${K3S_VERSION:-}" ]]; then + K3S_ARGS=(--build-arg "K3S_VERSION=${K3S_VERSION}") +fi + +TAG_ARGS=() +if [[ "${IS_FINAL_IMAGE}" == "1" ]]; then + TAG_ARGS=(-t "${IMAGE_NAME}:${IMAGE_TAG}") +fi + +OUTPUT_ARGS=() +if [[ -n "${DOCKER_OUTPUT:-}" ]]; then + OUTPUT_ARGS=(--output "${DOCKER_OUTPUT}") +elif [[ "${IS_FINAL_IMAGE}" == "1" ]]; then + if [[ "${DOCKER_PUSH:-}" == "1" ]]; then + OUTPUT_ARGS=(--push) + elif [[ "${DOCKER_PLATFORM:-}" == *","* ]]; then + OUTPUT_ARGS=(--push) + else + OUTPUT_ARGS=(--load) + fi +else + echo "Error: DOCKER_OUTPUT must be set when building target '${TARGET}'" >&2 + exit 1 +fi + +docker buildx build \ + ${BUILDER_ARGS[@]+"${BUILDER_ARGS[@]}"} \ + ${DOCKER_PLATFORM:+--platform ${DOCKER_PLATFORM}} \ + ${CACHE_ARGS[@]+"${CACHE_ARGS[@]}"} \ + ${SCCACHE_ARGS[@]+"${SCCACHE_ARGS[@]}"} \ + ${VERSION_ARGS[@]+"${VERSION_ARGS[@]}"} \ + ${K3S_ARGS[@]+"${K3S_ARGS[@]}"} \ + --build-arg "CARGO_TARGET_CACHE_SCOPE=${CARGO_TARGET_CACHE_SCOPE}" \ + -f "${DOCKERFILE}" \ + --target "${DOCKER_TARGET}" \ + ${TAG_ARGS[@]+"${TAG_ARGS[@]}"} \ + --provenance=false \ + "$@" \ + ${OUTPUT_ARGS[@]+"${OUTPUT_ARGS[@]}"} \ + . diff --git a/tasks/scripts/docker-publish-multiarch.sh b/tasks/scripts/docker-publish-multiarch.sh index 7bb6dc84..6847c90f 100755 --- a/tasks/scripts/docker-publish-multiarch.sh +++ b/tasks/scripts/docker-publish-multiarch.sh @@ -3,88 +3,31 @@ # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 -# Unified multi-arch build and push for all OpenShell images. 
-# -# Usage: -# docker-publish-multiarch.sh --mode registry # Push to DOCKER_REGISTRY -# docker-publish-multiarch.sh --mode ecr # Push to ECR -# -# Environment: -# IMAGE_TAG - Image tag (default: dev) -# K3S_VERSION - k3s version override (optional; default in Dockerfile.cluster) - -# DOCKER_PLATFORMS - Target platforms (default: linux/amd64,linux/arm64) -# RUST_BUILD_PROFILE - Rust build profile for sandbox (default: release) -# TAG_LATEST - If true, add/update :latest tag (default: false) -# EXTRA_DOCKER_TAGS - Additional tags to add (comma or space separated) -# -# Registry mode env: -# DOCKER_REGISTRY - Registry URL (required, e.g. ghcr.io/myorg) -# -# ECR mode env: -# AWS_ACCOUNT_ID - AWS account ID (default: 012345678901) -# AWS_REGION - AWS region (default: us-west-2) set -euo pipefail -sha256_16() { - if command -v sha256sum >/dev/null 2>&1; then - sha256sum "$1" | awk '{print substr($1, 1, 16)}' - else - shasum -a 256 "$1" | awk '{print substr($1, 1, 16)}' - fi -} - -sha256_16_stdin() { - if command -v sha256sum >/dev/null 2>&1; then - sha256sum | awk '{print substr($1, 1, 16)}' - else - shasum -a 256 | awk '{print substr($1, 1, 16)}' - fi -} - -detect_rust_scope() { - local dockerfile="$1" - local rust_from - rust_from=$(grep -E '^FROM --platform=\$BUILDPLATFORM rust:[^ ]+' "$dockerfile" | head -n1 | sed -E 's/^FROM --platform=\$BUILDPLATFORM rust:([^ ]+).*/\1/' || true) - if [[ -n "${rust_from}" ]]; then - echo "rust-${rust_from}" - return - fi - - if grep -q "rustup.rs" "$dockerfile"; then - echo "rustup-stable" - return - fi - - echo "no-rust" +usage() { + echo "Usage: docker-publish-multiarch.sh --mode " >&2 + exit 1 } -# --------------------------------------------------------------------------- -# Parse arguments -# --------------------------------------------------------------------------- MODE="" while [[ $# -gt 0 ]]; do - case $1 in - --mode) MODE="$2"; shift 2 ;; - *) echo "Unknown argument: $1" >&2; exit 1 ;; + case "$1" in + --mode) + 
MODE="$2" + shift 2 + ;; + *) + echo "Unknown argument: $1" >&2 + usage + ;; esac done -if [[ -z "$MODE" ]]; then - echo "Usage: docker-publish-multiarch.sh --mode " >&2 - exit 1 -fi +[[ -n "${MODE}" ]] || usage -# --------------------------------------------------------------------------- -# Common variables -# --------------------------------------------------------------------------- IMAGE_TAG=${IMAGE_TAG:-dev} PLATFORMS=${DOCKER_PLATFORMS:-linux/amd64,linux/arm64} -CARGO_VERSION=${OPENSHELL_CARGO_VERSION:-} -if [[ -z "${CARGO_VERSION}" ]]; then - CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo) -fi -EXTRA_BUILD_FLAGS="" TAG_LATEST=${TAG_LATEST:-false} EXTRA_DOCKER_TAGS_RAW=${EXTRA_DOCKER_TAGS:-} EXTRA_TAGS=() @@ -92,39 +35,25 @@ EXTRA_TAGS=() if [[ -n "${EXTRA_DOCKER_TAGS_RAW}" ]]; then EXTRA_DOCKER_TAGS_RAW=${EXTRA_DOCKER_TAGS_RAW//,/ } for tag in ${EXTRA_DOCKER_TAGS_RAW}; do - if [[ -n "${tag}" ]]; then - EXTRA_TAGS+=("${tag}") - fi + [[ -n "${tag}" ]] && EXTRA_TAGS+=("${tag}") done fi -# --------------------------------------------------------------------------- -# Mode-specific configuration -# --------------------------------------------------------------------------- -case "$MODE" in +case "${MODE}" in registry) REGISTRY=${DOCKER_REGISTRY:?Set DOCKER_REGISTRY to push multi-arch images (e.g. ghcr.io/myorg)} - IMAGE_PREFIX="openshell-" ;; ecr) AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-012345678901} AWS_REGION=${AWS_REGION:-us-west-2} - ECR_HOST="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com" - REGISTRY="${ECR_HOST}/openshell" - IMAGE_PREFIX="" - EXTRA_BUILD_FLAGS="--provenance=false --sbom=false" + REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/openshell" ;; *) - echo "Unknown mode: $MODE (expected 'registry' or 'ecr')" >&2 - exit 1 + echo "Unknown mode: ${MODE}" >&2 + usage ;; esac -# --------------------------------------------------------------------------- -# Select or create a multi-platform buildx builder. 
-# If DOCKER_BUILDER is set (e.g. by CI with remote BuildKit nodes), use it. -# Otherwise fall back to creating a local "multiarch" builder. -# --------------------------------------------------------------------------- BUILDER_NAME=${DOCKER_BUILDER:-multiarch} if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then echo "Using existing buildx builder: ${BUILDER_NAME}" @@ -134,109 +63,48 @@ else docker buildx create --name "${BUILDER_NAME}" --use --bootstrap fi -# --------------------------------------------------------------------------- -# Resolve Dockerfile path for a component. -# --------------------------------------------------------------------------- -resolve_dockerfile() { - local comp="$1" - echo "deploy/docker/Dockerfile.${comp}" -} +export DOCKER_BUILDER="${BUILDER_NAME}" +export DOCKER_PLATFORM="${PLATFORMS}" +export DOCKER_PUSH=1 +export IMAGE_REGISTRY="${REGISTRY}" -# --------------------------------------------------------------------------- -# Step 1: Build and push the gateway image as a multi-arch manifest. -# Uses cross-compilation in the Dockerfile (BUILDPLATFORM != TARGETPLATFORM) -# so Rust compiles natively and only the final stage runs on the target arch. -# Sandbox images are maintained in the community repo and not built here. -# --------------------------------------------------------------------------- echo "Building multi-arch gateway image..." 
-LOCK_HASH=$(sha256_16 Cargo.lock) -GATEWAY_DOCKERFILE=$(resolve_dockerfile "gateway") -BUILD_ARGS="--build-arg OPENSHELL_CARGO_VERSION=${CARGO_VERSION}" -if [ -n "${SCCACHE_MEMCACHED_ENDPOINT:-}" ]; then - BUILD_ARGS="${BUILD_ARGS} --build-arg SCCACHE_MEMCACHED_ENDPOINT=${SCCACHE_MEMCACHED_ENDPOINT}" -fi -RUST_SCOPE=${RUST_TOOLCHAIN_SCOPE:-$(detect_rust_scope "${GATEWAY_DOCKERFILE}")} -CACHE_SCOPE_INPUT="v1|gateway|base|${LOCK_HASH}|${RUST_SCOPE}" -CARGO_TARGET_CACHE_SCOPE=$(printf '%s' "${CACHE_SCOPE_INPUT}" | sha256_16_stdin) -BUILD_ARGS="${BUILD_ARGS} --build-arg CARGO_TARGET_CACHE_SCOPE=${CARGO_TARGET_CACHE_SCOPE}" -FULL_IMAGE="${REGISTRY}/${IMAGE_PREFIX}gateway" -docker buildx build \ - --platform "${PLATFORMS}" \ - -f "${GATEWAY_DOCKERFILE}" \ - -t "${FULL_IMAGE}:${IMAGE_TAG}" \ - ${EXTRA_BUILD_FLAGS} \ - ${BUILD_ARGS} \ - --push \ - . - -# --------------------------------------------------------------------------- -# Step 2: Package helm charts (architecture-independent) -# --------------------------------------------------------------------------- +tasks/scripts/docker-build-image.sh gateway + mkdir -p deploy/docker/.build/charts echo "Packaging helm chart..." helm package deploy/helm/openshell -d deploy/docker/.build/charts/ -# --------------------------------------------------------------------------- -# Step 3: Build and push multi-arch cluster image. -# The cluster image includes the supervisor binary (built from Rust source) -# and k3s. Gateway images are pulled at runtime from the registry; sandbox -# images are pulled from the community registry. -# --------------------------------------------------------------------------- -echo "" +echo echo "Building multi-arch cluster image..." 
-CLUSTER_DOCKERFILE="deploy/docker/Dockerfile.cluster" -CLUSTER_RUST_SCOPE=${RUST_TOOLCHAIN_SCOPE:-$(detect_rust_scope "${CLUSTER_DOCKERFILE}")} -CLUSTER_CACHE_SCOPE_INPUT="v1|cluster|base|${LOCK_HASH}|${CLUSTER_RUST_SCOPE}" -CLUSTER_CARGO_SCOPE=$(printf '%s' "${CLUSTER_CACHE_SCOPE_INPUT}" | sha256_16_stdin) -CLUSTER_BUILD_ARGS="" -if [ -n "${SCCACHE_MEMCACHED_ENDPOINT:-}" ]; then - CLUSTER_BUILD_ARGS="--build-arg SCCACHE_MEMCACHED_ENDPOINT=${SCCACHE_MEMCACHED_ENDPOINT}" -fi -CLUSTER_IMAGE="${REGISTRY}/${IMAGE_PREFIX:+${IMAGE_PREFIX}}cluster" -docker buildx build \ - --platform "${PLATFORMS}" \ - -f "${CLUSTER_DOCKERFILE}" \ - -t "${CLUSTER_IMAGE}:${IMAGE_TAG}" \ - ${K3S_VERSION:+--build-arg K3S_VERSION=${K3S_VERSION}} \ - --build-arg "CARGO_TARGET_CACHE_SCOPE=${CLUSTER_CARGO_SCOPE}" \ - ${CLUSTER_BUILD_ARGS} \ - ${EXTRA_BUILD_FLAGS} \ - --push \ - . - -# --------------------------------------------------------------------------- -# Step 4: Apply additional tags by copying manifests. -# Use --prefer-index=false to carbon-copy the source manifest format instead of -# wrapping it in an OCI image index (which the registry v3 proxy can't serve). -# --------------------------------------------------------------------------- +tasks/scripts/docker-build-image.sh cluster + TAGS_TO_APPLY=("${EXTRA_TAGS[@]}") -if [ "$TAG_LATEST" = true ]; then +if [[ "${TAG_LATEST}" == "true" ]]; then TAGS_TO_APPLY+=("latest") fi -if [ ${#TAGS_TO_APPLY[@]} -gt 0 ]; then +if [[ ${#TAGS_TO_APPLY[@]} -gt 0 ]]; then for component in gateway cluster; do - FULL_IMAGE="${REGISTRY}/${IMAGE_PREFIX:+${IMAGE_PREFIX}}${component}" + full_image="${REGISTRY}/${component}" for tag in "${TAGS_TO_APPLY[@]}"; do - if [ "${tag}" = "${IMAGE_TAG}" ]; then - continue - fi - echo "Tagging ${FULL_IMAGE}:${tag}..." + [[ "${tag}" == "${IMAGE_TAG}" ]] && continue + echo "Tagging ${full_image}:${tag}..." 
docker buildx imagetools create \ --prefer-index=false \ - -t "${FULL_IMAGE}:${tag}" \ - "${FULL_IMAGE}:${IMAGE_TAG}" + -t "${full_image}:${tag}" \ + "${full_image}:${IMAGE_TAG}" done done fi -echo "" +echo echo "Done! Multi-arch images pushed to ${REGISTRY}:" -echo " ${REGISTRY}/${IMAGE_PREFIX}gateway:${IMAGE_TAG}" -echo " ${REGISTRY}/${IMAGE_PREFIX:+${IMAGE_PREFIX}}cluster:${IMAGE_TAG}" -if [ "$TAG_LATEST" = true ]; then +echo " ${REGISTRY}/gateway:${IMAGE_TAG}" +echo " ${REGISTRY}/cluster:${IMAGE_TAG}" +if [[ "${TAG_LATEST}" == "true" ]]; then echo " (all also tagged :latest)" fi -if [ ${#EXTRA_TAGS[@]} -gt 0 ]; then +if [[ ${#EXTRA_TAGS[@]} -gt 0 ]]; then echo " (all also tagged: ${EXTRA_TAGS[*]})" fi From 2446de900e4184d29bdb4a48c539c91db24ce9a8 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Mon, 16 Mar 2026 19:05:14 -0700 Subject: [PATCH 02/15] update mise commands --- CONTRIBUTING.md | 3 +++ tasks/ci.toml | 14 ++++++++++++++ tasks/cluster.toml | 2 +- tasks/docker.toml | 39 ++++++++++++++++++++++++++++++++------- tasks/publish.toml | 8 ++++---- tasks/python.toml | 41 +++++++++++++++++++++++++++++++++-------- 6 files changed, 87 insertions(+), 20 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index f9e1fc1c..249a6b47 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -168,6 +168,9 @@ These are the primary `mise` tasks for day-to-day development: | Task | Purpose | | ------------------ | ------------------------------------------------------- | +| `mise run build` | Build the whole project (Rust crates, Docker images, Python wheel) | +| `mise run build:docker:gateway` | Build one Docker image using the `build:[tool]:[component]` layout | +| `mise run build:python:wheel` | Build one Python wheel using the `build:[tool]:[component]` layout | | `mise run cluster` | Bootstrap or incremental deploy | | `mise run sandbox` | Create a sandbox on the running cluster | | `mise run test` | Default test suite | diff --git a/tasks/ci.toml b/tasks/ci.toml 
index 515b5abd..64112ff4 100644 --- a/tasks/ci.toml +++ b/tasks/ci.toml @@ -4,11 +4,25 @@ # CI, build, and quality tasks [build] +description = "Build the whole project" +depends = ["build:rust:workspace", "build:docker", "build:python:wheel"] + +["build:rust"] +description = "Alias for build:rust:workspace" +depends = ["build:rust:workspace"] +hide = true + +["build:rust:workspace"] description = "Build all Rust crates" run = "cargo build --workspace" hide = true ["build:release"] +description = "Alias for build:rust:workspace:release" +depends = ["build:rust:workspace:release"] +hide = true + +["build:rust:workspace:release"] description = "Build all Rust crates in release mode" run = "cargo build --workspace --release" hide = true diff --git a/tasks/cluster.toml b/tasks/cluster.toml index debda04c..da06178f 100644 --- a/tasks/cluster.toml +++ b/tasks/cluster.toml @@ -10,7 +10,7 @@ run = "tasks/scripts/cluster.sh" ["cluster:build:full"] description = "Build and deploy local k3s cluster with OpenShell" depends = [ - "docker:build:gateway", + "build:docker:gateway", ] run = "tasks/scripts/cluster-bootstrap.sh build" hide = true diff --git a/tasks/docker.toml b/tasks/docker.toml index 8194a04a..4ca52e56 100644 --- a/tasks/docker.toml +++ b/tasks/docker.toml @@ -3,34 +3,59 @@ # Docker image build tasks -["docker:build"] +["build:docker"] description = "Build all Docker images" depends = [ - "docker:build:gateway", - "docker:build:cluster", + "build:docker:gateway", + "build:docker:cluster", ] hide = true -["docker:build:ci"] +["docker:build"] +description = "Alias for build:docker" +depends = ["build:docker"] +hide = true + +["build:docker:ci"] description = "Build the CI Docker image" run = "tasks/scripts/docker-build-component.sh ci" hide = true -["docker:build:gateway"] +["docker:build:ci"] +description = "Alias for build:docker:ci" +depends = ["build:docker:ci"] +hide = true + +["build:docker:gateway"] description = "Build the gateway Docker image" run = 
"tasks/scripts/docker-build-component.sh gateway" hide = true -["docker:build:cluster"] +["docker:build:gateway"] +description = "Alias for build:docker:gateway" +depends = ["build:docker:gateway"] +hide = true + +["build:docker:cluster"] description = "Build the k3s cluster image (component images pulled at runtime from registry)" run = "tasks/scripts/docker-build-cluster.sh" hide = true -["docker:build:cluster:multiarch"] +["docker:build:cluster"] +description = "Alias for build:docker:cluster" +depends = ["build:docker:cluster"] +hide = true + +["build:docker:cluster:multiarch"] description = "Build multi-arch cluster image and push to a registry" run = "tasks/scripts/docker-publish-multiarch.sh --mode registry" hide = true +["docker:build:cluster:multiarch"] +description = "Alias for build:docker:cluster:multiarch" +depends = ["build:docker:cluster:multiarch"] +hide = true + ["docker:publish:cluster:multiarch"] description = "Build and publish multi-arch cluster image to ECR" run = "tasks/scripts/docker-publish-multiarch.sh --mode ecr" diff --git a/tasks/publish.toml b/tasks/publish.toml index cd9994a4..c7ade875 100644 --- a/tasks/publish.toml +++ b/tasks/publish.toml @@ -12,8 +12,8 @@ VERSION_DOCKER=$(uv run python tasks/scripts/release.py get-version --docker) VERSION_PYTHON=$(uv run python tasks/scripts/release.py get-version --python) CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo) IMAGE_TAG=dev TAG_LATEST=true EXTRA_DOCKER_TAGS="$VERSION_DOCKER" mise run docker:publish:cluster:multiarch -OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run python:build:multiarch -OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run python:build:macos +OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run build:python:wheel:multiarch +OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run build:python:wheel:macos uv run python tasks/scripts/release.py python-publish --version "$VERSION_PYTHON" """ hide = true @@ -27,8 +27,8 @@ VERSION_DOCKER=$(uv run python 
tasks/scripts/release.py get-version --docker) VERSION_PYTHON=$(uv run python tasks/scripts/release.py get-version --python) CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo) IMAGE_TAG="$VERSION_DOCKER" TAG_LATEST=false mise run docker:publish:cluster:multiarch -OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run python:build:multiarch -OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run python:build:macos +OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run build:python:wheel:multiarch +OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run build:python:wheel:macos uv run python tasks/scripts/release.py python-publish --version "$VERSION_PYTHON" """ hide = true diff --git a/tasks/python.toml b/tasks/python.toml index cf861e9f..51bac54b 100644 --- a/tasks/python.toml +++ b/tasks/python.toml @@ -9,13 +9,18 @@ depends = ["python:proto"] run = "uv sync --group dev && uv pip install ." hide = true -["python:build"] +["build:python:wheel"] description = "Build Python wheel with CLI binary (native)" depends = ["python:proto"] run = "uv run maturin build --release" hide = true -["python:build:multiarch"] +["python:build"] +description = "Alias for build:python:wheel" +depends = ["build:python:wheel"] +hide = true + +["build:python:wheel:multiarch"] description = "Build Python wheels for Linux amd64/arm64 with buildx" depends = ["python:proto"] run = """ @@ -102,7 +107,12 @@ ls -la target/wheels/*.whl """ hide = true -["python:build:macos"] +["python:build:multiarch"] +description = "Alias for build:python:wheel:multiarch" +depends = ["build:python:wheel:multiarch"] +hide = true + +["build:python:wheel:macos"] description = "Build Python wheel for macOS arm64" depends = ["python:proto"] run = """ @@ -124,14 +134,19 @@ if [ "$(uname -s)" = "Darwin" ] && [ -z "${OPENSHELL_CARGO_VERSION:-}" ]; then uv run maturin build --release --target aarch64-apple-darwin --out target/wheels else echo "Building macOS wheel via Docker cross-toolchain" - mise run 
python:build:macos:docker + mise run build:python:wheel:macos:docker fi ls -la target/wheels/*.whl """ hide = true -["python:build:macos:docker"] +["python:build:macos"] +description = "Alias for build:python:wheel:macos" +depends = ["build:python:wheel:macos"] +hide = true + +["build:python:wheel:macos:docker"] description = "Build Python wheel for macOS arm64 from Docker" depends = ["python:proto"] run = """ @@ -181,9 +196,19 @@ ls -la target/wheels/*macosx*arm64.whl """ hide = true -["python:build:all"] +["python:build:macos:docker"] +description = "Alias for build:python:wheel:macos:docker" +depends = ["build:python:wheel:macos:docker"] +hide = true + +["build:python:wheel:all"] description = "Build Python wheels for Linux and macOS" -depends = ["python:build", "python:build:multiarch", "python:build:macos"] +depends = ["build:python:wheel", "build:python:wheel:multiarch", "build:python:wheel:macos"] +hide = true + +["python:build:all"] +description = "Alias for build:python:wheel:all" +depends = ["build:python:wheel:all"] hide = true ["python:lint"] @@ -216,7 +241,7 @@ run = """ set -euo pipefail VERSION=$(uv run python tasks/scripts/release.py get-version --python) CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo) -OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run python:build:all +OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run build:python:wheel:all uv run python tasks/scripts/release.py python-publish --version "$VERSION" """ hide = true From b1ef4c9b68fa5eec06c3befca1f45b2e7af44fff Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Tue, 17 Mar 2026 17:45:58 -0700 Subject: [PATCH 03/15] perf(build): reduce local cache invalidation Signed-off-by: Drew Newberry --- deploy/docker/Dockerfile.cli-macos | 3 +-- deploy/docker/Dockerfile.images | 2 +- deploy/docker/Dockerfile.python-wheels | 3 +-- deploy/docker/Dockerfile.python-wheels-macos | 3 +-- tasks/python.toml | 11 +++++++---- tasks/scripts/docker-build-image.sh | 2 +- 6 files 
changed, 12 insertions(+), 12 deletions(-) diff --git a/deploy/docker/Dockerfile.cli-macos b/deploy/docker/Dockerfile.cli-macos index c495d533..2e93acea 100644 --- a/deploy/docker/Dockerfile.cli-macos +++ b/deploy/docker/Dockerfile.cli-macos @@ -18,6 +18,7 @@ FROM ${OSXCROSS_IMAGE} AS osxcross FROM python:3.12-slim AS builder +ARG OPENSHELL_IMAGE_TAG ARG CARGO_TARGET_CACHE_SCOPE=default ENV PATH="/root/.cargo/bin:/usr/local/bin:/osxcross/bin:${PATH}" @@ -103,8 +104,6 @@ RUN touch crates/openshell-cli/src/main.rs \ crates/openshell-core/build.rs \ proto/*.proto -# Declare version ARG here (not earlier) so the git-hash-bearing value does not -# invalidate the expensive dependency-build layers above on every commit. ARG OPENSHELL_CARGO_VERSION RUN --mount=type=cache,id=cargo-registry-cli-macos,sharing=locked,target=/root/.cargo/registry \ --mount=type=cache,id=cargo-git-cli-macos,sharing=locked,target=/root/.cargo/git \ diff --git a/deploy/docker/Dockerfile.images b/deploy/docker/Dockerfile.images index 84edf449..b16436c1 100644 --- a/deploy/docker/Dockerfile.images +++ b/deploy/docker/Dockerfile.images @@ -22,7 +22,6 @@ ARG NVIDIA_CONTAINER_TOOLKIT_VERSION=1.18.2-1 FROM --platform=$BUILDPLATFORM rust:1.88-slim AS rust-builder-base ARG TARGETARCH ARG BUILDARCH -ARG OPENSHELL_CARGO_VERSION ARG CARGO_TARGET_CACHE_SCOPE=default ARG SCCACHE_MEMCACHED_ENDPOINT @@ -80,6 +79,7 @@ RUN --mount=type=cache,id=cargo-registry-${TARGETARCH},sharing=locked,target=/us . 
cross-build.sh && cargo_cross_build --release -p openshell-server -p openshell-sandbox FROM rust-deps AS rust-workspace +ARG OPENSHELL_CARGO_VERSION COPY crates/ crates/ diff --git a/deploy/docker/Dockerfile.python-wheels b/deploy/docker/Dockerfile.python-wheels index be0c62af..a8e299c3 100644 --- a/deploy/docker/Dockerfile.python-wheels +++ b/deploy/docker/Dockerfile.python-wheels @@ -28,6 +28,7 @@ FROM base AS builder ARG TARGETARCH ARG BUILDARCH +ARG OPENSHELL_IMAGE_TAG ARG CARGO_TARGET_CACHE_SCOPE=default ARG SCCACHE_MEMCACHED_ENDPOINT @@ -84,8 +85,6 @@ RUN touch crates/openshell-cli/src/main.rs \ crates/openshell-core/build.rs \ proto/*.proto -# Declare version ARG here (not earlier) so the git-hash-bearing value does not -# invalidate the expensive dependency-build layers above on every commit. ARG OPENSHELL_CARGO_VERSION RUN --mount=type=cache,id=cargo-registry-python-wheels-${TARGETARCH},sharing=locked,target=/root/.cargo/registry \ --mount=type=cache,id=cargo-git-python-wheels-${TARGETARCH},sharing=locked,target=/root/.cargo/git \ diff --git a/deploy/docker/Dockerfile.python-wheels-macos b/deploy/docker/Dockerfile.python-wheels-macos index d8885f97..ec16f411 100644 --- a/deploy/docker/Dockerfile.python-wheels-macos +++ b/deploy/docker/Dockerfile.python-wheels-macos @@ -12,6 +12,7 @@ FROM ${OSXCROSS_IMAGE} AS osxcross FROM python:${PYTHON_VERSION}-slim AS builder ARG TARGETARCH +ARG OPENSHELL_IMAGE_TAG ARG CARGO_TARGET_CACHE_SCOPE=default ENV PATH="/root/.cargo/bin:/usr/local/bin:/osxcross/bin:${PATH}" @@ -91,8 +92,6 @@ RUN touch crates/openshell-cli/src/main.rs \ crates/openshell-core/build.rs \ proto/*.proto -# Declare version ARG here (not earlier) so the git-hash-bearing value does not -# invalidate the expensive dependency-build layers above on every commit. 
ARG OPENSHELL_CARGO_VERSION RUN --mount=type=cache,id=cargo-registry-python-wheels-macos-${TARGETARCH},sharing=locked,target=/root/.cargo/registry \ --mount=type=cache,id=cargo-git-python-wheels-macos-${TARGETARCH},sharing=locked,target=/root/.cargo/git \ diff --git a/tasks/python.toml b/tasks/python.toml index 51bac54b..211e474c 100644 --- a/tasks/python.toml +++ b/tasks/python.toml @@ -45,7 +45,7 @@ sha256_16_stdin() { PLATFORMS=${DOCKER_PLATFORMS:-linux/amd64,linux/arm64} CARGO_VERSION=${OPENSHELL_CARGO_VERSION:-} -if [ -z "$CARGO_VERSION" ]; then +if [ -z "$CARGO_VERSION" ] && [ -n "${CI:-}" ]; then CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo) fi @@ -55,11 +55,14 @@ CACHE_SCOPE_INPUT="v1|python-wheels|base|${LOCK_HASH}|${RUST_SCOPE}" CARGO_TARGET_CACHE_SCOPE=$(printf '%s' "$CACHE_SCOPE_INPUT" | sha256_16_stdin) VERSION_ARGS=( - --build-arg "OPENSHELL_CARGO_VERSION=${CARGO_VERSION}" --build-arg "CARGO_TARGET_CACHE_SCOPE=${CARGO_TARGET_CACHE_SCOPE}" ${OPENSHELL_IMAGE_TAG:+--build-arg "OPENSHELL_IMAGE_TAG=${OPENSHELL_IMAGE_TAG}"} ) +if [ -n "$CARGO_VERSION" ]; then + VERSION_ARGS+=(--build-arg "OPENSHELL_CARGO_VERSION=${CARGO_VERSION}") +fi + if ! docker buildx inspect multiarch >/dev/null 2>&1; then echo "Creating multi-platform buildx builder..." 
docker buildx create --name multiarch --use --bootstrap @@ -171,7 +174,7 @@ sha256_16_stdin() { CARGO_VERSION=${OPENSHELL_CARGO_VERSION:-} OSXCROSS_IMAGE_REF=${OSXCROSS_IMAGE:-crazymax/osxcross:latest} -if [ -z "$CARGO_VERSION" ]; then +if [ -z "$CARGO_VERSION" ] && [ -n "${CI:-}" ]; then CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo) fi @@ -186,8 +189,8 @@ docker build \ -f deploy/docker/Dockerfile.python-wheels-macos \ --target wheels \ --build-arg "OSXCROSS_IMAGE=${OSXCROSS_IMAGE_REF}" \ - --build-arg "OPENSHELL_CARGO_VERSION=${CARGO_VERSION}" \ --build-arg "CARGO_TARGET_CACHE_SCOPE=${CARGO_TARGET_CACHE_SCOPE}" \ + ${CARGO_VERSION:+--build-arg "OPENSHELL_CARGO_VERSION=${CARGO_VERSION}"} \ ${OPENSHELL_IMAGE_TAG:+--build-arg "OPENSHELL_IMAGE_TAG=${OPENSHELL_IMAGE_TAG}"} \ --output type=local,dest=target/wheels \ . diff --git a/tasks/scripts/docker-build-image.sh b/tasks/scripts/docker-build-image.sh index 80f36786..8f29f57a 100755 --- a/tasks/scripts/docker-build-image.sh +++ b/tasks/scripts/docker-build-image.sh @@ -105,7 +105,7 @@ fi VERSION_ARGS=() if [[ -n "${OPENSHELL_CARGO_VERSION:-}" ]]; then VERSION_ARGS=(--build-arg "OPENSHELL_CARGO_VERSION=${OPENSHELL_CARGO_VERSION}") -else +elif [[ -n "${CI:-}" ]]; then CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo 2>/dev/null || true) if [[ -n "${CARGO_VERSION}" ]]; then VERSION_ARGS=(--build-arg "OPENSHELL_CARGO_VERSION=${CARGO_VERSION}") From d2ff7285f76f5bd0ab51f2bf9e01ab165810b875 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Tue, 17 Mar 2026 21:55:24 -0700 Subject: [PATCH 04/15] wip --- .../workflows/cluster-deploy-fast-nightly.yml | 98 +++ CONTRIBUTING.md | 45 +- architecture/build-containers.md | 14 + tasks/cluster.toml | 4 + .../scripts/cluster-deploy-fast-test-junit.py | 110 +++ tasks/scripts/cluster-deploy-fast-test.sh | 625 ++++++++++++++++++ 6 files changed, 872 insertions(+), 24 deletions(-) create mode 100644 
.github/workflows/cluster-deploy-fast-nightly.yml create mode 100644 tasks/scripts/cluster-deploy-fast-test-junit.py create mode 100755 tasks/scripts/cluster-deploy-fast-test.sh diff --git a/.github/workflows/cluster-deploy-fast-nightly.yml b/.github/workflows/cluster-deploy-fast-nightly.yml new file mode 100644 index 00000000..67c323f0 --- /dev/null +++ b/.github/workflows/cluster-deploy-fast-nightly.yml @@ -0,0 +1,98 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +name: Cluster Deploy Fast Nightly + +on: + workflow_dispatch: {} + schedule: + - cron: "0 8 * * *" # 08:00 UTC (12 AM PST, 1 AM PDT) + +permissions: + contents: read + packages: read + actions: read + checks: write + +concurrency: + group: cluster-deploy-fast-nightly + cancel-in-progress: false + +jobs: + fast-deploy-cache: + name: Fast Deploy Cache + runs-on: build-amd64 + timeout-minutes: 90 + container: + image: ghcr.io/nvidia/openshell/ci:latest + credentials: + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + options: --privileged + volumes: + - /var/run/docker.sock:/var/run/docker.sock + env: + MISE_GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} + FAST_DEPLOY_TEST_REPORT_DIR: ${{ github.workspace }}/.cache/cluster-deploy-fast-test/nightly-${{ github.run_id }} + steps: + - uses: actions/checkout@v4 + + - name: Log in to GitHub Container Registry + uses: docker/login-action@v3 + with: + registry: ghcr.io + username: ${{ github.actor }} + password: ${{ secrets.GITHUB_TOKEN }} + + - name: Install Python dependencies and generate protobuf stubs + run: uv sync --frozen && mise run --no-prepare python:proto + + - name: Bootstrap cluster + run: mise run --no-prepare --skip-deps cluster + + - name: Run fast deploy cache harness + id: harness + if: ${{ !cancelled() }} + run: | + set -euo pipefail + set +e + tasks/scripts/cluster-deploy-fast-test.sh + status=$?
+ set -e + junit_path="${FAST_DEPLOY_TEST_REPORT_DIR}/junit.xml" + if [ -f "${FAST_DEPLOY_TEST_REPORT_DIR}/summary.tsv" ]; then + uv run python tasks/scripts/cluster-deploy-fast-test-junit.py \ + --input "${FAST_DEPLOY_TEST_REPORT_DIR}/summary.tsv" \ + --output "${junit_path}" \ + --suite-name "cluster-deploy-fast-nightly" + echo "junit_path=${junit_path}" >> "$GITHUB_OUTPUT" + fi + if [ -f "${FAST_DEPLOY_TEST_REPORT_DIR}/summary.md" ]; then + echo "summary_path=${FAST_DEPLOY_TEST_REPORT_DIR}/summary.md" >> "$GITHUB_OUTPUT" + fi + exit "${status}" + + - name: Publish Markdown summary + if: ${{ always() && !cancelled() && steps.harness.outputs.summary_path != '' }} + run: cat "${{ steps.harness.outputs.summary_path }}" >> "$GITHUB_STEP_SUMMARY" + + - name: Upload fast deploy nightly artifacts + if: ${{ always() && !cancelled() }} + uses: actions/upload-artifact@v4 + with: + name: cluster-deploy-fast-nightly-${{ github.run_id }} + path: ${{ env.FAST_DEPLOY_TEST_REPORT_DIR }} + if-no-files-found: error + retention-days: 14 + + - name: Publish test report + if: ${{ always() && !cancelled() && steps.harness.outputs.junit_path != '' }} + uses: dorny/test-reporter@v2 + with: + name: Cluster Deploy Fast Nightly + path: ${{ steps.harness.outputs.junit_path }} + reporter: java-junit + fail-on-error: true + use-actions-summary: true + report-title: Cluster Deploy Fast Nightly + badge-title: fast-deploy diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 249a6b47..c25d30b9 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -63,24 +63,24 @@ This project ships with [agent skills](#agent-skills-for-contributors) that can Skills live in `.agents/skills/`. Your agent's harness can discover and load them natively. 
Here is the full inventory: -| Category | Skill | Purpose | -|----------|-------|---------| -| Getting Started | `openshell-cli` | CLI usage, sandbox lifecycle, provider management, BYOC workflows | -| Getting Started | `debug-openshell-cluster` | Diagnose cluster startup failures and health issues | -| Getting Started | `debug-inference` | Diagnose `inference.local`, host-backed local inference, and direct external inference setup issues | -| Contributing | `create-spike` | Investigate a problem, produce a structured GitHub issue | -| Contributing | `build-from-issue` | Plan and implement work from a GitHub issue (maintainer workflow) | -| Contributing | `create-github-issue` | Create well-structured GitHub issues | -| Contributing | `create-github-pr` | Create pull requests with proper conventions | -| Reviewing | `review-github-pr` | Summarize PR diffs and key design decisions | -| Reviewing | `review-security-issue` | Assess security issues for severity and remediation | -| Reviewing | `watch-github-actions` | Monitor CI pipeline status and logs | -| Triage | `triage-issue` | Assess, classify, and route community-filed issues | -| Platform | `generate-sandbox-policy` | Generate YAML sandbox policies from requirements or API docs | -| Platform | `tui-development` | Development guide for the ratatui-based terminal UI | -| Documentation | `update-docs` | Scan recent commits and draft doc updates for user-facing changes | -| Maintenance | `sync-agent-infra` | Detect and fix drift across agent-first infrastructure files | -| Reference | `sbom` | Generate SBOMs and resolve dependency licenses | +| Category | Skill | Purpose | +| --------------- | ------------------------- | --------------------------------------------------------------------------------------------------- | +| Getting Started | `openshell-cli` | CLI usage, sandbox lifecycle, provider management, BYOC workflows | +| Getting Started | `debug-openshell-cluster` | Diagnose cluster startup failures and 
health issues | +| Getting Started | `debug-inference` | Diagnose `inference.local`, host-backed local inference, and direct external inference setup issues | +| Contributing | `create-spike` | Investigate a problem, produce a structured GitHub issue | +| Contributing | `build-from-issue` | Plan and implement work from a GitHub issue (maintainer workflow) | +| Contributing | `create-github-issue` | Create well-structured GitHub issues | +| Contributing | `create-github-pr` | Create pull requests with proper conventions | +| Reviewing | `review-github-pr` | Summarize PR diffs and key design decisions | +| Reviewing | `review-security-issue` | Assess security issues for severity and remediation | +| Reviewing | `watch-github-actions` | Monitor CI pipeline status and logs | +| Triage | `triage-issue` | Assess, classify, and route community-filed issues | +| Platform | `generate-sandbox-policy` | Generate YAML sandbox policies from requirements or API docs | +| Platform | `tui-development` | Development guide for the ratatui-based terminal UI | +| Documentation | `update-docs` | Scan recent commits and draft doc updates for user-facing changes | +| Maintenance | `sync-agent-infra` | Detect and fix drift across agent-first infrastructure files | +| Reference | `sbom` | Generate SBOMs and resolve dependency licenses | ### Workflow Chains @@ -148,10 +148,10 @@ openshell sandbox create -- codex Two additional scripts in `scripts/bin/` provide gateway-aware wrappers for cluster debugging: -| Script | What it does | -|--------|-------------| +| Script | What it does | +| --------- | ------------------------------------------------------------------------------------ | | `kubectl` | Runs `kubectl` inside the active gateway's k3s container via `openshell doctor exec` | -| `k9s` | Runs `k9s` inside the active gateway's k3s container via `openshell doctor exec` | +| `k9s` | Runs `k9s` inside the active gateway's k3s container via `openshell doctor exec` | These work for both 
local and remote gateways (SSH is handled automatically). Examples: @@ -168,9 +168,6 @@ These are the primary `mise` tasks for day-to-day development: | Task | Purpose | | ------------------ | ------------------------------------------------------- | -| `mise run build` | Build the whole project (Rust crates, Docker images, Python wheel) | -| `mise run build:docker:gateway` | Build one Docker image using the `build:[tool]:[component]` layout | -| `mise run build:python:wheel` | Build one Python wheel using the `build:[tool]:[component]` layout | | `mise run cluster` | Bootstrap or incremental deploy | | `mise run sandbox` | Create a sandbox on the running cluster | | `mise run test` | Default test suite | diff --git a/architecture/build-containers.md b/architecture/build-containers.md index 2e0d664d..28f2e9ad 100644 --- a/architecture/build-containers.md +++ b/architecture/build-containers.md @@ -58,3 +58,17 @@ mise run cluster -- supervisor # rebuild supervisor only mise run cluster -- chart # helm upgrade only mise run cluster -- all # rebuild everything ``` + +To validate incremental routing and BuildKit cache reuse locally, run: + +```bash +mise run cluster:test:fast-deploy-cache +``` + +The harness runs isolated scenarios in temporary git worktrees, keeps its own state and cache under `.cache/cluster-deploy-fast-test/`, and writes a Markdown summary with: + +- auto-detection checks for gateway-only, supervisor-only, shared, Helm-only, unrelated, and explicit-target changes +- cold vs warm rebuild comparisons for gateway and supervisor code changes +- container-ID invalidation coverage to verify gateway + Helm are retriggered when the cluster container changes + +Nightly CI runs the same harness in `.github/workflows/cluster-deploy-fast-nightly.yml`, uploads the full report directory as an artifact, and publishes the generated JUnit summary through `dorny/test-reporter` so failures and regressions are visible from the workflow run. 
diff --git a/tasks/cluster.toml b/tasks/cluster.toml index da06178f..6ddb0526 100644 --- a/tasks/cluster.toml +++ b/tasks/cluster.toml @@ -30,6 +30,10 @@ description = "Pull-mode deploy using local registry pushes" run = "tasks/scripts/cluster-deploy-fast.sh all" hide = true +["cluster:test:fast-deploy-cache"] +description = "Validate fast deploy routing and image cache reuse across component changes" +run = "tasks/scripts/cluster-deploy-fast-test.sh" + ["cluster:push:gateway"] description = "Tag and push gateway image to pull registry" run = "tasks/scripts/cluster-push-component.sh gateway" diff --git a/tasks/scripts/cluster-deploy-fast-test-junit.py b/tasks/scripts/cluster-deploy-fast-test-junit.py new file mode 100644 index 00000000..94bd862e --- /dev/null +++ b/tasks/scripts/cluster-deploy-fast-test-junit.py @@ -0,0 +1,110 @@ +#!/usr/bin/env python3 + +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +from __future__ import annotations + +import argparse +import csv +import sys +import xml.etree.ElementTree as ET +from pathlib import Path + + +def parse_args() -> argparse.Namespace: + parser = argparse.ArgumentParser( + description="Convert cluster-deploy-fast TSV summary into JUnit XML." 
+ ) + parser.add_argument("--input", required=True, help="Path to summary.tsv") + parser.add_argument("--output", required=True, help="Path to junit.xml") + parser.add_argument( + "--suite-name", + default="cluster-deploy-fast-nightly", + help="JUnit testsuite name", + ) + return parser.parse_args() + + +def as_seconds(value: str) -> str: + if not value or value == "n/a": + return "0" + try: + return str(float(value)) + except ValueError: + return "0" + + +def build_xml(rows: list[dict[str, str]], suite_name: str) -> ET.ElementTree: + total = len(rows) + failures = sum(1 for row in rows if row["pass"] == "FAIL") + skipped = sum(1 for row in rows if row["pass"] == "INFO") + suite = ET.Element( + "testsuite", + name=suite_name, + tests=str(total), + failures=str(failures), + skipped=str(skipped), + ) + + for row in rows: + case = ET.SubElement( + suite, + "testcase", + classname=f"cluster_deploy_fast.{row['mode']}", + name=row["scenario"], + time=as_seconds(row["total_seconds"]), + ) + + if row["pass"] == "FAIL": + failure = ET.SubElement( + case, + "failure", + message=row["notes"] or "scenario failed", + ) + failure.text = ( + f"Expected: {row['expected']}\n" + f"Observed: {row['observed']}\n" + f"Notes: {row['notes']}" + ) + elif row["pass"] == "INFO": + skipped = ET.SubElement( + case, + "skipped", + message=row["notes"] or "informational baseline", + ) + skipped.text = f"Observed: {row['observed']}" + + properties = ET.SubElement(case, "properties") + for key in ( + "expected", + "observed", + "build_seconds", + "cached_lines", + "notes", + ): + ET.SubElement(properties, "property", name=key, value=row[key]) + + return ET.ElementTree(suite) + + +def main() -> int: + args = parse_args() + input_path = Path(args.input) + output_path = Path(args.output) + + with input_path.open(newline="", encoding="utf-8") as handle: + rows = list(csv.DictReader(handle, delimiter="\t")) + + if not rows: + print(f"no rows found in {input_path}", file=sys.stderr) + return 1 + + 
output_path.parent.mkdir(parents=True, exist_ok=True) + tree = build_xml(rows, args.suite_name) + tree.write(output_path, encoding="utf-8", xml_declaration=True) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tasks/scripts/cluster-deploy-fast-test.sh b/tasks/scripts/cluster-deploy-fast-test.sh new file mode 100755 index 00000000..5d3048d0 --- /dev/null +++ b/tasks/scripts/cluster-deploy-fast-test.sh @@ -0,0 +1,625 @@ +#!/usr/bin/env bash + +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +set -euo pipefail + +usage() { + cat <<'EOF' +Usage: cluster-deploy-fast-test.sh [scenario...] + +Repeatable validation harness for tasks/scripts/cluster-deploy-fast.sh. + +Scenarios: + noop Validate clean-tree auto deploy is a no-op after state is primed + gateway-auto Gateway-only change triggers gateway rebuild + Helm upgrade + supervisor-auto Supervisor-only change triggers supervisor refresh only + shared-auto Shared change triggers gateway + supervisor rebuild + helm-auto Helm-only change triggers Helm upgrade only + unrelated-auto Unrelated change stays a no-op + explicit-targets Explicit targets override change detection + gateway-cache Compare cold vs warm gateway rebuild after a code change + supervisor-cache Compare cold vs warm supervisor rebuild after a code change + container-invalidation Mismatched container ID invalidates gateway + Helm state + +If no scenarios are provided, the full suite runs. 
+ +Environment: + CLUSTER_NAME Override cluster name to test against + FAST_DEPLOY_TEST_REPORT_DIR Output directory (default: .cache/cluster-deploy-fast-test/) + FAST_DEPLOY_TEST_KEEP_WORKTREES Keep temporary worktrees when set to 1 + FAST_DEPLOY_TEST_SKIP_CACHE Skip the cache timing scenarios when set to 1 +EOF +} + +if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then + usage + exit 0 +fi + +SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) +REPO_ROOT=$(cd "${SCRIPT_DIR}/../.." && pwd) +RUN_ID=$(date +"%Y%m%d-%H%M%S") +REPORT_DIR=${FAST_DEPLOY_TEST_REPORT_DIR:-"${REPO_ROOT}/.cache/cluster-deploy-fast-test/${RUN_ID}"} +WORKTREE_ROOT="${REPORT_DIR}/worktrees" +LOG_DIR="${REPORT_DIR}/logs" +STATE_DIR="${REPORT_DIR}/state" +CACHE_DIR="${REPORT_DIR}/buildkit-cache" +SUMMARY_TSV="${REPORT_DIR}/summary.tsv" +SUMMARY_MD="${REPORT_DIR}/summary.md" +KEEP_WORKTREES=${FAST_DEPLOY_TEST_KEEP_WORKTREES:-0} +SKIP_CACHE=${FAST_DEPLOY_TEST_SKIP_CACHE:-0} + +normalize_name() { + echo "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9-]/-/g' | sed 's/--*/-/g' | sed 's/^-//;s/-$//' +} + +ROOT_BASENAME=$(basename "${REPO_ROOT}") +CLUSTER_NAME=${CLUSTER_NAME:-$(normalize_name "${ROOT_BASENAME}")} + +mkdir -p "${WORKTREE_ROOT}" "${LOG_DIR}" "${STATE_DIR}" "${CACHE_DIR}" + +declare -a SCENARIOS=() +if [[ "$#" -gt 0 ]]; then + SCENARIOS=("$@") +else + SCENARIOS=( + noop + gateway-auto + supervisor-auto + shared-auto + helm-auto + unrelated-auto + explicit-targets + gateway-cache + supervisor-cache + container-invalidation + ) +fi + +if [[ "${SKIP_CACHE}" == "1" ]]; then + declare -a filtered=() + filtered=() + for scenario in "${SCENARIOS[@]}"; do + if [[ "${scenario}" != "gateway-cache" && "${scenario}" != "supervisor-cache" ]]; then + filtered+=("${scenario}") + fi + done + SCENARIOS=("${filtered[@]}") +fi + +declare -a CREATED_WORKTREES=() + +cleanup() { + if [[ "${KEEP_WORKTREES}" == "1" ]]; then + return + fi + + local dir + for dir in "${CREATED_WORKTREES[@]:-}"; do + if 
[[ -d "${dir}" ]]; then + git -C "${REPO_ROOT}" worktree remove --force "${dir}" >/dev/null 2>&1 || true + fi + done +} +trap cleanup EXIT + +buildx_driver() { + local -a builder_args=() + local ctx + + if [[ -n "${DOCKER_BUILDER:-}" ]]; then + builder_args=(--builder "${DOCKER_BUILDER}") + elif [[ -z "${DOCKER_PLATFORM:-}" && -z "${CI:-}" ]]; then + ctx=$(docker context inspect --format '{{.Name}}' 2>/dev/null || echo default) + builder_args=(--builder "${ctx}") + fi + + docker buildx inspect ${builder_args[@]+"${builder_args[@]}"} 2>/dev/null \ + | awk -F': ' '/Driver:/ {gsub(/^[[:space:]]+|[[:space:]]+$/, "", $2); print $2; exit}' +} + +current_cluster_container_id() { + docker inspect --format '{{.Id}}' "openshell-cluster-${CLUSTER_NAME}" 2>/dev/null || true +} + +require_cluster() { + if ! docker ps -q --filter "name=^openshell-cluster-${CLUSTER_NAME}$" --filter "health=healthy" | grep -q .; then + echo "Error: cluster container 'openshell-cluster-${CLUSTER_NAME}' is not running or healthy." 
>&2 + echo "Start it first with: mise run cluster" >&2 + exit 1 + fi +} + +create_worktree() { + local name=$1 + local dir="${WORKTREE_ROOT}/${name}" + rm -rf "${dir}" + git -C "${REPO_ROOT}" worktree add --detach "${dir}" HEAD >/dev/null + CREATED_WORKTREES+=("${dir}") + printf '%s\n' "${dir}" +} + +append_marker() { + local file=$1 + local marker=$2 + printf '\n%s\n' "${marker}" >> "${file}" +} + +extract_plan_value() { + local log_file=$1 + local label=$2 + awk -F': +' -v pattern="${label}" '$0 ~ pattern {print $2; exit}' "${log_file}" +} + +extract_duration() { + local log_file=$1 + local label=$2 + awk -v prefix="${label} took " 'index($0, prefix) == 1 {sub(/^.* took /, "", $0); sub(/s$/, "", $0); print; exit}' "${log_file}" +} + +count_cached_lines() { + local log_file=$1 + grep -c " CACHED" "${log_file}" 2>/dev/null || true +} + +check_required_patterns() { + local log_file=$1 + local patterns=${2:-} + local pattern + + if [[ -z "${patterns}" ]]; then + return 0 + fi + + IFS='|' read -r -a pattern_array <<< "${patterns}" + for pattern in "${pattern_array[@]}"; do + if ! 
grep -Fq "${pattern}" "${log_file}"; then + return 1 + fi + done + + return 0 +} + +record_result() { + local scenario=$1 + local mode=$2 + local expected=$3 + local observed=$4 + local pass=$5 + local total_duration=$6 + local build_duration=$7 + local cached_lines=$8 + local notes=$9 + + printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \ + "${scenario}" "${mode}" "${expected}" "${observed}" "${pass}" \ + "${total_duration}" "${build_duration}" "${cached_lines}" "${notes}" >> "${SUMMARY_TSV}" +} + +write_summary_md() { + { + echo "# Fast Deploy Cache Test Summary" + echo + echo "- Cluster: \`${CLUSTER_NAME}\`" + echo "- Buildx driver: \`${BUILDX_DRIVER:-unknown}\`" + echo "- Report dir: \`${REPORT_DIR}\`" + echo + echo "| Scenario | Mode | Expected | Observed | Pass | Total (s) | Builds (s) | Cached lines | Notes |" + echo "|---|---|---|---|---|---:|---:|---:|---|" + awk -F '\t' 'NR > 1 {printf "| %s | %s | `%s` | `%s` | %s | %s | %s | %s | %s |\n", $1, $2, $3, $4, $5, $6, $7, $8, $9}' "${SUMMARY_TSV}" + } > "${SUMMARY_MD}" +} + +run_fast_deploy() { + local worktree=$1 + local state_file=$2 + local log_file=$3 + shift 3 + + local start end status + start=$(date +%s) + ( + cd "${worktree}" + env \ + BUILDKIT_PROGRESS=plain \ + CLUSTER_NAME="${CLUSTER_NAME}" \ + DEPLOY_FAST_STATE_FILE="${state_file}" \ + DOCKER_BUILD_CACHE_DIR="${CACHE_DIR}" \ + "$@" \ + ./tasks/scripts/cluster-deploy-fast.sh + ) >"${log_file}" 2>&1 || status=$? + status=${status:-0} + end=$(date +%s) + printf '%s\n' $((end - start)) + return "${status}" +} + +run_fast_deploy_args() { + local worktree=$1 + local state_file=$2 + local log_file=$3 + shift 3 + + local start end status + start=$(date +%s) + ( + cd "${worktree}" + env \ + BUILDKIT_PROGRESS=plain \ + CLUSTER_NAME="${CLUSTER_NAME}" \ + DEPLOY_FAST_STATE_FILE="${state_file}" \ + DOCKER_BUILD_CACHE_DIR="${CACHE_DIR}" \ + ./tasks/scripts/cluster-deploy-fast.sh "$@" + ) >"${log_file}" 2>&1 || status=$? 
+ status=${status:-0} + end=$(date +%s) + printf '%s\n' $((end - start)) + return "${status}" +} + +validate_plan() { + local log_file=$1 + local expected_gateway=$2 + local expected_supervisor=$3 + local expected_helm=$4 + + local gateway supervisor helm + gateway=$(extract_plan_value "${log_file}" "build gateway") + supervisor=$(extract_plan_value "${log_file}" "build supervisor") + helm=$(extract_plan_value "${log_file}" "helm upgrade") + + if [[ "${gateway}" == "${expected_gateway}" && "${supervisor}" == "${expected_supervisor}" && "${helm}" == "${expected_helm}" ]]; then + printf '%s\n' "build gateway=${gateway}, build supervisor=${supervisor}, helm upgrade=${helm}" + return 0 + fi + + printf '%s\n' "build gateway=${gateway:-missing}, build supervisor=${supervisor:-missing}, helm upgrade=${helm:-missing}" + return 1 +} + +clear_cache() { + rm -rf "${CACHE_DIR}" + mkdir -p "${CACHE_DIR}" +} + +prime_state() { + local name=$1 + local worktree state_file log_file + worktree=$(create_worktree "${name}-prime") + state_file="${STATE_DIR}/${name}.state" + log_file="${LOG_DIR}/${name}-prime.log" + run_fast_deploy "${worktree}" "${state_file}" "${log_file}" >/dev/null +} + +run_auto_scenario() { + local scenario=$1 + local file=$2 + local marker=$3 + local expected_gateway=$4 + local expected_supervisor=$5 + local expected_helm=$6 + local note=$7 + local required_patterns=${8:-} + + local worktree state_file log_file total_duration build_duration observed pass + worktree=$(create_worktree "${scenario}") + state_file="${STATE_DIR}/${scenario}.state" + log_file="${LOG_DIR}/${scenario}.log" + + prime_state "${scenario}" + append_marker "${worktree}/${file}" "${marker}" + + total_duration=$(run_fast_deploy "${worktree}" "${state_file}" "${log_file}") + build_duration=$(extract_duration "${log_file}" "Builds") + if observed=$(validate_plan "${log_file}" "${expected_gateway}" "${expected_supervisor}" "${expected_helm}"); then + pass=PASS + else + pass=FAIL + fi + + if [[ 
"${pass}" == "PASS" ]] && ! check_required_patterns "${log_file}" "${required_patterns}"; then + pass=FAIL + note="${note}; missing expected deploy log pattern" + fi + + record_result \ + "${scenario}" \ + "auto" \ + "build gateway=${expected_gateway}, build supervisor=${expected_supervisor}, helm upgrade=${expected_helm}" \ + "${observed}" \ + "${pass}" \ + "${total_duration}" \ + "${build_duration:-n/a}" \ + "$(count_cached_lines "${log_file}")" \ + "${note}" +} + +run_noop_scenario() { + local scenario=noop + local worktree state_file log_file total_duration build_duration observed pass notes + worktree=$(create_worktree "${scenario}") + state_file="${STATE_DIR}/${scenario}.state" + log_file="${LOG_DIR}/${scenario}.log" + + prime_state "${scenario}" + + total_duration=$(run_fast_deploy "${worktree}" "${state_file}" "${log_file}") + build_duration=$(extract_duration "${log_file}" "Builds") + if observed=$(validate_plan "${log_file}" 0 0 0); then + pass=PASS + else + pass=FAIL + fi + notes="clean tree should print no-op plan" + + if ! grep -q "No new local changes since last deploy." 
"${log_file}"; then + pass=FAIL + notes="missing no-op message" + fi + + record_result \ + "${scenario}" \ + "auto" \ + "build gateway=0, build supervisor=0, helm upgrade=0" \ + "${observed}" \ + "${pass}" \ + "${total_duration}" \ + "${build_duration:-n/a}" \ + "$(count_cached_lines "${log_file}")" \ + "${notes}" +} + +run_explicit_targets_scenario() { + local scenario=explicit-targets + local target worktree state_file log_file total_duration build_duration observed pass expected notes + + for target in gateway supervisor chart all; do + worktree=$(create_worktree "${scenario}-${target}") + state_file="${STATE_DIR}/${scenario}-${target}.state" + log_file="${LOG_DIR}/${scenario}-${target}.log" + + total_duration=$(run_fast_deploy_args "${worktree}" "${state_file}" "${log_file}" "${target}") + build_duration=$(extract_duration "${log_file}" "Builds") + + case "${target}" in + gateway) + if observed=$(validate_plan "${log_file}" 1 0 1); then + pass=PASS + else + pass=FAIL + fi + expected="build gateway=1, build supervisor=0, helm upgrade=1" + ;; + supervisor) + if observed=$(validate_plan "${log_file}" 0 1 0); then + pass=PASS + else + pass=FAIL + fi + expected="build gateway=0, build supervisor=1, helm upgrade=0" + ;; + chart) + if observed=$(validate_plan "${log_file}" 0 0 1); then + pass=PASS + else + pass=FAIL + fi + expected="build gateway=0, build supervisor=0, helm upgrade=1" + ;; + all) + if observed=$(validate_plan "${log_file}" 1 1 1); then + pass=PASS + else + pass=FAIL + fi + expected="build gateway=1, build supervisor=1, helm upgrade=1" + ;; + esac + notes="explicit target ${target}" + record_result \ + "${scenario}:${target}" \ + "explicit" \ + "${expected}" \ + "${observed}" \ + "${pass}" \ + "${total_duration}" \ + "${build_duration:-n/a}" \ + "$(count_cached_lines "${log_file}")" \ + "${notes}" + done +} + +run_cache_scenario() { + local scenario=$1 + local file=$2 + local marker=$3 + local target=$4 + + local worktree state_file cold_log warm_log 
cold_total warm_total cold_build warm_build cold_cached warm_cached pass notes + worktree=$(create_worktree "${scenario}") + state_file="${STATE_DIR}/${scenario}.state" + cold_log="${LOG_DIR}/${scenario}-cold.log" + warm_log="${LOG_DIR}/${scenario}-warm.log" + + append_marker "${worktree}/${file}" "${marker}" + clear_cache + + cold_total=$(run_fast_deploy_args "${worktree}" "${state_file}" "${cold_log}" "${target}") + cold_build=$(extract_duration "${cold_log}" "Builds") + cold_cached=$(count_cached_lines "${cold_log}") + + warm_total=$(run_fast_deploy_args "${worktree}" "${state_file}" "${warm_log}" "${target}") + warm_build=$(extract_duration "${warm_log}" "Builds") + warm_cached=$(count_cached_lines "${warm_log}") + + pass=FAIL + notes="warm rebuild should be faster or show cache hits" + + if [[ -n "${cold_build:-}" && -n "${warm_build:-}" && "${cold_build}" =~ ^[0-9]+$ && "${warm_build}" =~ ^[0-9]+$ && "${cold_build}" -gt 0 ]]; then + if [[ "${warm_build}" -le $((cold_build * 70 / 100)) ]]; then + pass=PASS + notes="warm build improved by at least 30%" + fi + fi + + if [[ "${pass}" != "PASS" && "${warm_cached}" =~ ^[0-9]+$ && "${warm_cached}" -gt "${cold_cached:-0}" ]]; then + pass=PASS + notes="warm build showed more cache hits" + fi + + record_result \ + "${scenario}:cold" \ + "cache" \ + "first rebuild of ${target} after cache clear" \ + "total=${cold_total}s, builds=${cold_build:-n/a}s" \ + "INFO" \ + "${cold_total}" \ + "${cold_build:-n/a}" \ + "${cold_cached}" \ + "baseline cold run" + + record_result \ + "${scenario}:warm" \ + "cache" \ + "second rebuild of ${target} should reuse cache" \ + "total=${warm_total}s, builds=${warm_build:-n/a}s" \ + "${pass}" \ + "${warm_total}" \ + "${warm_build:-n/a}" \ + "${warm_cached}" \ + "${notes}" +} + +run_container_invalidation_scenario() { + local scenario=container-invalidation + local worktree state_file prime_log rerun_log total_duration build_duration observed pass container_id notes + 
worktree=$(create_worktree "${scenario}") + state_file="${STATE_DIR}/${scenario}.state" + prime_log="${LOG_DIR}/${scenario}-prime.log" + rerun_log="${LOG_DIR}/${scenario}.log" + + run_fast_deploy "${worktree}" "${state_file}" "${prime_log}" >/dev/null + container_id=$(current_cluster_container_id) + if [[ -z "${container_id}" ]]; then + echo "Error: could not determine cluster container ID for invalidation scenario." >&2 + exit 1 + fi + + sed -i.bak "s|^container_id=.*$|container_id=invalidated-${container_id#sha256:}|" "${state_file}" + rm -f "${state_file}.bak" + + total_duration=$(run_fast_deploy "${worktree}" "${state_file}" "${rerun_log}") + build_duration=$(extract_duration "${rerun_log}" "Builds") + if observed=$(validate_plan "${rerun_log}" 1 0 1); then + pass=PASS + else + pass=FAIL + fi + notes="mismatched container ID should invalidate gateway and helm only" + + if [[ "${pass}" == "PASS" ]] && ! check_required_patterns "${rerun_log}" "Restarting gateway to pick up updated image...|Upgrading helm release..."; then + pass=FAIL + notes="${notes}; missing expected deploy log pattern" + fi + + record_result \ + "${scenario}" \ + "auto" \ + "build gateway=1, build supervisor=0, helm upgrade=1" \ + "${observed}" \ + "${pass}" \ + "${total_duration}" \ + "${build_duration:-n/a}" \ + "$(count_cached_lines "${rerun_log}")" \ + "${notes}" +} + +printf 'scenario\tmode\texpected\tobserved\tpass\ttotal_seconds\tbuild_seconds\tcached_lines\tnotes\n' > "${SUMMARY_TSV}" + +require_cluster +BUILDX_DRIVER=$(buildx_driver || true) + +for scenario in "${SCENARIOS[@]}"; do + case "${scenario}" in + noop) + run_noop_scenario + ;; + gateway-auto) + run_auto_scenario \ + "gateway-auto" \ + "crates/openshell-server/src/main.rs" \ + "// fast deploy cache test: gateway-auto ${RUN_ID}" \ + 1 0 1 \ + "gateway-only source change" \ + "Pushing updated images to local registry...|Restarting gateway to pick up updated image...|Upgrading helm release..." 
+ ;; + supervisor-auto) + run_auto_scenario \ + "supervisor-auto" \ + "crates/openshell-sandbox/src/main.rs" \ + "// fast deploy cache test: supervisor-auto ${RUN_ID}" \ + 0 1 0 \ + "supervisor-only source change" \ + "Supervisor binary updated on cluster node." + ;; + shared-auto) + run_auto_scenario \ + "shared-auto" \ + "crates/openshell-policy/src/lib.rs" \ + "// fast deploy cache test: shared-auto ${RUN_ID}" \ + 1 1 0 \ + "shared dependency change should rebuild both binaries" \ + "Restarting gateway to pick up updated image...|Supervisor binary updated on cluster node." + ;; + helm-auto) + run_auto_scenario \ + "helm-auto" \ + "deploy/helm/openshell/values.yaml" \ + "# fast deploy cache test: helm-auto ${RUN_ID}" \ + 0 0 1 \ + "chart-only change" \ + "Upgrading helm release..." + ;; + unrelated-auto) + run_auto_scenario \ + "unrelated-auto" \ + "README.md" \ + "" \ + 0 0 0 \ + "unrelated file should stay a no-op" \ + "No new local changes since last deploy." + ;; + explicit-targets) + run_explicit_targets_scenario + ;; + gateway-cache) + run_cache_scenario \ + "gateway-cache" \ + "crates/openshell-server/src/main.rs" \ + "// fast deploy cache test: gateway-cache ${RUN_ID}" \ + "gateway" + ;; + supervisor-cache) + run_cache_scenario \ + "supervisor-cache" \ + "crates/openshell-sandbox/src/main.rs" \ + "// fast deploy cache test: supervisor-cache ${RUN_ID}" \ + "supervisor" + ;; + container-invalidation) + run_container_invalidation_scenario + ;; + *) + echo "Unknown scenario '${scenario}'" >&2 + exit 1 + ;; + esac +done + +write_summary_md + +echo "Fast deploy cache test report written to:" +echo " ${SUMMARY_MD}" From 7cc6c9c38aaecbf4ef908679e5e5bf40a7fbc734 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Wed, 18 Mar 2026 10:26:24 -0700 Subject: [PATCH 05/15] wip --- scripts/build-benchmark/README.md | 43 ++ .../cluster-deploy-fast-test.sh | 609 ++++++++++++++++++ 2 files changed, 652 insertions(+) create mode 100644 scripts/build-benchmark/README.md create 
mode 100755 scripts/build-benchmark/cluster-deploy-fast-test.sh
diff --git a/scripts/build-benchmark/README.md b/scripts/build-benchmark/README.md
new file mode 100644
index 00000000..9b6834ec
--- /dev/null
+++ b/scripts/build-benchmark/README.md
@@ -0,0 +1,43 @@
+# Build Benchmark
+
+Validation harness for cluster deploys. Tests change detection, build routing, and image cache reuse across component changes. All operations run through `mise run cluster` to replicate the real user workflow.
+
+## Usage
+
+The script bootstraps a cluster automatically via `mise run cluster` if one isn't already running.
+
+Run the full suite:
+
+```sh
+scripts/build-benchmark/cluster-deploy-fast-test.sh
+```
+
+Run specific scenarios:
+
+```sh
+scripts/build-benchmark/cluster-deploy-fast-test.sh noop gateway-auto supervisor-cache
+```
+
+## Scenarios
+
+| Scenario | Description |
+|---|---|
+| `noop` | Clean tree is a no-op after state is primed |
+| `gateway-auto` | Gateway-only change triggers gateway rebuild + Helm upgrade |
+| `supervisor-auto` | Supervisor-only change triggers supervisor refresh only |
+| `shared-auto` | Shared dependency change triggers both rebuilds |
+| `helm-auto` | Helm-only change triggers Helm upgrade only |
+| `unrelated-auto` | Unrelated file change stays a no-op |
+| `explicit-targets` | Explicit targets override change detection |
+| `gateway-cache` | Cold vs warm gateway rebuild comparison |
+| `supervisor-cache` | Cold vs warm supervisor rebuild comparison |
+| `container-invalidation` | Mismatched container ID invalidates gateway + Helm state |
+
+## Environment Variables
+
+| Variable | Description |
+|---|---|
+| `CLUSTER_NAME` | Override cluster name to test against |
+| `FAST_DEPLOY_TEST_REPORT_DIR` | Output directory (default: `.cache/cluster-deploy-fast-test/`) |
+| `FAST_DEPLOY_TEST_KEEP_WORKTREES` | Set to `1` to keep temporary worktrees |
+| `FAST_DEPLOY_TEST_SKIP_CACHE` | Set to `1` to skip cache timing scenarios |
diff --git
a/scripts/build-benchmark/cluster-deploy-fast-test.sh b/scripts/build-benchmark/cluster-deploy-fast-test.sh new file mode 100755 index 00000000..c0ef6609 --- /dev/null +++ b/scripts/build-benchmark/cluster-deploy-fast-test.sh @@ -0,0 +1,609 @@ +#!/usr/bin/env bash + +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +set -euo pipefail + +usage() { + cat <<'EOF' +Usage: cluster-deploy-fast-test.sh [scenario...] + +Repeatable validation harness for tasks/scripts/cluster-deploy-fast.sh. + +Scenarios: + noop Validate clean-tree auto deploy is a no-op after state is primed + gateway-auto Gateway-only change triggers gateway rebuild + Helm upgrade + supervisor-auto Supervisor-only change triggers supervisor refresh only + shared-auto Shared change triggers gateway + supervisor rebuild + helm-auto Helm-only change triggers Helm upgrade only + unrelated-auto Unrelated change stays a no-op + explicit-targets Explicit targets override change detection + gateway-cache Compare cold vs warm gateway rebuild after a code change + supervisor-cache Compare cold vs warm supervisor rebuild after a code change + container-invalidation Mismatched container ID invalidates gateway + Helm state + +If no scenarios are provided, the full suite runs. + +Environment: + CLUSTER_NAME Override cluster name to test against + FAST_DEPLOY_TEST_REPORT_DIR Output directory (default: .cache/cluster-deploy-fast-test/) + FAST_DEPLOY_TEST_KEEP_WORKTREES Keep temporary worktrees when set to 1 + FAST_DEPLOY_TEST_SKIP_CACHE Skip the cache timing scenarios when set to 1 +EOF +} + +if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then + usage + exit 0 +fi + +SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd) +REPO_ROOT=$(cd "${SCRIPT_DIR}/../.." 
&& pwd) +RUN_ID=$(date +"%Y%m%d-%H%M%S") +REPORT_DIR=${FAST_DEPLOY_TEST_REPORT_DIR:-"${REPO_ROOT}/.cache/cluster-deploy-fast-test/${RUN_ID}"} +WORKTREE_ROOT="${REPORT_DIR}/worktrees" +LOG_DIR="${REPORT_DIR}/logs" +STATE_DIR="${REPORT_DIR}/state" +CACHE_DIR="${REPORT_DIR}/buildkit-cache" +SUMMARY_TSV="${REPORT_DIR}/summary.tsv" +SUMMARY_MD="${REPORT_DIR}/summary.md" +KEEP_WORKTREES=${FAST_DEPLOY_TEST_KEEP_WORKTREES:-0} +SKIP_CACHE=${FAST_DEPLOY_TEST_SKIP_CACHE:-0} + +normalize_name() { + echo "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9-]/-/g' | sed 's/--*/-/g' | sed 's/^-//;s/-$//' +} + +ROOT_BASENAME=$(basename "${REPO_ROOT}") +CLUSTER_NAME=${CLUSTER_NAME:-$(normalize_name "${ROOT_BASENAME}")} + +mkdir -p "${WORKTREE_ROOT}" "${LOG_DIR}" "${STATE_DIR}" "${CACHE_DIR}" + +declare -a SCENARIOS=() +if [[ "$#" -gt 0 ]]; then + SCENARIOS=("$@") +else + SCENARIOS=( + noop + gateway-auto + supervisor-auto + shared-auto + helm-auto + unrelated-auto + explicit-targets + gateway-cache + supervisor-cache + container-invalidation + ) +fi + +if [[ "${SKIP_CACHE}" == "1" ]]; then + declare -a filtered=() + filtered=() + for scenario in "${SCENARIOS[@]}"; do + if [[ "${scenario}" != "gateway-cache" && "${scenario}" != "supervisor-cache" ]]; then + filtered+=("${scenario}") + fi + done + SCENARIOS=("${filtered[@]}") +fi + +declare -a CREATED_WORKTREES=() + +cleanup_worktrees() { + if [[ "${KEEP_WORKTREES}" == "1" ]]; then + return + fi + + local dir + for dir in "${CREATED_WORKTREES[@]:-}"; do + if [[ -d "${dir}" ]]; then + git -C "${REPO_ROOT}" worktree remove --force "${dir}" >/dev/null 2>&1 || true + fi + done +} +# cleanup_worktrees is called by on_exit trap set after TSV header is written + +ensure_cluster() { + echo "Ensuring cluster is running (mise run cluster)..." 
+ mise run cluster +} + +create_worktree() { + local name=$1 + local dir="${WORKTREE_ROOT}/${name}" + rm -rf "${dir}" + git -C "${REPO_ROOT}" worktree add --detach "${dir}" HEAD >/dev/null + mise trust "${dir}/mise.toml" >/dev/null 2>&1 || true + CREATED_WORKTREES+=("${dir}") + printf '%s\n' "${dir}" +} + +append_marker() { + local file=$1 + local marker=$2 + printf '\n%s\n' "${marker}" >> "${file}" +} + +extract_plan_value() { + local log_file=$1 + local label=$2 + awk -F': +' -v pattern="${label}" '$0 ~ pattern {print $2; exit}' "${log_file}" +} + +extract_duration() { + local log_file=$1 + local label=$2 + awk -v prefix="${label} took " 'index($0, prefix) == 1 {sub(/^.* took /, "", $0); sub(/s$/, "", $0); print; exit}' "${log_file}" +} + +count_cached_lines() { + local log_file=$1 + grep -c " CACHED" "${log_file}" 2>/dev/null || true +} + +check_required_patterns() { + local log_file=$1 + local patterns=${2:-} + local pattern + + if [[ -z "${patterns}" ]]; then + return 0 + fi + + IFS='|' read -r -a pattern_array <<< "${patterns}" + for pattern in "${pattern_array[@]}"; do + if ! 
grep -Fq "${pattern}" "${log_file}"; then + return 1 + fi + done + + return 0 +} + +record_result() { + local scenario=$1 + local mode=$2 + local expected=$3 + local observed=$4 + local pass=$5 + local total_duration=$6 + local build_duration=$7 + local cached_lines=$8 + local notes=$9 + + printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \ + "${scenario}" "${mode}" "${expected}" "${observed}" "${pass}" \ + "${total_duration}" "${build_duration}" "${cached_lines}" "${notes}" >> "${SUMMARY_TSV}" +} + +write_summary_md() { + { + echo "# Fast Deploy Cache Test Summary" + echo + echo "- Cluster: \`${CLUSTER_NAME}\`" + echo "- Report dir: \`${REPORT_DIR}\`" + echo + echo "| Scenario | Mode | Expected | Observed | Pass | Total (s) | Builds (s) | Cached lines | Notes |" + echo "|---|---|---|---|---|---:|---:|---:|---|" + awk -F '\t' 'NR > 1 {printf "| %s | %s | `%s` | `%s` | %s | %s | %s | %s | %s |\n", $1, $2, $3, $4, $5, $6, $7, $8, $9}' "${SUMMARY_TSV}" + } > "${SUMMARY_MD}" +} + +run_fast_deploy() { + local worktree=$1 + local state_file=$2 + local log_file=$3 + shift 3 + + local start end + start=$(date +%s) + ( + cd "${worktree}" + env \ + BUILDKIT_PROGRESS=plain \ + CLUSTER_NAME="${CLUSTER_NAME}" \ + DEPLOY_FAST_STATE_FILE="${state_file}" \ + DOCKER_BUILD_CACHE_DIR="${CACHE_DIR}" \ + "$@" \ + mise run cluster + ) >"${log_file}" 2>&1 || true + end=$(date +%s) + printf '%s\n' $((end - start)) +} + +run_fast_deploy_args() { + local worktree=$1 + local state_file=$2 + local log_file=$3 + shift 3 + + local start end + start=$(date +%s) + ( + cd "${worktree}" + env \ + BUILDKIT_PROGRESS=plain \ + CLUSTER_NAME="${CLUSTER_NAME}" \ + DEPLOY_FAST_STATE_FILE="${state_file}" \ + DOCKER_BUILD_CACHE_DIR="${CACHE_DIR}" \ + mise run cluster -- "$@" + ) >"${log_file}" 2>&1 || true + end=$(date +%s) + printf '%s\n' $((end - start)) +} + +validate_plan() { + local log_file=$1 + local expected_gateway=$2 + local expected_supervisor=$3 + local expected_helm=$4 + + local gateway supervisor 
helm + gateway=$(extract_plan_value "${log_file}" "build gateway") + supervisor=$(extract_plan_value "${log_file}" "build supervisor") + helm=$(extract_plan_value "${log_file}" "helm upgrade") + + if [[ "${gateway}" == "${expected_gateway}" && "${supervisor}" == "${expected_supervisor}" && "${helm}" == "${expected_helm}" ]]; then + printf '%s\n' "build gateway=${gateway}, build supervisor=${supervisor}, helm upgrade=${helm}" + return 0 + fi + + printf '%s\n' "build gateway=${gateway:-missing}, build supervisor=${supervisor:-missing}, helm upgrade=${helm:-missing}" + return 1 +} + +clear_cache() { + rm -rf "${CACHE_DIR}" + mkdir -p "${CACHE_DIR}" +} + +prime_state() { + local name=$1 + local worktree state_file log_file + worktree=$(create_worktree "${name}-prime") + state_file="${STATE_DIR}/${name}.state" + log_file="${LOG_DIR}/${name}-prime.log" + run_fast_deploy "${worktree}" "${state_file}" "${log_file}" >/dev/null || true +} + +run_auto_scenario() { + local scenario=$1 + local file=$2 + local marker=$3 + local expected_gateway=$4 + local expected_supervisor=$5 + local expected_helm=$6 + local note=$7 + local required_patterns=${8:-} + + local worktree state_file log_file total_duration build_duration observed pass + worktree=$(create_worktree "${scenario}") + state_file="${STATE_DIR}/${scenario}.state" + log_file="${LOG_DIR}/${scenario}.log" + + prime_state "${scenario}" + append_marker "${worktree}/${file}" "${marker}" + + total_duration=$(run_fast_deploy "${worktree}" "${state_file}" "${log_file}") + build_duration=$(extract_duration "${log_file}" "Builds") + if observed=$(validate_plan "${log_file}" "${expected_gateway}" "${expected_supervisor}" "${expected_helm}"); then + pass=PASS + else + pass=FAIL + fi + + if [[ "${pass}" == "PASS" ]] && ! 
check_required_patterns "${log_file}" "${required_patterns}"; then + pass=FAIL + note="${note}; missing expected deploy log pattern" + fi + + record_result \ + "${scenario}" \ + "auto" \ + "build gateway=${expected_gateway}, build supervisor=${expected_supervisor}, helm upgrade=${expected_helm}" \ + "${observed}" \ + "${pass}" \ + "${total_duration}" \ + "${build_duration:-n/a}" \ + "$(count_cached_lines "${log_file}")" \ + "${note}" +} + +run_noop_scenario() { + local scenario=noop + local worktree state_file log_file total_duration build_duration observed pass notes + worktree=$(create_worktree "${scenario}") + state_file="${STATE_DIR}/${scenario}.state" + log_file="${LOG_DIR}/${scenario}.log" + + prime_state "${scenario}" + + total_duration=$(run_fast_deploy "${worktree}" "${state_file}" "${log_file}") + build_duration=$(extract_duration "${log_file}" "Builds") + if observed=$(validate_plan "${log_file}" 0 0 0); then + pass=PASS + else + pass=FAIL + fi + notes="clean tree should print no-op plan" + + if ! grep -q "No new local changes since last deploy." 
"${log_file}"; then + pass=FAIL + notes="missing no-op message" + fi + + record_result \ + "${scenario}" \ + "auto" \ + "build gateway=0, build supervisor=0, helm upgrade=0" \ + "${observed}" \ + "${pass}" \ + "${total_duration}" \ + "${build_duration:-n/a}" \ + "$(count_cached_lines "${log_file}")" \ + "${notes}" +} + +run_explicit_targets_scenario() { + local scenario=explicit-targets + local target worktree state_file log_file total_duration build_duration observed pass expected notes + + for target in gateway supervisor chart all; do + worktree=$(create_worktree "${scenario}-${target}") + state_file="${STATE_DIR}/${scenario}-${target}.state" + log_file="${LOG_DIR}/${scenario}-${target}.log" + + total_duration=$(run_fast_deploy_args "${worktree}" "${state_file}" "${log_file}" "${target}") + build_duration=$(extract_duration "${log_file}" "Builds") + + case "${target}" in + gateway) + if observed=$(validate_plan "${log_file}" 1 0 1); then + pass=PASS + else + pass=FAIL + fi + expected="build gateway=1, build supervisor=0, helm upgrade=1" + ;; + supervisor) + if observed=$(validate_plan "${log_file}" 0 1 0); then + pass=PASS + else + pass=FAIL + fi + expected="build gateway=0, build supervisor=1, helm upgrade=0" + ;; + chart) + if observed=$(validate_plan "${log_file}" 0 0 1); then + pass=PASS + else + pass=FAIL + fi + expected="build gateway=0, build supervisor=0, helm upgrade=1" + ;; + all) + if observed=$(validate_plan "${log_file}" 1 1 1); then + pass=PASS + else + pass=FAIL + fi + expected="build gateway=1, build supervisor=1, helm upgrade=1" + ;; + esac + notes="explicit target ${target}" + record_result \ + "${scenario}:${target}" \ + "explicit" \ + "${expected}" \ + "${observed}" \ + "${pass}" \ + "${total_duration}" \ + "${build_duration:-n/a}" \ + "$(count_cached_lines "${log_file}")" \ + "${notes}" + done +} + +run_cache_scenario() { + local scenario=$1 + local file=$2 + local marker=$3 + local target=$4 + + local worktree state_file cold_log warm_log 
cold_total warm_total cold_build warm_build cold_cached warm_cached pass notes + worktree=$(create_worktree "${scenario}") + state_file="${STATE_DIR}/${scenario}.state" + cold_log="${LOG_DIR}/${scenario}-cold.log" + warm_log="${LOG_DIR}/${scenario}-warm.log" + + append_marker "${worktree}/${file}" "${marker}" + clear_cache + + cold_total=$(run_fast_deploy_args "${worktree}" "${state_file}" "${cold_log}" "${target}") + cold_build=$(extract_duration "${cold_log}" "Builds") + cold_cached=$(count_cached_lines "${cold_log}") + + warm_total=$(run_fast_deploy_args "${worktree}" "${state_file}" "${warm_log}" "${target}") + warm_build=$(extract_duration "${warm_log}" "Builds") + warm_cached=$(count_cached_lines "${warm_log}") + + pass=FAIL + notes="warm rebuild should be faster or show cache hits" + + if [[ -n "${cold_build:-}" && -n "${warm_build:-}" && "${cold_build}" =~ ^[0-9]+$ && "${warm_build}" =~ ^[0-9]+$ && "${cold_build}" -gt 0 ]]; then + if [[ "${warm_build}" -le $((cold_build * 70 / 100)) ]]; then + pass=PASS + notes="warm build improved by at least 30%" + fi + fi + + if [[ "${pass}" != "PASS" && "${warm_cached}" =~ ^[0-9]+$ && "${warm_cached}" -gt "${cold_cached:-0}" ]]; then + pass=PASS + notes="warm build showed more cache hits" + fi + + record_result \ + "${scenario}:cold" \ + "cache" \ + "first rebuild of ${target} after cache clear" \ + "total=${cold_total}s, builds=${cold_build:-n/a}s" \ + "INFO" \ + "${cold_total}" \ + "${cold_build:-n/a}" \ + "${cold_cached}" \ + "baseline cold run" + + record_result \ + "${scenario}:warm" \ + "cache" \ + "second rebuild of ${target} should reuse cache" \ + "total=${warm_total}s, builds=${warm_build:-n/a}s" \ + "${pass}" \ + "${warm_total}" \ + "${warm_build:-n/a}" \ + "${warm_cached}" \ + "${notes}" +} + +run_container_invalidation_scenario() { + local scenario=container-invalidation + local worktree state_file prime_log rerun_log total_duration build_duration observed pass container_id notes + 
worktree=$(create_worktree "${scenario}") + state_file="${STATE_DIR}/${scenario}.state" + prime_log="${LOG_DIR}/${scenario}-prime.log" + rerun_log="${LOG_DIR}/${scenario}.log" + + run_fast_deploy "${worktree}" "${state_file}" "${prime_log}" >/dev/null + container_id=$(awk -F= '/^container_id=/ {print $2; exit}' "${state_file}") + if [[ -z "${container_id}" ]]; then + echo "Error: could not determine cluster container ID from state file for invalidation scenario." >&2 + exit 1 + fi + + sed -i.bak "s|^container_id=.*$|container_id=invalidated-${container_id}|" "${state_file}" + rm -f "${state_file}.bak" + + total_duration=$(run_fast_deploy "${worktree}" "${state_file}" "${rerun_log}") + build_duration=$(extract_duration "${rerun_log}" "Builds") + if observed=$(validate_plan "${rerun_log}" 1 0 1); then + pass=PASS + else + pass=FAIL + fi + notes="mismatched container ID should invalidate gateway and helm only" + + if [[ "${pass}" == "PASS" ]] && ! check_required_patterns "${rerun_log}" "Restarting gateway to pick up updated image...|Upgrading helm release..."; then + pass=FAIL + notes="${notes}; missing expected deploy log pattern" + fi + + record_result \ + "${scenario}" \ + "auto" \ + "build gateway=1, build supervisor=0, helm upgrade=1" \ + "${observed}" \ + "${pass}" \ + "${total_duration}" \ + "${build_duration:-n/a}" \ + "$(count_cached_lines "${rerun_log}")" \ + "${notes}" +} + +printf 'scenario\tmode\texpected\tobserved\tpass\ttotal_seconds\tbuild_seconds\tcached_lines\tnotes\n' > "${SUMMARY_TSV}" + +on_exit() { + write_summary_md + echo "" + cat "${SUMMARY_MD}" + echo "" + echo "Full report written to: ${SUMMARY_MD}" + cleanup_worktrees +} +trap on_exit EXIT + +ensure_cluster + +SCENARIO_COUNT=${#SCENARIOS[@]} +SCENARIO_IDX=0 + +for scenario in "${SCENARIOS[@]}"; do + SCENARIO_IDX=$((SCENARIO_IDX + 1)) + echo "[${SCENARIO_IDX}/${SCENARIO_COUNT}] Running scenario: ${scenario}" + case "${scenario}" in + noop) + run_noop_scenario + ;; + gateway-auto) + 
run_auto_scenario \ + "gateway-auto" \ + "crates/openshell-server/src/main.rs" \ + "// fast deploy cache test: gateway-auto ${RUN_ID}" \ + 1 0 1 \ + "gateway-only source change" \ + "Pushing updated images to local registry...|Restarting gateway to pick up updated image...|Upgrading helm release..." + ;; + supervisor-auto) + run_auto_scenario \ + "supervisor-auto" \ + "crates/openshell-sandbox/src/main.rs" \ + "// fast deploy cache test: supervisor-auto ${RUN_ID}" \ + 0 1 0 \ + "supervisor-only source change" \ + "Supervisor binary updated on cluster node." + ;; + shared-auto) + run_auto_scenario \ + "shared-auto" \ + "crates/openshell-policy/src/lib.rs" \ + "// fast deploy cache test: shared-auto ${RUN_ID}" \ + 1 1 1 \ + "shared dependency change should rebuild both binaries" \ + "Restarting gateway to pick up updated image...|Supervisor binary updated on cluster node." + ;; + helm-auto) + run_auto_scenario \ + "helm-auto" \ + "deploy/helm/openshell/values.yaml" \ + "# fast deploy cache test: helm-auto ${RUN_ID}" \ + 0 0 1 \ + "chart-only change" \ + "Upgrading helm release..." + ;; + unrelated-auto) + run_auto_scenario \ + "unrelated-auto" \ + "README.md" \ + "" \ + 0 0 0 \ + "unrelated file should stay a no-op" \ + "No new local changes since last deploy." 
+ ;; + explicit-targets) + run_explicit_targets_scenario + ;; + gateway-cache) + run_cache_scenario \ + "gateway-cache" \ + "crates/openshell-server/src/main.rs" \ + "// fast deploy cache test: gateway-cache ${RUN_ID}" \ + "gateway" + ;; + supervisor-cache) + run_cache_scenario \ + "supervisor-cache" \ + "crates/openshell-sandbox/src/main.rs" \ + "// fast deploy cache test: supervisor-cache ${RUN_ID}" \ + "supervisor" + ;; + container-invalidation) + run_container_invalidation_scenario + ;; + *) + echo "Unknown scenario '${scenario}'" >&2 + exit 1 + ;; + esac + echo "[${SCENARIO_IDX}/${SCENARIO_COUNT}] Done: ${scenario}" +done From 850c0a9ad574bc1abbaa751e41e24d2f565a321c Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Wed, 18 Mar 2026 12:52:11 -0700 Subject: [PATCH 06/15] perf(build): optimize fast-deploy supervisor builds - Add scratch export stage (supervisor-output) reducing export from 968 MB / 10s to 14 MB / 0.1s - Unify builder drivers: supervisor now uses desktop-linux driver when not cross-compiling, sharing BuildKit cache with gateway builds - Split monolithic COPY crates/ into per-target workspace stages so sandbox changes don't invalidate gateway builds and vice versa - Remove LTO from default release profile for fast local linking; CI restores codegen-units=1 via CARGO_CODEGEN_UNITS build arg - Set workspace version to 0.0.0 so stub builds in deps stage are clearly distinguishable from real versioned builds --- Cargo.lock | 18 +++++----- Cargo.toml | 4 +-- crates/openshell-core/build.rs | 5 ++- deploy/docker/Dockerfile.images | 49 +++++++++++++++++++++++++--- deploy/docker/cross-build.sh | 5 +++ tasks/scripts/cluster-deploy-fast.sh | 24 ++++++++++++-- tasks/scripts/docker-build-image.sh | 13 +++++++- 7 files changed, 96 insertions(+), 22 deletions(-) diff --git a/Cargo.lock b/Cargo.lock index 3d01356a..305c8a08 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -2803,7 +2803,7 @@ checksum = "c08d65885ee38876c4f86fa503fb49d7b507c2b62552df7c70b2fce627e06381" 
[[package]] name = "openshell-bootstrap" -version = "0.1.0" +version = "0.0.0" dependencies = [ "base64 0.22.1", "bollard", @@ -2822,7 +2822,7 @@ dependencies = [ [[package]] name = "openshell-cli" -version = "0.1.0" +version = "0.0.0" dependencies = [ "anyhow", "bytes", @@ -2867,7 +2867,7 @@ dependencies = [ [[package]] name = "openshell-core" -version = "0.1.0" +version = "0.0.0" dependencies = [ "miette", "prost", @@ -2884,7 +2884,7 @@ dependencies = [ [[package]] name = "openshell-policy" -version = "0.1.0" +version = "0.0.0" dependencies = [ "miette", "openshell-core", @@ -2894,7 +2894,7 @@ dependencies = [ [[package]] name = "openshell-providers" -version = "0.1.0" +version = "0.0.0" dependencies = [ "openshell-core", "thiserror 2.0.18", @@ -2902,7 +2902,7 @@ dependencies = [ [[package]] name = "openshell-router" -version = "0.1.0" +version = "0.0.0" dependencies = [ "bytes", "openshell-core", @@ -2920,7 +2920,7 @@ dependencies = [ [[package]] name = "openshell-sandbox" -version = "0.1.0" +version = "0.0.0" dependencies = [ "anyhow", "bytes", @@ -2961,7 +2961,7 @@ dependencies = [ [[package]] name = "openshell-server" -version = "0.1.0" +version = "0.0.0" dependencies = [ "anyhow", "axum 0.8.8", @@ -3015,7 +3015,7 @@ dependencies = [ [[package]] name = "openshell-tui" -version = "0.1.0" +version = "0.0.0" dependencies = [ "base64 0.22.1", "crossterm 0.28.1", diff --git a/Cargo.toml b/Cargo.toml index 215862cf..4fecf194 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -6,7 +6,7 @@ resolver = "2" members = ["crates/*"] [workspace.package] -version = "0.1.0" +version = "0.0.0" edition = "2024" rust-version = "1.88" license = "Apache-2.0" @@ -124,8 +124,6 @@ ref_option = "allow" # Common pattern for optional references missing_fields_in_debug = "allow" # Manual Debug impls often intentionally omit fields [profile.release] -lto = "thin" -codegen-units = 1 strip = true [profile.dev] diff --git a/crates/openshell-core/build.rs b/crates/openshell-core/build.rs index 
c06702c5..f44cdc75 100644 --- a/crates/openshell-core/build.rs +++ b/crates/openshell-core/build.rs @@ -17,7 +17,10 @@ fn main() -> Result<(), Box> { } // --- Protobuf compilation --- - // Use bundled protoc from protobuf-src + // Use bundled protoc from protobuf-src. The system protoc (from apt-get) + // does not bundle the well-known type includes (google/protobuf/struct.proto + // etc.), so we must use protobuf-src which ships both the binary and the + // include tree. // SAFETY: This is run at build time in a single-threaded build script context. // No other threads are reading environment variables concurrently. #[allow(unsafe_code)] diff --git a/deploy/docker/Dockerfile.images b/deploy/docker/Dockerfile.images index b16436c1..4740d830 100644 --- a/deploy/docker/Dockerfile.images +++ b/deploy/docker/Dockerfile.images @@ -10,6 +10,7 @@ # cluster Final cluster image # gateway-builder Release openshell-server binary # supervisor-builder Release openshell-sandbox binary +# supervisor-output Minimal stage exporting only the supervisor binary ARG K3S_VERSION=v1.35.2-k3s1 ARG K9S_VERSION=v0.50.18 @@ -24,6 +25,9 @@ ARG TARGETARCH ARG BUILDARCH ARG CARGO_TARGET_CACHE_SCOPE=default ARG SCCACHE_MEMCACHED_ENDPOINT +# CI sets this to 1 for maximum optimization; local builds leave it unset +# so cargo uses the Cargo.toml default (parallel codegen for fast linking). +ARG CARGO_CODEGEN_UNITS RUN apt-get update && apt-get install -y --no-install-recommends \ cmake g++ make protobuf-compiler curl && rm -rf /var/lib/apt/lists/* @@ -78,21 +82,34 @@ RUN --mount=type=cache,id=cargo-registry-${TARGETARCH},sharing=locked,target=/us --mount=type=cache,id=sccache-${TARGETARCH},sharing=locked,target=/tmp/sccache \ . 
cross-build.sh && cargo_cross_build --release -p openshell-server -p openshell-sandbox -FROM rust-deps AS rust-workspace +# --------------------------------------------------------------------------- +# Per-target workspace stages +# --------------------------------------------------------------------------- +# Copy only the crates needed for each target so that a change to +# openshell-sandbox does not invalidate the gateway build and vice versa. +# The skeleton stage already has stub Cargo.toml + src/ for every crate, +# so cargo workspace resolution continues to work — we just overwrite the +# crates whose real source is needed for compilation. + +FROM rust-deps AS gateway-workspace ARG OPENSHELL_CARGO_VERSION -COPY crates/ crates/ +COPY crates/openshell-core/ crates/openshell-core/ +COPY crates/openshell-policy/ crates/openshell-policy/ +COPY crates/openshell-providers/ crates/openshell-providers/ +COPY crates/openshell-router/ crates/openshell-router/ +COPY crates/openshell-server/ crates/openshell-server/ RUN touch \ crates/openshell-core/build.rs \ - crates/openshell-sandbox/src/main.rs \ crates/openshell-server/src/main.rs \ proto/*.proto && \ if [ -n "${OPENSHELL_CARGO_VERSION:-}" ]; then \ sed -i -E '/^\[workspace\.package\]/,/^\[/{s/^version[[:space:]]*=[[:space:]]*".*"/version = "'"${OPENSHELL_CARGO_VERSION}"'"/}' Cargo.toml; \ fi -FROM rust-workspace AS gateway-builder +FROM gateway-workspace AS gateway-builder +ARG CARGO_CODEGEN_UNITS RUN --mount=type=cache,id=cargo-registry-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \ --mount=type=cache,id=cargo-git-${TARGETARCH},sharing=locked,target=/usr/local/cargo/git \ @@ -103,7 +120,24 @@ RUN --mount=type=cache,id=cargo-registry-${TARGETARCH},sharing=locked,target=/us mkdir -p /build/out && \ cp "$(cross_output_dir release)/openshell-server" /build/out/ -FROM rust-workspace AS supervisor-builder +FROM rust-deps AS supervisor-workspace +ARG OPENSHELL_CARGO_VERSION + +COPY 
crates/openshell-core/ crates/openshell-core/ +COPY crates/openshell-policy/ crates/openshell-policy/ +COPY crates/openshell-router/ crates/openshell-router/ +COPY crates/openshell-sandbox/ crates/openshell-sandbox/ + +RUN touch \ + crates/openshell-core/build.rs \ + crates/openshell-sandbox/src/main.rs \ + proto/*.proto && \ + if [ -n "${OPENSHELL_CARGO_VERSION:-}" ]; then \ + sed -i -E '/^\[workspace\.package\]/,/^\[/{s/^version[[:space:]]*=[[:space:]]*".*"/version = "'"${OPENSHELL_CARGO_VERSION}"'"/}' Cargo.toml; \ + fi + +FROM supervisor-workspace AS supervisor-builder +ARG CARGO_CODEGEN_UNITS RUN --mount=type=cache,id=cargo-registry-${TARGETARCH},sharing=locked,target=/usr/local/cargo/registry \ --mount=type=cache,id=cargo-git-${TARGETARCH},sharing=locked,target=/usr/local/cargo/git \ @@ -114,6 +148,11 @@ RUN --mount=type=cache,id=cargo-registry-${TARGETARCH},sharing=locked,target=/us mkdir -p /build/out && \ cp "$(cross_output_dir release)/openshell-sandbox" /build/out/ +# Minimal extraction stage for fast-deploy: exports only the supervisor +# binary (~14 MB) instead of the entire build environment (~968 MB). +FROM scratch AS supervisor-output +COPY --from=supervisor-builder /build/out/openshell-sandbox /openshell-sandbox + # --------------------------------------------------------------------------- # Final gateway image # --------------------------------------------------------------------------- diff --git a/deploy/docker/cross-build.sh b/deploy/docker/cross-build.sh index f7a0047f..0c2678b8 100755 --- a/deploy/docker/cross-build.sh +++ b/deploy/docker/cross-build.sh @@ -90,6 +90,11 @@ cargo_cross_build() { if [ -z "${SCCACHE_MEMCACHED_ENDPOINT:-}" ]; then unset SCCACHE_MEMCACHED_ENDPOINT 2>/dev/null || true fi + # When CARGO_CODEGEN_UNITS is set (e.g. to 1 in CI builds), override the Cargo.toml + # release profile to use that many codegen units.
+ if [ -n "${CARGO_CODEGEN_UNITS:-}" ]; then + export CARGO_PROFILE_RELEASE_CODEGEN_UNITS="${CARGO_CODEGEN_UNITS}" + fi # Default sccache local disk cache to /tmp/sccache (matches BuildKit # cache mount target in Dockerfiles) when no dir is explicitly set. export SCCACHE_DIR="${SCCACHE_DIR:-/tmp/sccache}" diff --git a/tasks/scripts/cluster-deploy-fast.sh b/tasks/scripts/cluster-deploy-fast.sh index 1b2991da..a9831f31 100755 --- a/tasks/scripts/cluster-deploy-fast.sh +++ b/tasks/scripts/cluster-deploy-fast.sh @@ -315,6 +315,14 @@ if [[ "${build_supervisor}" == "1" ]]; then _cluster_image=$(docker inspect --format '{{.Config.Image}}' "${CONTAINER_NAME}" 2>/dev/null) CLUSTER_ARCH=$(docker image inspect --format '{{.Architecture}}' "${_cluster_image}" 2>/dev/null || echo "amd64") + # Detect the host (build) architecture in Docker's naming convention. + HOST_ARCH=$(docker info --format '{{.Architecture}}' 2>/dev/null || echo "amd64") + # Normalize: Docker reports "aarch64" on ARM hosts but uses "arm64" elsewhere. + case "${HOST_ARCH}" in + aarch64) HOST_ARCH=arm64 ;; + x86_64) HOST_ARCH=amd64 ;; + esac + # Build the supervisor binary from the shared image build graph, then # extract it via --output so fast deploys reuse the same Rust cache. SUPERVISOR_BUILD_DIR=$(mktemp -d) @@ -326,14 +334,24 @@ if [[ "${build_supervisor}" == "1" ]]; then _cargo_version=$(uv run python tasks/scripts/release.py get-version --cargo 2>/dev/null || true) fi - DOCKER_PLATFORM="linux/${CLUSTER_ARCH}" \ + # Only set DOCKER_PLATFORM when actually cross-compiling. Omitting it + # for native builds lets docker-build-image.sh pick the fast "docker" + # driver (same as gateway), which shares BuildKit cache mounts (sccache, + # cargo registry/target) and avoids docker-container IPC overhead. 
+ _platform_env=() + if [[ "${CLUSTER_ARCH}" != "${HOST_ARCH}" ]]; then + _platform_env=(DOCKER_PLATFORM="linux/${CLUSTER_ARCH}") + fi + + env \ + "${_platform_env[@]+"${_platform_env[@]}"}" \ DOCKER_OUTPUT="type=local,dest=${SUPERVISOR_BUILD_DIR}" \ OPENSHELL_CARGO_VERSION="${_cargo_version}" \ - tasks/scripts/docker-build-image.sh supervisor-builder + tasks/scripts/docker-build-image.sh supervisor-output # Copy the built binary into the running k3s container docker exec "${CONTAINER_NAME}" mkdir -p /opt/openshell/bin - docker cp "${SUPERVISOR_BUILD_DIR}/build/out/openshell-sandbox" \ + docker cp "${SUPERVISOR_BUILD_DIR}/openshell-sandbox" \ "${CONTAINER_NAME}:/opt/openshell/bin/openshell-sandbox" docker exec "${CONTAINER_NAME}" chmod 755 /opt/openshell/bin/openshell-sandbox diff --git a/tasks/scripts/docker-build-image.sh b/tasks/scripts/docker-build-image.sh index 8f29f57a..9a116514 100755 --- a/tasks/scripts/docker-build-image.sh +++ b/tasks/scripts/docker-build-image.sh @@ -38,7 +38,7 @@ detect_rust_scope() { echo "no-rust" } -TARGET=${1:?"Usage: docker-build-image.sh [extra-args...]"} +TARGET=${1:?"Usage: docker-build-image.sh [extra-args...]"} shift DOCKERFILE="deploy/docker/Dockerfile.images" @@ -64,6 +64,9 @@ case "${TARGET}" in supervisor-builder) DOCKER_TARGET="supervisor-builder" ;; + supervisor-output) + DOCKER_TARGET="supervisor-output" + ;; *) echo "Error: unsupported target '${TARGET}'" >&2 exit 1 @@ -122,6 +125,13 @@ if [[ "${TARGET}" == "cluster" && -n "${K3S_VERSION:-}" ]]; then K3S_ARGS=(--build-arg "K3S_VERSION=${K3S_VERSION}") fi +# CI builds use codegen-units=1 for maximum optimization; local builds omit +# the arg so cargo uses the Cargo.toml default (parallel codegen, fast links). 
+CODEGEN_ARGS=() +if [[ -n "${CI:-}" ]]; then + CODEGEN_ARGS=(--build-arg "CARGO_CODEGEN_UNITS=1") +fi + TAG_ARGS=() if [[ "${IS_FINAL_IMAGE}" == "1" ]]; then TAG_ARGS=(-t "${IMAGE_NAME}:${IMAGE_TAG}") @@ -150,6 +160,7 @@ docker buildx build \ ${SCCACHE_ARGS[@]+"${SCCACHE_ARGS[@]}"} \ ${VERSION_ARGS[@]+"${VERSION_ARGS[@]}"} \ ${K3S_ARGS[@]+"${K3S_ARGS[@]}"} \ + ${CODEGEN_ARGS[@]+"${CODEGEN_ARGS[@]}"} \ --build-arg "CARGO_TARGET_CACHE_SCOPE=${CARGO_TARGET_CACHE_SCOPE}" \ -f "${DOCKERFILE}" \ --target "${DOCKER_TARGET}" \ From c710b07522662436bb40aa1ffcf4dc72af677434 Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Wed, 18 Mar 2026 13:09:52 -0700 Subject: [PATCH 07/15] docs(build): add guidance to cache test summary report Explain how to read each column and what to look for when reviewing fast-deploy benchmark results. --- .../cluster-deploy-fast-test.sh | 25 +++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/scripts/build-benchmark/cluster-deploy-fast-test.sh b/scripts/build-benchmark/cluster-deploy-fast-test.sh index c0ef6609..e1867354 100755 --- a/scripts/build-benchmark/cluster-deploy-fast-test.sh +++ b/scripts/build-benchmark/cluster-deploy-fast-test.sh @@ -185,6 +185,31 @@ write_summary_md() { echo "- Cluster: \`${CLUSTER_NAME}\`" echo "- Report dir: \`${REPORT_DIR}\`" echo + echo "## How to read this report" + echo + echo "Each row is an independent scenario that validates one aspect of the fast-deploy" + echo "change-detection and caching system." + echo + echo "**Columns:**" + echo + echo "- **Scenario** -- Name of the test. Suffixes like \`:cold\`/\`:warm\` indicate cache comparison phases." + echo "- **Mode** -- \`auto\` = change detection decides what to build; \`explicit\` = target forced via CLI arg; \`cache\` = cold vs warm timing comparison." + echo "- **Expected / Observed** -- The deploy plan the test expected vs what actually ran. Format: \`build gateway=N, build supervisor=N, helm upgrade=N\` where 1 = triggered, 0 = skipped." 
+ echo "- **Pass** -- \`PASS\` = observed matched expected; \`FAIL\` = mismatch; \`INFO\` = informational baseline (no assertion)." + echo "- **Total (s)** -- Wall-clock time for the entire deploy including Docker context transfer, image push, and Helm rollout." + echo "- **Builds (s)** -- Time spent in cargo compilation and Docker image builds only. The gap between Total and Builds is deploy overhead (image push, Helm upgrade, pod rollout)." + echo "- **Cached lines** -- Number of Docker build steps that hit \`CACHED\`. Higher = more layer reuse. In cache scenarios, warm runs should show more cached lines than cold runs." + echo "- **Notes** -- Context about what the scenario tests or why it passed/failed." + echo + echo "**What to look for:**" + echo + echo "- All \`auto\` and \`explicit\` scenarios should be \`PASS\`. A \`FAIL\` means change detection routed incorrectly." + echo "- In \`cache\` scenarios, the \`:warm\` row should show either >= 30% faster build time or more cached lines than the \`:cold\` row." + echo "- The \`:cold\` rows are marked \`INFO\` -- they are baselines, not assertions." + echo "- If supervisor warm builds are slow, check the logs for full recompilation (many \`Compiling\` lines) vs cache hits (only workspace crates recompiling)." 
+ echo + echo "## Results" + echo echo "| Scenario | Mode | Expected | Observed | Pass | Total (s) | Builds (s) | Cached lines | Notes |" echo "|---|---|---|---|---|---:|---:|---:|---|" awk -F '\t' 'NR > 1 {printf "| %s | %s | `%s` | `%s` | %s | %s | %s | %s | %s |\n", $1, $2, $3, $4, $5, $6, $7, $8, $9}' "${SUMMARY_TSV}" From ae53463334a1dfaae9373e9099f70438aac98b9f Mon Sep 17 00:00:00 2001 From: Drew Newberry Date: Wed, 18 Mar 2026 13:57:59 -0700 Subject: [PATCH 08/15] wip --- .../workflows/cluster-deploy-fast-nightly.yml | 98 --- architecture/build-containers.md | 1 - tasks/cluster.toml | 4 - .../scripts/cluster-deploy-fast-test-junit.py | 110 --- tasks/scripts/cluster-deploy-fast-test.sh | 625 ------------------ 5 files changed, 838 deletions(-) delete mode 100644 .github/workflows/cluster-deploy-fast-nightly.yml delete mode 100644 tasks/scripts/cluster-deploy-fast-test-junit.py delete mode 100755 tasks/scripts/cluster-deploy-fast-test.sh diff --git a/.github/workflows/cluster-deploy-fast-nightly.yml b/.github/workflows/cluster-deploy-fast-nightly.yml deleted file mode 100644 index 67c323f0..00000000 --- a/.github/workflows/cluster-deploy-fast-nightly.yml +++ /dev/null @@ -1,98 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
-# SPDX-License-Identifier: Apache-2.0 - -name: Cluster Deploy Fast Nightly - -on: - workflow_dispatch: {} - schedule: - - cron: "0 8 * * *" # 12 AM PT (08:00 UTC during PDT) - -permissions: - contents: read - packages: read - actions: read - checks: write - -concurrency: - group: cluster-deploy-fast-nightly - cancel-in-progress: false - -jobs: - fast-deploy-cache: - name: Fast Deploy Cache - runs-on: build-amd64 - timeout-minutes: 90 - container: - image: ghcr.io/nvidia/openshell/ci:latest - credentials: - username: ${{ github.actor }} - password: ${{ secrets.GITHUB_TOKEN }} - options: --privileged - volumes: - - /var/run/docker.sock:/var/run/docker.sock - env: - MISE_GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} - FAST_DEPLOY_TEST_REPORT_DIR: ${{ github.workspace }}/.cache/cluster-deploy-fast-test/nightly-${{ github.run_id }} - steps: - - uses: actions/checkout@v4 - - - name: Log in to GitHub Container Registry - uses: docker/login-action@v3 - with: - registry: ghcr.io - username: ${{ github.actor }} - password: ${{ secrets.GITHUB_TOKEN }} - - - name: Install Python dependencies and generate protobuf stubs - run: uv sync --frozen && mise run --no-prepare python:proto - - - name: Bootstrap cluster - run: mise run --no-prepare --skip-deps cluster - - - name: Run fast deploy cache harness - id: harness - if: ${{ !cancelled() }} - run: | - set -euo pipefail - set +e - tasks/scripts/cluster-deploy-fast-test.sh - status=$? 
- set -e - junit_path="${FAST_DEPLOY_TEST_REPORT_DIR}/junit.xml" - if [ -f "${FAST_DEPLOY_TEST_REPORT_DIR}/summary.tsv" ]; then - uv run python tasks/scripts/cluster-deploy-fast-test-junit.py \ - --input "${FAST_DEPLOY_TEST_REPORT_DIR}/summary.tsv" \ - --output "${junit_path}" \ - --suite-name "cluster-deploy-fast-nightly" - echo "junit_path=${junit_path}" >> "$GITHUB_OUTPUT" - fi - if [ -f "${FAST_DEPLOY_TEST_REPORT_DIR}/summary.md" ]; then - echo "summary_path=${FAST_DEPLOY_TEST_REPORT_DIR}/summary.md" >> "$GITHUB_OUTPUT" - fi - exit "${status}" - - - name: Publish Markdown summary - if: ${{ always() && !cancelled() && steps.harness.outputs.summary_path != '' }} - run: cat "${{ steps.harness.outputs.summary_path }}" >> "$GITHUB_STEP_SUMMARY" - - - name: Upload fast deploy nightly artifacts - if: ${{ always() && !cancelled() }} - uses: actions/upload-artifact@v4 - with: - name: cluster-deploy-fast-nightly-${{ github.run_id }} - path: ${{ env.FAST_DEPLOY_TEST_REPORT_DIR }} - if-no-files-found: error - retention-days: 14 - - - name: Publish test report - if: ${{ always() && !cancelled() && steps.harness.outputs.junit_path != '' }} - uses: dorny/test-reporter@v2 - with: - name: Cluster Deploy Fast Nightly - path: ${{ steps.harness.outputs.junit_path }} - reporter: java-junit - fail-on-error: true - use-actions-summary: true - report-title: Cluster Deploy Fast Nightly - badge-title: fast-deploy diff --git a/architecture/build-containers.md b/architecture/build-containers.md index 28f2e9ad..ee69896e 100644 --- a/architecture/build-containers.md +++ b/architecture/build-containers.md @@ -71,4 +71,3 @@ The harness runs isolated scenarios in temporary git worktrees, keeps its own st - cold vs warm rebuild comparisons for gateway and supervisor code changes - container-ID invalidation coverage to verify gateway + Helm are retriggered when the cluster container changes -Nightly CI runs the same harness in `.github/workflows/cluster-deploy-fast-nightly.yml`, uploads the full 
report directory as an artifact, and publishes the generated JUnit summary through `dorny/test-reporter` so failures and regressions are visible from the workflow run. diff --git a/tasks/cluster.toml b/tasks/cluster.toml index 6ddb0526..da06178f 100644 --- a/tasks/cluster.toml +++ b/tasks/cluster.toml @@ -30,10 +30,6 @@ description = "Pull-mode deploy using local registry pushes" run = "tasks/scripts/cluster-deploy-fast.sh all" hide = true -["cluster:test:fast-deploy-cache"] -description = "Validate fast deploy routing and image cache reuse across component changes" -run = "tasks/scripts/cluster-deploy-fast-test.sh" - ["cluster:push:gateway"] description = "Tag and push gateway image to pull registry" run = "tasks/scripts/cluster-push-component.sh gateway" diff --git a/tasks/scripts/cluster-deploy-fast-test-junit.py b/tasks/scripts/cluster-deploy-fast-test-junit.py deleted file mode 100644 index 94bd862e..00000000 --- a/tasks/scripts/cluster-deploy-fast-test-junit.py +++ /dev/null @@ -1,110 +0,0 @@ -#!/usr/bin/env python3 - -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -from __future__ import annotations - -import argparse -import csv -import sys -import xml.etree.ElementTree as ET -from pathlib import Path - - -def parse_args() -> argparse.Namespace: - parser = argparse.ArgumentParser( - description="Convert cluster-deploy-fast TSV summary into JUnit XML." 
- ) - parser.add_argument("--input", required=True, help="Path to summary.tsv") - parser.add_argument("--output", required=True, help="Path to junit.xml") - parser.add_argument( - "--suite-name", - default="cluster-deploy-fast-nightly", - help="JUnit testsuite name", - ) - return parser.parse_args() - - -def as_seconds(value: str) -> str: - if not value or value == "n/a": - return "0" - try: - return str(float(value)) - except ValueError: - return "0" - - -def build_xml(rows: list[dict[str, str]], suite_name: str) -> ET.ElementTree: - total = len(rows) - failures = sum(1 for row in rows if row["pass"] == "FAIL") - skipped = sum(1 for row in rows if row["pass"] == "INFO") - suite = ET.Element( - "testsuite", - name=suite_name, - tests=str(total), - failures=str(failures), - skipped=str(skipped), - ) - - for row in rows: - case = ET.SubElement( - suite, - "testcase", - classname=f"cluster_deploy_fast.{row['mode']}", - name=row["scenario"], - time=as_seconds(row["total_seconds"]), - ) - - if row["pass"] == "FAIL": - failure = ET.SubElement( - case, - "failure", - message=row["notes"] or "scenario failed", - ) - failure.text = ( - f"Expected: {row['expected']}\n" - f"Observed: {row['observed']}\n" - f"Notes: {row['notes']}" - ) - elif row["pass"] == "INFO": - skipped = ET.SubElement( - case, - "skipped", - message=row["notes"] or "informational baseline", - ) - skipped.text = f"Observed: {row['observed']}" - - properties = ET.SubElement(case, "properties") - for key in ( - "expected", - "observed", - "build_seconds", - "cached_lines", - "notes", - ): - ET.SubElement(properties, "property", name=key, value=row[key]) - - return ET.ElementTree(suite) - - -def main() -> int: - args = parse_args() - input_path = Path(args.input) - output_path = Path(args.output) - - with input_path.open(newline="", encoding="utf-8") as handle: - rows = list(csv.DictReader(handle, delimiter="\t")) - - if not rows: - print(f"no rows found in {input_path}", file=sys.stderr) - return 1 - - 
output_path.parent.mkdir(parents=True, exist_ok=True) - tree = build_xml(rows, args.suite_name) - tree.write(output_path, encoding="utf-8", xml_declaration=True) - return 0 - - -if __name__ == "__main__": - raise SystemExit(main()) diff --git a/tasks/scripts/cluster-deploy-fast-test.sh b/tasks/scripts/cluster-deploy-fast-test.sh deleted file mode 100755 index 5d3048d0..00000000 --- a/tasks/scripts/cluster-deploy-fast-test.sh +++ /dev/null @@ -1,625 +0,0 @@ -#!/usr/bin/env bash - -# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 - -set -euo pipefail - -usage() { - cat <<'EOF' -Usage: cluster-deploy-fast-test.sh [scenario...] - -Repeatable validation harness for tasks/scripts/cluster-deploy-fast.sh. - -Scenarios: - noop Validate clean-tree auto deploy is a no-op after state is primed - gateway-auto Gateway-only change triggers gateway rebuild + Helm upgrade - supervisor-auto Supervisor-only change triggers supervisor refresh only - shared-auto Shared change triggers gateway + supervisor rebuild - helm-auto Helm-only change triggers Helm upgrade only - unrelated-auto Unrelated change stays a no-op - explicit-targets Explicit targets override change detection - gateway-cache Compare cold vs warm gateway rebuild after a code change - supervisor-cache Compare cold vs warm supervisor rebuild after a code change - container-invalidation Mismatched container ID invalidates gateway + Helm state - -If no scenarios are provided, the full suite runs. 
-
-Environment:
-  CLUSTER_NAME                     Override cluster name to test against
-  FAST_DEPLOY_TEST_REPORT_DIR      Output directory (default: .cache/cluster-deploy-fast-test/)
-  FAST_DEPLOY_TEST_KEEP_WORKTREES  Keep temporary worktrees when set to 1
-  FAST_DEPLOY_TEST_SKIP_CACHE      Skip the cache timing scenarios when set to 1
-EOF
-}
-
-if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
-  usage
-  exit 0
-fi
-
-SCRIPT_DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
-REPO_ROOT=$(cd "${SCRIPT_DIR}/../.." && pwd)
-RUN_ID=$(date +"%Y%m%d-%H%M%S")
-REPORT_DIR=${FAST_DEPLOY_TEST_REPORT_DIR:-"${REPO_ROOT}/.cache/cluster-deploy-fast-test/${RUN_ID}"}
-WORKTREE_ROOT="${REPORT_DIR}/worktrees"
-LOG_DIR="${REPORT_DIR}/logs"
-STATE_DIR="${REPORT_DIR}/state"
-CACHE_DIR="${REPORT_DIR}/buildkit-cache"
-SUMMARY_TSV="${REPORT_DIR}/summary.tsv"
-SUMMARY_MD="${REPORT_DIR}/summary.md"
-KEEP_WORKTREES=${FAST_DEPLOY_TEST_KEEP_WORKTREES:-0}
-SKIP_CACHE=${FAST_DEPLOY_TEST_SKIP_CACHE:-0}
-
-normalize_name() {
-  echo "$1" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9-]/-/g' | sed 's/--*/-/g' | sed 's/^-//;s/-$//'
-}
-
-ROOT_BASENAME=$(basename "${REPO_ROOT}")
-CLUSTER_NAME=${CLUSTER_NAME:-$(normalize_name "${ROOT_BASENAME}")}
-
-mkdir -p "${WORKTREE_ROOT}" "${LOG_DIR}" "${STATE_DIR}" "${CACHE_DIR}"
-
-declare -a SCENARIOS=()
-if [[ "$#" -gt 0 ]]; then
-  SCENARIOS=("$@")
-else
-  SCENARIOS=(
-    noop
-    gateway-auto
-    supervisor-auto
-    shared-auto
-    helm-auto
-    unrelated-auto
-    explicit-targets
-    gateway-cache
-    supervisor-cache
-    container-invalidation
-  )
-fi
-
-if [[ "${SKIP_CACHE}" == "1" ]]; then
-  declare -a filtered=()
-  filtered=()
-  for scenario in "${SCENARIOS[@]}"; do
-    if [[ "${scenario}" != "gateway-cache" && "${scenario}" != "supervisor-cache" ]]; then
-      filtered+=("${scenario}")
-    fi
-  done
-  SCENARIOS=("${filtered[@]}")
-fi
-
-declare -a CREATED_WORKTREES=()
-
-cleanup() {
-  if [[ "${KEEP_WORKTREES}" == "1" ]]; then
-    return
-  fi
-
-  local dir
-  for dir in "${CREATED_WORKTREES[@]:-}"; do
-    if [[ -d "${dir}" ]]; then
-      git -C "${REPO_ROOT}" worktree remove --force "${dir}" >/dev/null 2>&1 || true
-    fi
-  done
-}
-trap cleanup EXIT
-
-buildx_driver() {
-  local -a builder_args=()
-  local ctx
-
-  if [[ -n "${DOCKER_BUILDER:-}" ]]; then
-    builder_args=(--builder "${DOCKER_BUILDER}")
-  elif [[ -z "${DOCKER_PLATFORM:-}" && -z "${CI:-}" ]]; then
-    ctx=$(docker context inspect --format '{{.Name}}' 2>/dev/null || echo default)
-    builder_args=(--builder "${ctx}")
-  fi
-
-  docker buildx inspect ${builder_args[@]+"${builder_args[@]}"} 2>/dev/null \
-    | awk -F': ' '/Driver:/ {gsub(/^[[:space:]]+|[[:space:]]+$/, "", $2); print $2; exit}'
-}
-
-current_cluster_container_id() {
-  docker inspect --format '{{.Id}}' "openshell-cluster-${CLUSTER_NAME}" 2>/dev/null || true
-}
-
-require_cluster() {
-  if ! docker ps -q --filter "name=^openshell-cluster-${CLUSTER_NAME}$" --filter "health=healthy" | grep -q .; then
-    echo "Error: cluster container 'openshell-cluster-${CLUSTER_NAME}' is not running or healthy." >&2
-    echo "Start it first with: mise run cluster" >&2
-    exit 1
-  fi
-}
-
-create_worktree() {
-  local name=$1
-  local dir="${WORKTREE_ROOT}/${name}"
-  rm -rf "${dir}"
-  git -C "${REPO_ROOT}" worktree add --detach "${dir}" HEAD >/dev/null
-  CREATED_WORKTREES+=("${dir}")
-  printf '%s\n' "${dir}"
-}
-
-append_marker() {
-  local file=$1
-  local marker=$2
-  printf '\n%s\n' "${marker}" >> "${file}"
-}
-
-extract_plan_value() {
-  local log_file=$1
-  local label=$2
-  awk -F': +' -v pattern="${label}" '$0 ~ pattern {print $2; exit}' "${log_file}"
-}
-
-extract_duration() {
-  local log_file=$1
-  local label=$2
-  awk -v prefix="${label} took " 'index($0, prefix) == 1 {sub(/^.* took /, "", $0); sub(/s$/, "", $0); print; exit}' "${log_file}"
-}
-
-count_cached_lines() {
-  local log_file=$1
-  grep -c " CACHED" "${log_file}" 2>/dev/null || true
-}
-
-check_required_patterns() {
-  local log_file=$1
-  local patterns=${2:-}
-  local pattern
-
-  if [[ -z "${patterns}" ]]; then
-    return 0
-  fi
-
-  IFS='|' read -r -a pattern_array <<< "${patterns}"
-  for pattern in "${pattern_array[@]}"; do
-    if ! grep -Fq "${pattern}" "${log_file}"; then
-      return 1
-    fi
-  done
-
-  return 0
-}
-
-record_result() {
-  local scenario=$1
-  local mode=$2
-  local expected=$3
-  local observed=$4
-  local pass=$5
-  local total_duration=$6
-  local build_duration=$7
-  local cached_lines=$8
-  local notes=$9
-
-  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
-    "${scenario}" "${mode}" "${expected}" "${observed}" "${pass}" \
-    "${total_duration}" "${build_duration}" "${cached_lines}" "${notes}" >> "${SUMMARY_TSV}"
-}
-
-write_summary_md() {
-  {
-    echo "# Fast Deploy Cache Test Summary"
-    echo
-    echo "- Cluster: \`${CLUSTER_NAME}\`"
-    echo "- Buildx driver: \`${BUILDX_DRIVER:-unknown}\`"
-    echo "- Report dir: \`${REPORT_DIR}\`"
-    echo
-    echo "| Scenario | Mode | Expected | Observed | Pass | Total (s) | Builds (s) | Cached lines | Notes |"
-    echo "|---|---|---|---|---|---:|---:|---:|---|"
-    awk -F '\t' 'NR > 1 {printf "| %s | %s | `%s` | `%s` | %s | %s | %s | %s | %s |\n", $1, $2, $3, $4, $5, $6, $7, $8, $9}' "${SUMMARY_TSV}"
-  } > "${SUMMARY_MD}"
-}
-
-run_fast_deploy() {
-  local worktree=$1
-  local state_file=$2
-  local log_file=$3
-  shift 3
-
-  local start end status
-  start=$(date +%s)
-  (
-    cd "${worktree}"
-    env \
-      BUILDKIT_PROGRESS=plain \
-      CLUSTER_NAME="${CLUSTER_NAME}" \
-      DEPLOY_FAST_STATE_FILE="${state_file}" \
-      DOCKER_BUILD_CACHE_DIR="${CACHE_DIR}" \
-      "$@" \
-      ./tasks/scripts/cluster-deploy-fast.sh
-  ) >"${log_file}" 2>&1 || status=$?
-  status=${status:-0}
-  end=$(date +%s)
-  printf '%s\n' $((end - start))
-  return "${status}"
-}
-
-run_fast_deploy_args() {
-  local worktree=$1
-  local state_file=$2
-  local log_file=$3
-  shift 3
-
-  local start end status
-  start=$(date +%s)
-  (
-    cd "${worktree}"
-    env \
-      BUILDKIT_PROGRESS=plain \
-      CLUSTER_NAME="${CLUSTER_NAME}" \
-      DEPLOY_FAST_STATE_FILE="${state_file}" \
-      DOCKER_BUILD_CACHE_DIR="${CACHE_DIR}" \
-      ./tasks/scripts/cluster-deploy-fast.sh "$@"
-  ) >"${log_file}" 2>&1 || status=$?
-  status=${status:-0}
-  end=$(date +%s)
-  printf '%s\n' $((end - start))
-  return "${status}"
-}
-
-validate_plan() {
-  local log_file=$1
-  local expected_gateway=$2
-  local expected_supervisor=$3
-  local expected_helm=$4
-
-  local gateway supervisor helm
-  gateway=$(extract_plan_value "${log_file}" "build gateway")
-  supervisor=$(extract_plan_value "${log_file}" "build supervisor")
-  helm=$(extract_plan_value "${log_file}" "helm upgrade")
-
-  if [[ "${gateway}" == "${expected_gateway}" && "${supervisor}" == "${expected_supervisor}" && "${helm}" == "${expected_helm}" ]]; then
-    printf '%s\n' "build gateway=${gateway}, build supervisor=${supervisor}, helm upgrade=${helm}"
-    return 0
-  fi
-
-  printf '%s\n' "build gateway=${gateway:-missing}, build supervisor=${supervisor:-missing}, helm upgrade=${helm:-missing}"
-  return 1
-}
-
-clear_cache() {
-  rm -rf "${CACHE_DIR}"
-  mkdir -p "${CACHE_DIR}"
-}
-
-prime_state() {
-  local name=$1
-  local worktree state_file log_file
-  worktree=$(create_worktree "${name}-prime")
-  state_file="${STATE_DIR}/${name}.state"
-  log_file="${LOG_DIR}/${name}-prime.log"
-  run_fast_deploy "${worktree}" "${state_file}" "${log_file}" >/dev/null
-}
-
-run_auto_scenario() {
-  local scenario=$1
-  local file=$2
-  local marker=$3
-  local expected_gateway=$4
-  local expected_supervisor=$5
-  local expected_helm=$6
-  local note=$7
-  local required_patterns=${8:-}
-
-  local worktree state_file log_file total_duration build_duration observed pass
-  worktree=$(create_worktree "${scenario}")
-  state_file="${STATE_DIR}/${scenario}.state"
-  log_file="${LOG_DIR}/${scenario}.log"
-
-  prime_state "${scenario}"
-  append_marker "${worktree}/${file}" "${marker}"
-
-  total_duration=$(run_fast_deploy "${worktree}" "${state_file}" "${log_file}")
-  build_duration=$(extract_duration "${log_file}" "Builds")
-  if observed=$(validate_plan "${log_file}" "${expected_gateway}" "${expected_supervisor}" "${expected_helm}"); then
-    pass=PASS
-  else
-    pass=FAIL
-  fi
-
-  if [[ "${pass}" == "PASS" ]] && ! check_required_patterns "${log_file}" "${required_patterns}"; then
-    pass=FAIL
-    note="${note}; missing expected deploy log pattern"
-  fi
-
-  record_result \
-    "${scenario}" \
-    "auto" \
-    "build gateway=${expected_gateway}, build supervisor=${expected_supervisor}, helm upgrade=${expected_helm}" \
-    "${observed}" \
-    "${pass}" \
-    "${total_duration}" \
-    "${build_duration:-n/a}" \
-    "$(count_cached_lines "${log_file}")" \
-    "${note}"
-}
-
-run_noop_scenario() {
-  local scenario=noop
-  local worktree state_file log_file total_duration build_duration observed pass notes
-  worktree=$(create_worktree "${scenario}")
-  state_file="${STATE_DIR}/${scenario}.state"
-  log_file="${LOG_DIR}/${scenario}.log"
-
-  prime_state "${scenario}"
-
-  total_duration=$(run_fast_deploy "${worktree}" "${state_file}" "${log_file}")
-  build_duration=$(extract_duration "${log_file}" "Builds")
-  if observed=$(validate_plan "${log_file}" 0 0 0); then
-    pass=PASS
-  else
-    pass=FAIL
-  fi
-  notes="clean tree should print no-op plan"
-
-  if ! grep -q "No new local changes since last deploy." "${log_file}"; then
-    pass=FAIL
-    notes="missing no-op message"
-  fi
-
-  record_result \
-    "${scenario}" \
-    "auto" \
-    "build gateway=0, build supervisor=0, helm upgrade=0" \
-    "${observed}" \
-    "${pass}" \
-    "${total_duration}" \
-    "${build_duration:-n/a}" \
-    "$(count_cached_lines "${log_file}")" \
-    "${notes}"
-}
-
-run_explicit_targets_scenario() {
-  local scenario=explicit-targets
-  local target worktree state_file log_file total_duration build_duration observed pass expected notes
-
-  for target in gateway supervisor chart all; do
-    worktree=$(create_worktree "${scenario}-${target}")
-    state_file="${STATE_DIR}/${scenario}-${target}.state"
-    log_file="${LOG_DIR}/${scenario}-${target}.log"
-
-    total_duration=$(run_fast_deploy_args "${worktree}" "${state_file}" "${log_file}" "${target}")
-    build_duration=$(extract_duration "${log_file}" "Builds")
-
-    case "${target}" in
-      gateway)
-        if observed=$(validate_plan "${log_file}" 1 0 1); then
-          pass=PASS
-        else
-          pass=FAIL
-        fi
-        expected="build gateway=1, build supervisor=0, helm upgrade=1"
-        ;;
-      supervisor)
-        if observed=$(validate_plan "${log_file}" 0 1 0); then
-          pass=PASS
-        else
-          pass=FAIL
-        fi
-        expected="build gateway=0, build supervisor=1, helm upgrade=0"
-        ;;
-      chart)
-        if observed=$(validate_plan "${log_file}" 0 0 1); then
-          pass=PASS
-        else
-          pass=FAIL
-        fi
-        expected="build gateway=0, build supervisor=0, helm upgrade=1"
-        ;;
-      all)
-        if observed=$(validate_plan "${log_file}" 1 1 1); then
-          pass=PASS
-        else
-          pass=FAIL
-        fi
-        expected="build gateway=1, build supervisor=1, helm upgrade=1"
-        ;;
-    esac
-    notes="explicit target ${target}"
-    record_result \
-      "${scenario}:${target}" \
-      "explicit" \
-      "${expected}" \
-      "${observed}" \
-      "${pass}" \
-      "${total_duration}" \
-      "${build_duration:-n/a}" \
-      "$(count_cached_lines "${log_file}")" \
-      "${notes}"
-  done
-}
-
-run_cache_scenario() {
-  local scenario=$1
-  local file=$2
-  local marker=$3
-  local target=$4
-
-  local worktree state_file cold_log warm_log cold_total warm_total cold_build warm_build cold_cached warm_cached pass notes
-  worktree=$(create_worktree "${scenario}")
-  state_file="${STATE_DIR}/${scenario}.state"
-  cold_log="${LOG_DIR}/${scenario}-cold.log"
-  warm_log="${LOG_DIR}/${scenario}-warm.log"
-
-  append_marker "${worktree}/${file}" "${marker}"
-  clear_cache
-
-  cold_total=$(run_fast_deploy_args "${worktree}" "${state_file}" "${cold_log}" "${target}")
-  cold_build=$(extract_duration "${cold_log}" "Builds")
-  cold_cached=$(count_cached_lines "${cold_log}")
-
-  warm_total=$(run_fast_deploy_args "${worktree}" "${state_file}" "${warm_log}" "${target}")
-  warm_build=$(extract_duration "${warm_log}" "Builds")
-  warm_cached=$(count_cached_lines "${warm_log}")
-
-  pass=FAIL
-  notes="warm rebuild should be faster or show cache hits"
-
-  if [[ -n "${cold_build:-}" && -n "${warm_build:-}" && "${cold_build}" =~ ^[0-9]+$ && "${warm_build}" =~ ^[0-9]+$ && "${cold_build}" -gt 0 ]]; then
-    if [[ "${warm_build}" -le $((cold_build * 70 / 100)) ]]; then
-      pass=PASS
-      notes="warm build improved by at least 30%"
-    fi
-  fi
-
-  if [[ "${pass}" != "PASS" && "${warm_cached}" =~ ^[0-9]+$ && "${warm_cached}" -gt "${cold_cached:-0}" ]]; then
-    pass=PASS
-    notes="warm build showed more cache hits"
-  fi
-
-  record_result \
-    "${scenario}:cold" \
-    "cache" \
-    "first rebuild of ${target} after cache clear" \
-    "total=${cold_total}s, builds=${cold_build:-n/a}s" \
-    "INFO" \
-    "${cold_total}" \
-    "${cold_build:-n/a}" \
-    "${cold_cached}" \
-    "baseline cold run"
-
-  record_result \
-    "${scenario}:warm" \
-    "cache" \
-    "second rebuild of ${target} should reuse cache" \
-    "total=${warm_total}s, builds=${warm_build:-n/a}s" \
-    "${pass}" \
-    "${warm_total}" \
-    "${warm_build:-n/a}" \
-    "${warm_cached}" \
-    "${notes}"
-}
-
-run_container_invalidation_scenario() {
-  local scenario=container-invalidation
-  local worktree state_file prime_log rerun_log total_duration build_duration observed pass container_id notes
-  worktree=$(create_worktree "${scenario}")
-  state_file="${STATE_DIR}/${scenario}.state"
-  prime_log="${LOG_DIR}/${scenario}-prime.log"
-  rerun_log="${LOG_DIR}/${scenario}.log"
-
-  run_fast_deploy "${worktree}" "${state_file}" "${prime_log}" >/dev/null
-  container_id=$(current_cluster_container_id)
-  if [[ -z "${container_id}" ]]; then
-    echo "Error: could not determine cluster container ID for invalidation scenario." >&2
-    exit 1
-  fi
-
-  sed -i.bak "s|^container_id=.*$|container_id=invalidated-${container_id#sha256:}|" "${state_file}"
-  rm -f "${state_file}.bak"
-
-  total_duration=$(run_fast_deploy "${worktree}" "${state_file}" "${rerun_log}")
-  build_duration=$(extract_duration "${rerun_log}" "Builds")
-  if observed=$(validate_plan "${rerun_log}" 1 0 1); then
-    pass=PASS
-  else
-    pass=FAIL
-  fi
-  notes="mismatched container ID should invalidate gateway and helm only"
-
-  if [[ "${pass}" == "PASS" ]] && ! check_required_patterns "${rerun_log}" "Restarting gateway to pick up updated image...|Upgrading helm release..."; then
-    pass=FAIL
-    notes="${notes}; missing expected deploy log pattern"
-  fi
-
-  record_result \
-    "${scenario}" \
-    "auto" \
-    "build gateway=1, build supervisor=0, helm upgrade=1" \
-    "${observed}" \
-    "${pass}" \
-    "${total_duration}" \
-    "${build_duration:-n/a}" \
-    "$(count_cached_lines "${rerun_log}")" \
-    "${notes}"
-}
-
-printf 'scenario\tmode\texpected\tobserved\tpass\ttotal_seconds\tbuild_seconds\tcached_lines\tnotes\n' > "${SUMMARY_TSV}"
-
-require_cluster
-BUILDX_DRIVER=$(buildx_driver || true)
-
-for scenario in "${SCENARIOS[@]}"; do
-  case "${scenario}" in
-    noop)
-      run_noop_scenario
-      ;;
-    gateway-auto)
-      run_auto_scenario \
-        "gateway-auto" \
-        "crates/openshell-server/src/main.rs" \
-        "// fast deploy cache test: gateway-auto ${RUN_ID}" \
-        1 0 1 \
-        "gateway-only source change" \
-        "Pushing updated images to local registry...|Restarting gateway to pick up updated image...|Upgrading helm release..."
-      ;;
-    supervisor-auto)
-      run_auto_scenario \
-        "supervisor-auto" \
-        "crates/openshell-sandbox/src/main.rs" \
-        "// fast deploy cache test: supervisor-auto ${RUN_ID}" \
-        0 1 0 \
-        "supervisor-only source change" \
-        "Supervisor binary updated on cluster node."
-      ;;
-    shared-auto)
-      run_auto_scenario \
-        "shared-auto" \
-        "crates/openshell-policy/src/lib.rs" \
-        "// fast deploy cache test: shared-auto ${RUN_ID}" \
-        1 1 0 \
-        "shared dependency change should rebuild both binaries" \
-        "Restarting gateway to pick up updated image...|Supervisor binary updated on cluster node."
-      ;;
-    helm-auto)
-      run_auto_scenario \
-        "helm-auto" \
-        "deploy/helm/openshell/values.yaml" \
-        "# fast deploy cache test: helm-auto ${RUN_ID}" \
-        0 0 1 \
-        "chart-only change" \
-        "Upgrading helm release..."
-      ;;
-    unrelated-auto)
-      run_auto_scenario \
-        "unrelated-auto" \
-        "README.md" \
-        "" \
-        0 0 0 \
-        "unrelated file should stay a no-op" \
-        "No new local changes since last deploy."
-      ;;
-    explicit-targets)
-      run_explicit_targets_scenario
-      ;;
-    gateway-cache)
-      run_cache_scenario \
-        "gateway-cache" \
-        "crates/openshell-server/src/main.rs" \
-        "// fast deploy cache test: gateway-cache ${RUN_ID}" \
-        "gateway"
-      ;;
-    supervisor-cache)
-      run_cache_scenario \
-        "supervisor-cache" \
-        "crates/openshell-sandbox/src/main.rs" \
-        "// fast deploy cache test: supervisor-cache ${RUN_ID}" \
-        "supervisor"
-      ;;
-    container-invalidation)
-      run_container_invalidation_scenario
-      ;;
-    *)
-      echo "Unknown scenario '${scenario}'" >&2
-      exit 1
-      ;;
-  esac
-done
-
-write_summary_md
-
-echo "Fast deploy cache test report written to:"
-echo " ${SUMMARY_MD}"

From 7784b50924c725c1000464d551d8b9f19cf90831 Mon Sep 17 00:00:00 2001
From: Drew Newberry
Date: Wed, 18 Mar 2026 14:02:54 -0700
Subject: [PATCH 09/15] fix(build): include Dockerfile and build scripts in change detection

The supervisor committed-tree fingerprint was missing
deploy/docker/Dockerfile.images, and neither gateway nor supervisor included
tasks/scripts/docker-build-image.sh. Changes to these files (e.g. from
rebasing main) would not trigger a rebuild. Align the git ls-tree paths
with the matches_* functions so committed and uncommitted changes are
detected consistently.
---
 tasks/scripts/cluster-deploy-fast.sh | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/tasks/scripts/cluster-deploy-fast.sh b/tasks/scripts/cluster-deploy-fast.sh
index a9831f31..d9c022de 100755
--- a/tasks/scripts/cluster-deploy-fast.sh
+++ b/tasks/scripts/cluster-deploy-fast.sh
@@ -149,13 +149,13 @@ matches_gateway() {
     Cargo.toml|Cargo.lock|proto/*|deploy/docker/cross-build.sh)
       return 0
       ;;
-    crates/openshell-core/*|crates/openshell-policy/*|crates/openshell-providers/*)
+    deploy/docker/Dockerfile.images|tasks/scripts/docker-build-image.sh)
       return 0
       ;;
-    crates/openshell-router/*)
+    crates/openshell-core/*|crates/openshell-policy/*|crates/openshell-providers/*)
       return 0
       ;;
-    crates/openshell-server/*|deploy/docker/Dockerfile.images)
+    crates/openshell-router/*|crates/openshell-server/*)
       return 0
       ;;
     *)
@@ -170,10 +170,13 @@ matches_supervisor() {
     Cargo.toml|Cargo.lock|proto/*|deploy/docker/cross-build.sh)
       return 0
       ;;
+    deploy/docker/Dockerfile.images|tasks/scripts/docker-build-image.sh)
+      return 0
+      ;;
     crates/openshell-core/*|crates/openshell-policy/*|crates/openshell-router/*)
       return 0
       ;;
-    crates/openshell-sandbox/*|deploy/docker/Dockerfile.images)
+    crates/openshell-sandbox/*)
       return 0
       ;;
     *)
@@ -206,10 +209,10 @@ compute_fingerprint() {
   local committed_trees=""
   case "${component}" in
     gateway)
-      committed_trees=$(git ls-tree HEAD Cargo.toml Cargo.lock proto/ deploy/docker/cross-build.sh crates/openshell-core/ crates/openshell-policy/ crates/openshell-providers/ crates/openshell-router/ crates/openshell-server/ deploy/docker/Dockerfile.images 2>/dev/null || true)
+      committed_trees=$(git ls-tree HEAD Cargo.toml Cargo.lock proto/ deploy/docker/cross-build.sh deploy/docker/Dockerfile.images tasks/scripts/docker-build-image.sh crates/openshell-core/ crates/openshell-policy/ crates/openshell-providers/ crates/openshell-router/ crates/openshell-server/ 2>/dev/null || true)
       ;;
     supervisor)
-      committed_trees=$(git ls-tree HEAD Cargo.toml Cargo.lock proto/ deploy/docker/cross-build.sh crates/openshell-core/ crates/openshell-policy/ crates/openshell-router/ crates/openshell-sandbox/ 2>/dev/null || true)
+      committed_trees=$(git ls-tree HEAD Cargo.toml Cargo.lock proto/ deploy/docker/cross-build.sh deploy/docker/Dockerfile.images tasks/scripts/docker-build-image.sh crates/openshell-core/ crates/openshell-policy/ crates/openshell-router/ crates/openshell-sandbox/ 2>/dev/null || true)
       ;;
     helm)
       committed_trees=$(git ls-tree HEAD deploy/helm/openshell/ 2>/dev/null || true)

From 1185215285cd70eed401e72a0d1c083462444fb1 Mon Sep 17 00:00:00 2001
From: Drew Newberry
Date: Wed, 18 Mar 2026 14:15:14 -0700
Subject: [PATCH 10/15] fix(build): remove unused OPENSHELL_IMAGE_TAG ARG from wheel/CLI Dockerfiles

This ARG was declared early but never referenced, causing every layer
below it to be cache-busted whenever the image tag changed. Removing it
lets dependency-install and toolchain layers stay cached across tag
changes.
---
 deploy/docker/Dockerfile.cli-macos           | 1 -
 deploy/docker/Dockerfile.python-wheels       | 1 -
 deploy/docker/Dockerfile.python-wheels-macos | 1 -
 3 files changed, 3 deletions(-)

diff --git a/deploy/docker/Dockerfile.cli-macos b/deploy/docker/Dockerfile.cli-macos
index 2e93acea..08485e33 100644
--- a/deploy/docker/Dockerfile.cli-macos
+++ b/deploy/docker/Dockerfile.cli-macos
@@ -18,7 +18,6 @@ FROM ${OSXCROSS_IMAGE} AS osxcross
 FROM python:3.12-slim AS builder
-ARG OPENSHELL_IMAGE_TAG
 ARG CARGO_TARGET_CACHE_SCOPE=default
 ENV PATH="/root/.cargo/bin:/usr/local/bin:/osxcross/bin:${PATH}"
diff --git a/deploy/docker/Dockerfile.python-wheels b/deploy/docker/Dockerfile.python-wheels
index a8e299c3..6e853e98 100644
--- a/deploy/docker/Dockerfile.python-wheels
+++ b/deploy/docker/Dockerfile.python-wheels
@@ -28,7 +28,6 @@ FROM base AS builder
 ARG TARGETARCH
 ARG BUILDARCH
-ARG OPENSHELL_IMAGE_TAG
 ARG CARGO_TARGET_CACHE_SCOPE=default
 ARG SCCACHE_MEMCACHED_ENDPOINT
diff --git a/deploy/docker/Dockerfile.python-wheels-macos b/deploy/docker/Dockerfile.python-wheels-macos
index ec16f411..a2696c9b 100644
--- a/deploy/docker/Dockerfile.python-wheels-macos
+++ b/deploy/docker/Dockerfile.python-wheels-macos
@@ -12,7 +12,6 @@ FROM ${OSXCROSS_IMAGE} AS osxcross
 FROM python:${PYTHON_VERSION}-slim AS builder
 ARG TARGETARCH
-ARG OPENSHELL_IMAGE_TAG
 ARG CARGO_TARGET_CACHE_SCOPE=default
 ENV PATH="/root/.cargo/bin:/usr/local/bin:/osxcross/bin:${PATH}"

From 4a545fb7bb020d8af186145e772c5449383eadd1 Mon Sep 17 00:00:00 2001
From: Drew Newberry
Date: Wed, 18 Mar 2026 14:17:05 -0700
Subject: [PATCH 11/15] chore(build): remove unused ECR mode from docker-publish-multiarch

---
 tasks/scripts/docker-publish-multiarch.sh | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/tasks/scripts/docker-publish-multiarch.sh b/tasks/scripts/docker-publish-multiarch.sh
index 6847c90f..0808e327 100755
--- a/tasks/scripts/docker-publish-multiarch.sh
+++ b/tasks/scripts/docker-publish-multiarch.sh
@@ -6,7 +6,7 @@
 set -euo pipefail
 usage() {
-  echo "Usage: docker-publish-multiarch.sh --mode " >&2
+  echo "Usage: docker-publish-multiarch.sh --mode registry" >&2
   exit 1
 }
@@ -43,11 +43,6 @@ case "${MODE}" in
   registry)
     REGISTRY=${DOCKER_REGISTRY:?Set DOCKER_REGISTRY to push multi-arch images (e.g. ghcr.io/myorg)}
     ;;
-  ecr)
-    AWS_ACCOUNT_ID=${AWS_ACCOUNT_ID:-012345678901}
-    AWS_REGION=${AWS_REGION:-us-west-2}
-    REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/openshell"
-    ;;
   *)
     echo "Unknown mode: ${MODE}" >&2
     usage

From 368b5013dd6732c6424ebd290d09691dfa18dc22 Mon Sep 17 00:00:00 2001
From: Drew Newberry
Date: Wed, 18 Mar 2026 14:25:48 -0700
Subject: [PATCH 12/15] refactor(build): consolidate Docker build scripts

- Remove docker-build-cluster.sh: helm packaging is now inlined into
  docker-build-image.sh when the target is 'cluster'
- Remove docker-build-component.sh: the gateway case was a passthrough
  to docker-build-image.sh; the CI case is now docker-build-ci.sh
- Simplify docker-publish-multiarch.sh: remove --mode flag since only
  'registry' mode remains after ECR removal
- Remove dead docker:publish:cluster:multiarch ECR task from docker.toml
- Update all callers (cluster-deploy-fast, cluster-bootstrap,
  cluster-push-component, remote-deploy, publish.toml)

The build entry point is now docker-build-image.sh for all Rust targets
(gateway, cluster, supervisor-builder, supervisor-output) and
docker-build-ci.sh for the CI image.
---
 scripts/remote-deploy.sh                  |  4 +--
 tasks/docker.toml                         | 13 +++-----
 tasks/publish.toml                        |  4 +--
 tasks/scripts/cluster-bootstrap.sh        |  4 +--
 tasks/scripts/cluster-deploy-fast.sh      |  2 +-
 tasks/scripts/cluster-push-component.sh   |  2 +-
 tasks/scripts/docker-build-ci.sh          | 26 +++++++++++++++
 tasks/scripts/docker-build-cluster.sh     | 14 --------
 tasks/scripts/docker-build-component.sh   | 37 ---------------------
 tasks/scripts/docker-build-image.sh       |  6 ++++
 tasks/scripts/docker-publish-multiarch.sh | 39 +++--------------------
 11 files changed, 48 insertions(+), 103 deletions(-)
 create mode 100755 tasks/scripts/docker-build-ci.sh
 delete mode 100755 tasks/scripts/docker-build-cluster.sh
 delete mode 100755 tasks/scripts/docker-build-component.sh

diff --git a/scripts/remote-deploy.sh b/scripts/remote-deploy.sh
index 579f117f..37a0dca8 100755
--- a/scripts/remote-deploy.sh
+++ b/scripts/remote-deploy.sh
@@ -235,8 +235,8 @@ rm -f .env
 echo "==> Building Docker images (tag=${IMAGE_TAG})..."
 export OPENSHELL_CARGO_VERSION="${CARGO_VERSION}"
 export IMAGE_TAG
-mise exec -- tasks/scripts/docker-build-cluster.sh
-mise exec -- tasks/scripts/docker-build-component.sh gateway
+mise exec -- tasks/scripts/docker-build-image.sh cluster
+mise exec -- tasks/scripts/docker-build-image.sh gateway
 export OPENSHELL_CLUSTER_IMAGE="openshell/cluster:${IMAGE_TAG}"
 export OPENSHELL_PUSH_IMAGES="openshell/gateway:${IMAGE_TAG}"
diff --git a/tasks/docker.toml b/tasks/docker.toml
index 4ca52e56..b4e370ff 100644
--- a/tasks/docker.toml
+++ b/tasks/docker.toml
@@ -18,7 +18,7 @@ hide = true
 ["build:docker:ci"]
 description = "Build the CI Docker image"
-run = "tasks/scripts/docker-build-component.sh ci"
+run = "tasks/scripts/docker-build-ci.sh"
 hide = true
 ["docker:build:ci"]
@@ -28,7 +28,7 @@ hide = true
 ["build:docker:gateway"]
 description = "Build the gateway Docker image"
-run = "tasks/scripts/docker-build-component.sh gateway"
+run = "tasks/scripts/docker-build-image.sh gateway"
 hide = true
 ["docker:build:gateway"]
@@ -38,7 +38,7 @@ hide = true
 ["build:docker:cluster"]
 description = "Build the k3s cluster image (component images pulled at runtime from registry)"
-run = "tasks/scripts/docker-build-cluster.sh"
+run = "tasks/scripts/docker-build-image.sh cluster"
 hide = true
 ["docker:build:cluster"]
@@ -48,7 +48,7 @@ hide = true
 ["build:docker:cluster:multiarch"]
 description = "Build multi-arch cluster image and push to a registry"
-run = "tasks/scripts/docker-publish-multiarch.sh --mode registry"
+run = "tasks/scripts/docker-publish-multiarch.sh"
 hide = true
 ["docker:build:cluster:multiarch"]
@@ -56,11 +56,6 @@ description = "Alias for build:docker:cluster:multiarch"
 depends = ["build:docker:cluster:multiarch"]
 hide = true
-["docker:publish:cluster:multiarch"]
-description = "Build and publish multi-arch cluster image to ECR"
-run = "tasks/scripts/docker-publish-multiarch.sh --mode ecr"
-hide = true
-
 ["docker:cleanup"]
 description = "Remove stale images, volumes, and build cache not used by the current cluster"
 run = "scripts/docker-cleanup.sh --force"
diff --git a/tasks/publish.toml b/tasks/publish.toml
index c7ade875..7845a1ab 100644
--- a/tasks/publish.toml
+++ b/tasks/publish.toml
@@ -11,7 +11,7 @@ set -euo pipefail
 VERSION_DOCKER=$(uv run python tasks/scripts/release.py get-version --docker)
 VERSION_PYTHON=$(uv run python tasks/scripts/release.py get-version --python)
 CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo)
-IMAGE_TAG=dev TAG_LATEST=true EXTRA_DOCKER_TAGS="$VERSION_DOCKER" mise run docker:publish:cluster:multiarch
+IMAGE_TAG=dev TAG_LATEST=true EXTRA_DOCKER_TAGS="$VERSION_DOCKER" mise run build:docker:cluster:multiarch
 OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run build:python:wheel:multiarch
 OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run build:python:wheel:macos
 uv run python tasks/scripts/release.py python-publish --version "$VERSION_PYTHON"
@@ -26,7 +26,7 @@ set -euo pipefail
 VERSION_DOCKER=$(uv run python tasks/scripts/release.py get-version --docker)
 VERSION_PYTHON=$(uv run python tasks/scripts/release.py get-version --python)
 CARGO_VERSION=$(uv run python tasks/scripts/release.py get-version --cargo)
-IMAGE_TAG="$VERSION_DOCKER" TAG_LATEST=false mise run docker:publish:cluster:multiarch
+IMAGE_TAG="$VERSION_DOCKER" TAG_LATEST=false mise run build:docker:cluster:multiarch
 OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run build:python:wheel:multiarch
 OPENSHELL_CARGO_VERSION="$CARGO_VERSION" mise run build:python:wheel:macos
 uv run python tasks/scripts/release.py python-publish --version "$VERSION_PYTHON"
diff --git a/tasks/scripts/cluster-bootstrap.sh b/tasks/scripts/cluster-bootstrap.sh
index f354daea..4eec288e 100755
--- a/tasks/scripts/cluster-bootstrap.sh
+++ b/tasks/scripts/cluster-bootstrap.sh
@@ -230,12 +230,12 @@ fi
 # and entrypoint from the working tree. This ensures the k3s container
 # always starts with the correct chart version.
 if [ "${SKIP_CLUSTER_IMAGE_BUILD:-}" != "1" ]; then
-  tasks/scripts/docker-build-cluster.sh
+  tasks/scripts/docker-build-image.sh cluster
 fi
 # In fast/build modes, use the locally-built cluster image rather than the
 # remote distribution registry image. The local image is built by
-# `docker-build-cluster.sh` and contains the bundled Helm chart and
+# `docker-build-image.sh cluster` and contains the bundled Helm chart and
 # manifests from the current working tree.
 if [ -z "${OPENSHELL_CLUSTER_IMAGE:-}" ]; then
   export OPENSHELL_CLUSTER_IMAGE="openshell/cluster:${IMAGE_TAG}"
diff --git a/tasks/scripts/cluster-deploy-fast.sh b/tasks/scripts/cluster-deploy-fast.sh
index d9c022de..600bdd6c 100755
--- a/tasks/scripts/cluster-deploy-fast.sh
+++ b/tasks/scripts/cluster-deploy-fast.sh
@@ -302,7 +302,7 @@ build_start=$(date +%s)
 declare -a built_components=()
 if [[ "${build_gateway}" == "1" ]]; then
-  tasks/scripts/docker-build-component.sh gateway
+  tasks/scripts/docker-build-image.sh gateway
 fi
 # Build the supervisor binary and docker cp it into the running k3s cluster.
diff --git a/tasks/scripts/cluster-push-component.sh b/tasks/scripts/cluster-push-component.sh
index 4a78a6b5..a59192ef 100755
--- a/tasks/scripts/cluster-push-component.sh
+++ b/tasks/scripts/cluster-push-component.sh
@@ -49,7 +49,7 @@ done
 if [ -z "${resolved_source_image}" ]; then
   echo "Local image not found for ${component}:${IMAGE_TAG}, building..."
-  tasks/scripts/docker-build-component.sh "${component}"
+  tasks/scripts/docker-build-image.sh "${component}"
   resolved_source_image="openshell/${component}:${IMAGE_TAG}"
 fi
diff --git a/tasks/scripts/docker-build-ci.sh b/tasks/scripts/docker-build-ci.sh
new file mode 100755
index 00000000..d10c6f8b
--- /dev/null
+++ b/tasks/scripts/docker-build-ci.sh
@@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+# Build the CI Docker image (deploy/docker/Dockerfile.ci).
+# This is a standalone build, separate from the main image build graph.
+
+set -euo pipefail
+
+OUTPUT_ARGS=(--load)
+if [[ "${DOCKER_PUSH:-}" == "1" ]]; then
+  OUTPUT_ARGS=(--push)
+elif [[ "${DOCKER_PLATFORM:-}" == *","* ]]; then
+  OUTPUT_ARGS=(--push)
+fi
+
+exec docker buildx build \
+  ${DOCKER_BUILDER:+--builder ${DOCKER_BUILDER}} \
+  ${DOCKER_PLATFORM:+--platform ${DOCKER_PLATFORM}} \
+  -f deploy/docker/Dockerfile.ci \
+  -t "openshell/ci:${IMAGE_TAG:-dev}" \
+  --provenance=false \
+  "$@" \
+  ${OUTPUT_ARGS[@]+"${OUTPUT_ARGS[@]}"} \
+  .
diff --git a/tasks/scripts/docker-build-cluster.sh b/tasks/scripts/docker-build-cluster.sh
deleted file mode 100755
index 425d8e75..00000000
--- a/tasks/scripts/docker-build-cluster.sh
+++ /dev/null
@@ -1,14 +0,0 @@
-#!/usr/bin/env bash
-
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-
-set -euo pipefail
-
-mkdir -p deploy/docker/.build/charts
-
-echo "Packaging helm chart..."
-helm package deploy/helm/openshell -d deploy/docker/.build/charts/
-
-echo "Building cluster image..."
-exec tasks/scripts/docker-build-image.sh cluster "$@"
diff --git a/tasks/scripts/docker-build-component.sh b/tasks/scripts/docker-build-component.sh
deleted file mode 100755
index 312e5ac0..00000000
--- a/tasks/scripts/docker-build-component.sh
+++ /dev/null
@@ -1,37 +0,0 @@
-#!/usr/bin/env bash
-
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-
-set -euo pipefail
-
-COMPONENT=${1:?"Usage: docker-build-component.sh [extra-args...]"}
-shift
-
-case "${COMPONENT}" in
-  gateway)
-    exec tasks/scripts/docker-build-image.sh gateway "$@"
-    ;;
-  ci)
-    OUTPUT_ARGS=(--load)
-    if [[ "${DOCKER_PUSH:-}" == "1" ]]; then
-      OUTPUT_ARGS=(--push)
-    elif [[ "${DOCKER_PLATFORM:-}" == *","* ]]; then
-      OUTPUT_ARGS=(--push)
-    fi
-
-    exec docker buildx build \
-      ${DOCKER_BUILDER:+--builder ${DOCKER_BUILDER}} \
-      ${DOCKER_PLATFORM:+--platform ${DOCKER_PLATFORM}} \
-      -f deploy/docker/Dockerfile.ci \
-      -t "openshell/ci:${IMAGE_TAG:-dev}" \
-      --provenance=false \
-      "$@" \
-      ${OUTPUT_ARGS[@]+"${OUTPUT_ARGS[@]}"} \
-      .
-    ;;
-  *)
-    echo "Error: unsupported component '${COMPONENT}'" >&2
-    exit 1
-    ;;
-esac
diff --git a/tasks/scripts/docker-build-image.sh b/tasks/scripts/docker-build-image.sh
index 9a116514..ea2fa08b 100755
--- a/tasks/scripts/docker-build-image.sh
+++ b/tasks/scripts/docker-build-image.sh
@@ -120,6 +120,12 @@ RUST_SCOPE=${RUST_TOOLCHAIN_SCOPE:-$(detect_rust_scope "${DOCKERFILE}")}
 CACHE_SCOPE_INPUT="v2|shared|release|${LOCK_HASH}|${RUST_SCOPE}"
 CARGO_TARGET_CACHE_SCOPE=$(printf '%s' "${CACHE_SCOPE_INPUT}" | sha256_16_stdin)
+# The cluster image embeds the packaged Helm chart.
+if [[ "${TARGET}" == "cluster" ]]; then
+  mkdir -p deploy/docker/.build/charts
+  helm package deploy/helm/openshell -d deploy/docker/.build/charts/ >/dev/null
+fi
+
 K3S_ARGS=()
 if [[ "${TARGET}" == "cluster" && -n "${K3S_VERSION:-}" ]]; then
   K3S_ARGS=(--build-arg "K3S_VERSION=${K3S_VERSION}")
diff --git a/tasks/scripts/docker-publish-multiarch.sh b/tasks/scripts/docker-publish-multiarch.sh
index 0808e327..07459c23 100755
--- a/tasks/scripts/docker-publish-multiarch.sh
+++ b/tasks/scripts/docker-publish-multiarch.sh
@@ -3,29 +3,12 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-set -euo pipefail
-
-usage() {
-  echo "Usage: docker-publish-multiarch.sh --mode registry" >&2
-  exit 1
-}
-
-MODE=""
-while [[ $# -gt 0 ]]; do
-  case "$1" in
-    --mode)
-      MODE="$2"
-      shift 2
-      ;;
-    *)
-      echo "Unknown argument: $1" >&2
-      usage
-      ;;
-  esac
-done
+# Build multi-arch gateway + cluster images and push to a container registry.
+# Requires DOCKER_REGISTRY to be set (e.g. ghcr.io/myorg).
-[[ -n "${MODE}" ]] || usage
+set -euo pipefail
+REGISTRY=${DOCKER_REGISTRY:?Set DOCKER_REGISTRY to push multi-arch images (e.g. ghcr.io/myorg)}
 IMAGE_TAG=${IMAGE_TAG:-dev}
 PLATFORMS=${DOCKER_PLATFORMS:-linux/amd64,linux/arm64}
 TAG_LATEST=${TAG_LATEST:-false}
@@ -39,16 +22,6 @@ if [[ -n "${EXTRA_DOCKER_TAGS_RAW}" ]]; then
   done
 fi
-case "${MODE}" in
-  registry)
-    REGISTRY=${DOCKER_REGISTRY:?Set DOCKER_REGISTRY to push multi-arch images (e.g. ghcr.io/myorg)}
-    ;;
-  *)
-    echo "Unknown mode: ${MODE}" >&2
-    usage
-    ;;
-esac
-
 BUILDER_NAME=${DOCKER_BUILDER:-multiarch}
 if docker buildx inspect "${BUILDER_NAME}" >/dev/null 2>&1; then
   echo "Using existing buildx builder: ${BUILDER_NAME}"
@@ -66,10 +39,6 @@ export IMAGE_REGISTRY="${REGISTRY}"
 echo "Building multi-arch gateway image..."
 tasks/scripts/docker-build-image.sh gateway
-mkdir -p deploy/docker/.build/charts
-echo "Packaging helm chart..."
-helm package deploy/helm/openshell -d deploy/docker/.build/charts/
-
 echo
 echo "Building multi-arch cluster image..."
 tasks/scripts/docker-build-image.sh cluster

From 4cf847c86d82f58bc86485785f99175896923499 Mon Sep 17 00:00:00 2001
From: Drew Newberry
Date: Wed, 18 Mar 2026 14:31:21 -0700
Subject: [PATCH 13/15] chore(ci): change auto-tag schedule from 7 PM to 7 AM PDT

---
 .github/workflows/release-auto-tag.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/release-auto-tag.yml b/.github/workflows/release-auto-tag.yml
index c0722853..97efa81d 100644
--- a/.github/workflows/release-auto-tag.yml
+++ b/.github/workflows/release-auto-tag.yml
@@ -6,7 +6,7 @@ name: Release Auto-Tag
 on:
   workflow_dispatch: {}
   schedule:
-    - cron: "0 2 * * *" # 7 PM PDT
+    - cron: "0 14 * * *" # 7 AM PDT
 
 permissions:
   contents: write

From 3a77e9e46b2f46c547552d0348662815d892dcfe Mon Sep 17 00:00:00 2001
From: Drew Newberry
Date: Wed, 18 Mar 2026 14:40:05 -0700
Subject: [PATCH 14/15] revert: restore Dockerfile.cli-macos and Dockerfile.python-wheels to main

---
 deploy/docker/Dockerfile.cli-macos     | 2 ++
 deploy/docker/Dockerfile.python-wheels | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/deploy/docker/Dockerfile.cli-macos b/deploy/docker/Dockerfile.cli-macos
index 08485e33..c495d533 100644
--- a/deploy/docker/Dockerfile.cli-macos
+++ b/deploy/docker/Dockerfile.cli-macos
@@ -103,6 +103,8 @@ RUN touch crates/openshell-cli/src/main.rs \
     crates/openshell-core/build.rs \
     proto/*.proto
 
+# Declare version ARG here (not earlier) so the git-hash-bearing value does not
+# invalidate the expensive dependency-build layers above on every commit.
 ARG OPENSHELL_CARGO_VERSION
 RUN --mount=type=cache,id=cargo-registry-cli-macos,sharing=locked,target=/root/.cargo/registry \
     --mount=type=cache,id=cargo-git-cli-macos,sharing=locked,target=/root/.cargo/git \
diff --git a/deploy/docker/Dockerfile.python-wheels b/deploy/docker/Dockerfile.python-wheels
index 6e853e98..be0c62af 100644
--- a/deploy/docker/Dockerfile.python-wheels
+++ b/deploy/docker/Dockerfile.python-wheels
@@ -84,6 +84,8 @@ RUN touch crates/openshell-cli/src/main.rs \
     crates/openshell-core/build.rs \
     proto/*.proto
 
+# Declare version ARG here (not earlier) so the git-hash-bearing value does not
+# invalidate the expensive dependency-build layers above on every commit.
 ARG OPENSHELL_CARGO_VERSION
 RUN --mount=type=cache,id=cargo-registry-python-wheels-${TARGETARCH},sharing=locked,target=/root/.cargo/registry \
     --mount=type=cache,id=cargo-git-python-wheels-${TARGETARCH},sharing=locked,target=/root/.cargo/git \

From c5474c1c99f6add52312451f838cac4a03a0fe5a Mon Sep 17 00:00:00 2001
From: Drew Newberry
Date: Wed, 18 Mar 2026 14:41:45 -0700
Subject: [PATCH 15/15] revert: restore Dockerfile.python-wheels-macos to main

---
 deploy/docker/Dockerfile.python-wheels-macos | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/deploy/docker/Dockerfile.python-wheels-macos b/deploy/docker/Dockerfile.python-wheels-macos
index a2696c9b..d8885f97 100644
--- a/deploy/docker/Dockerfile.python-wheels-macos
+++ b/deploy/docker/Dockerfile.python-wheels-macos
@@ -91,6 +91,8 @@ RUN touch crates/openshell-cli/src/main.rs \
     crates/openshell-core/build.rs \
     proto/*.proto
 
+# Declare version ARG here (not earlier) so the git-hash-bearing value does not
+# invalidate the expensive dependency-build layers above on every commit.
 ARG OPENSHELL_CARGO_VERSION
 RUN --mount=type=cache,id=cargo-registry-python-wheels-macos-${TARGETARCH},sharing=locked,target=/root/.cargo/registry \
     --mount=type=cache,id=cargo-git-python-wheels-macos-${TARGETARCH},sharing=locked,target=/root/.cargo/git \
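
Note on the `ARG` placement restored in patches 14 and 15: the comment relies on how Docker's build cache treats build arguments. A changed `ARG` value can cause a cache miss only in instructions that come after the declaration, so declaring it late keeps the earlier dependency-build layers reusable. A minimal sketch of the pattern follows, assuming a hypothetical single-crate layout and base image. Only the name `OPENSHELL_CARGO_VERSION` comes from the patches; the paths, the build commands, and how the real Dockerfiles consume the value are illustrative assumptions.

```dockerfile
# Illustrative only: base image, paths, and build commands are assumptions,
# not taken from the OpenShell Dockerfiles.
FROM rust:1 AS builder
WORKDIR /build

# Dependency layer: no version ARG is in scope yet, so this RUN stays cached
# across commits as long as the manifests are unchanged.
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo 'fn main() {}' > src/main.rs && cargo build --release

# Real sources arrive after the dependency build.
COPY src ./src

# Declared late: a new value (for example one that embeds a git hash) can only
# invalidate the RUN instructions below this line.
ARG OPENSHELL_CARGO_VERSION
RUN touch src/main.rs && cargo build --release
```

Moving the `ARG` line above the first `RUN` in this sketch would make every new value rebuild the dependency layer as well, which is the cost the restored comment warns about.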