Add weekly forward compatibility testing #1884
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,74 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| # Copyright NVIDIA CORPORATION | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| set -euo pipefail | ||
|
|
||
| # Usage: generate-values-overrides.sh OUTPUT_FILE TOOLKIT_IMAGE DEVICE_PLUGIN_IMAGE MIG_MANAGER_IMAGE | ||
| # | ||
| # Generates a Helm values override file for GPU Operator component images. | ||
| # This file can be used with `helm install -f values-overrides.yaml` to | ||
| # override default component image versions. | ||
|
|
||
| if [[ $# -ne 4 ]]; then | ||
| echo "Usage: $0 OUTPUT_FILE TOOLKIT_IMAGE DEVICE_PLUGIN_IMAGE MIG_MANAGER_IMAGE" >&2 | ||
| echo "" >&2 | ||
| echo "Example:" >&2 | ||
| echo " $0 values.yaml \\" >&2 | ||
| echo " ghcr.io/nvidia/container-toolkit:v1.18.0-ubuntu20.04 \\" >&2 | ||
| echo " ghcr.io/nvidia/k8s-device-plugin:v0.17.0-ubi8 \\" >&2 | ||
| echo " ghcr.io/nvidia/k8s-mig-manager:v0.10.0-ubuntu20.04" >&2 | ||
| exit 1 | ||
| fi | ||
|
|
||
| OUTPUT_FILE="$1" | ||
| TOOLKIT_IMAGE="$2" | ||
| DEVICE_PLUGIN_IMAGE="$3" | ||
| MIG_MANAGER_IMAGE="$4" | ||
|
|
||
| # Generate values override file | ||
| cat > "${OUTPUT_FILE}" <<EOF | ||
| # Generated by generate-values-overrides.sh | ||
| # Date: $(date -u +"%Y-%m-%d %H:%M:%S UTC") | ||
| # | ||
| # This file overrides default GPU Operator component images with | ||
| # specific versions for forward compatibility testing. | ||
|
|
||
| toolkit: | ||
| repository: "" | ||
| version: "" | ||
| image: "${TOOLKIT_IMAGE}" | ||
|
|
||
| devicePlugin: | ||
| repository: "" | ||
| version: "" | ||
| image: "${DEVICE_PLUGIN_IMAGE}" | ||
|
|
||
| migManager: | ||
| repository: "" | ||
| version: "" | ||
| image: "${MIG_MANAGER_IMAGE}" | ||
| EOF | ||
|
|
||
| echo "Generated values override file: ${OUTPUT_FILE}" | ||
| echo "" | ||
| echo "=== Component Images ===" | ||
| echo "Container Toolkit: ${TOOLKIT_IMAGE}" | ||
| echo "Device Plugin: ${DEVICE_PLUGIN_IMAGE}" | ||
| echo "MIG Manager: ${MIG_MANAGER_IMAGE}" | ||
| echo "" | ||
| echo "=== File Contents ===" | ||
| cat "${OUTPUT_FILE}" | ||
|
|
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,103 @@ | ||||||
| #!/bin/bash | ||||||
| # Copyright NVIDIA CORPORATION | ||||||
| # | ||||||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||||||
| # you may not use this file except in compliance with the License. | ||||||
| # You may obtain a copy of the License at | ||||||
| # | ||||||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||||||
| # | ||||||
| # Unless required by applicable law or agreed to in writing, software | ||||||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||||||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||||
| # See the License for the specific language governing permissions and | ||||||
| # limitations under the License. | ||||||
|
|
||||||
| set -euo pipefail | ||||||
|
|
||||||
| COMPONENT=${1:-} | ||||||
|
|
||||||
| if [[ -z "${COMPONENT}" ]]; then | ||||||
| echo "Usage: $0 <toolkit|device-plugin|mig-manager>" >&2 | ||||||
| exit 1 | ||||||
| fi | ||||||
|
|
||||||
| # Verify regctl is available | ||||||
| if ! command -v regctl &> /dev/null; then | ||||||
| echo "Error: regctl not found. Please install regctl first." >&2 | ||||||
| exit 1 | ||||||
| fi | ||||||
|
|
||||||
| # Map component names to GHCR image repositories and GitHub source repositories | ||||||
| case "${COMPONENT}" in | ||||||
| toolkit) | ||||||
| IMAGE_REPO="ghcr.io/nvidia/container-toolkit" | ||||||
| GITHUB_REPO="NVIDIA/container-toolkit" | ||||||
| ;; | ||||||
| device-plugin) | ||||||
| IMAGE_REPO="ghcr.io/nvidia/k8s-device-plugin" | ||||||
| GITHUB_REPO="NVIDIA/k8s-device-plugin" | ||||||
| ;; | ||||||
| mig-manager) | ||||||
| IMAGE_REPO="ghcr.io/nvidia/k8s-mig-manager" | ||||||
| GITHUB_REPO="NVIDIA/k8s-mig-manager" | ||||||
| ;; | ||||||
| *) | ||||||
| echo "Error: Unknown component '${COMPONENT}'" >&2 | ||||||
| echo "Valid components: toolkit, device-plugin, mig-manager" >&2 | ||||||
| exit 1 | ||||||
| ;; | ||||||
| esac | ||||||
|
|
||||||
| echo "Fetching latest commit from ${GITHUB_REPO}..." >&2 | ||||||
|
|
||||||
| # Get the latest commit SHA from the main branch using GitHub API | ||||||
| GITHUB_API_URL="https://api.github.com/repos/${GITHUB_REPO}/commits/main" | ||||||
|
|
||||||
| # Use GITHUB_TOKEN if available for authentication (higher rate limits) | ||||||
| if [[ -n "${GITHUB_TOKEN:-}" ]]; then | ||||||
| LATEST_COMMIT=$(curl -sSL \ | ||||||
| -H "Authorization: Bearer ${GITHUB_TOKEN}" \ | ||||||
| -H "Accept: application/vnd.github.v3+json" \ | ||||||
| "${GITHUB_API_URL}" | \ | ||||||
| jq -r '.sha[0:8]') | ||||||
| else | ||||||
| LATEST_COMMIT=$(curl -sSL \ | ||||||
| -H "Accept: application/vnd.github.v3+json" \ | ||||||
| "${GITHUB_API_URL}" | \ | ||||||
| jq -r '.sha[0:8]') | ||||||
| fi | ||||||
|
|
||||||
| if [[ -z "${LATEST_COMMIT}" || "${LATEST_COMMIT}" == "null" ]]; then | ||||||
| echo "Error: Failed to fetch latest commit from ${GITHUB_REPO}" >&2 | ||||||
| exit 1 | ||||||
| fi | ||||||
|
|
||||||
| echo "Latest commit SHA: ${LATEST_COMMIT}" >&2 | ||||||
|
|
||||||
| # Construct full image path with commit tag | ||||||
| FULL_IMAGE="${IMAGE_REPO}:${LATEST_COMMIT}" | ||||||
|
|
||||||
| echo "Verifying image exists: ${FULL_IMAGE}" >&2 | ||||||
|
|
||||||
| # Verify the image exists using regctl with retry | ||||||
| MAX_RETRIES=5 | ||||||
| RETRY_DELAY=30 | ||||||
| for i in $(seq 1 ${MAX_RETRIES}); do | ||||||
| if regctl manifest head "${FULL_IMAGE}" &> /dev/null; then | ||||||
| echo "Verified ${COMPONENT} image: ${FULL_IMAGE}" >&2 | ||||||
| echo "${FULL_IMAGE}" | ||||||
| exit 0 | ||||||
| fi | ||||||
|
|
||||||
| if [[ $i -lt ${MAX_RETRIES} ]]; then | ||||||
| echo "Image not found (attempt $i/${MAX_RETRIES}), waiting ${RETRY_DELAY}s for CI to build..." >&2 | ||||||
| sleep ${RETRY_DELAY} | ||||||
| # Exponential backoff: 30s, 60s, 120s, 240s | ||||||
|
||||||
| # Exponential backoff: 30s, 60s, 120s, 240s | |
| # Exponential backoff between attempts: 30s, 60s, 120s, 240s (5 attempts, 4 waits) |
Copilot AI, Feb 6, 2026
The forward compatibility workflow fetches the latest commit SHA from component repositories and immediately attempts to verify the corresponding image exists. However, there's a potential race condition: the commit may be very recent, and the CI pipeline in the component repository might not have finished building and publishing the image yet.
While the retry logic with exponential backoff (lines 84-99) helps mitigate this, the maximum wait time is approximately 7.5 minutes (30 + 60 + 120 + 240 seconds). For some repositories with slower CI pipelines, this might not be sufficient. Consider either increasing MAX_RETRIES, adjusting the backoff strategy, or adding a configurable delay before the first attempt to allow CI pipelines more time to complete.
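To make the trade-off concrete, here is a minimal sketch of how the worst-case wait could be computed from configurable values. It assumes the delay doubles after each failed attempt, as the script's comment describes. `INITIAL_DELAY` and the `total_wait` helper are hypothetical names for illustration and are not part of the PR; `MAX_RETRIES` and `RETRY_DELAY` are the script's existing variables, made overridable from the environment here.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical knobs (not in the PR); each can be overridden from the environment.
MAX_RETRIES="${MAX_RETRIES:-5}"      # total attempts
RETRY_DELAY="${RETRY_DELAY:-30}"     # first wait between attempts, in seconds
INITIAL_DELAY="${INITIAL_DELAY:-0}"  # optional wait before the first attempt

# total_wait INITIAL RETRIES DELAY
# Prints the worst-case wait in seconds: INITIAL plus RETRIES-1 waits,
# where each wait is double the previous one.
total_wait() {
  _total="$1"
  _delay="$3"
  _i=1
  while [ "$_i" -lt "$2" ]; do
    _total=$((_total + _delay))
    _delay=$((_delay * 2))
    _i=$((_i + 1))
  done
  echo "$_total"
}

echo "Worst-case wait: $(total_wait "$INITIAL_DELAY" "$MAX_RETRIES" "$RETRY_DELAY")s" >&2
```

With the defaults this reports 450 seconds, matching the 7.5 minutes above. Raising `MAX_RETRIES` to 6 and setting `INITIAL_DELAY` to 120 would extend the worst case to 1050 seconds (17.5 minutes), which may or may not be enough depending on how long the component CI pipelines take.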
The script relies on `jq` to parse JSON responses from the GitHub API but doesn't verify that `jq` is installed before attempting to use it. If `jq` is not available, the script will fail with a cryptic error rather than a clear message.
Consider adding a check similar to the `regctl` verification on lines 26-29 to ensure `jq` is available before proceeding, providing a helpful error message if it's missing.
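One possible shape for such a check is sketched below. It generalizes the existing `regctl` test into a single helper that reports every missing tool at once. The `require_tools` name is hypothetical and not part of the PR; the only command it relies on is the `command -v` lookup the script already uses.

```shell
#!/usr/bin/env bash
set -eu

# require_tools TOOL...
# Hypothetical helper (not in the PR): returns 0 if every named tool is on PATH,
# otherwise prints the missing ones to stderr and returns 1.
require_tools() {
  _missing=""
  for _tool in "$@"; do
    if ! command -v "$_tool" > /dev/null 2>&1; then
      _missing="${_missing} ${_tool}"
    fi
  done
  if [ -n "$_missing" ]; then
    echo "Error: required tools not found:${_missing}" >&2
    return 1
  fi
}
```

In the script under review this could be called once near the top, for example `require_tools regctl jq curl || exit 1`, so that a missing dependency is reported before any API request is made.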