Add weekly forward compatibility testing by ArangoGutierrez · Pull Request #1884 · NVIDIA/gpu-operator

ArangoGutierrez · 2025-11-10T17:43:40Z

Implement automated forward compatibility tests that validate GPU Operator
against the latest published images from NVIDIA component repositories.

Changes:

Add forward-compatibility.yaml workflow (weekly + manual trigger)
Create get-latest-images.sh to fetch latest commit-based images
Extend e2e-tests.yaml with optional component image inputs
Update install-operator.sh to support device-plugin and mig-manager overrides
Add Slack notifications for test failures
Create variables.yaml as reusable workflow for shared CI variables

Components tested:

ghcr.io/nvidia/container-toolkit
ghcr.io/nvidia/k8s-device-plugin
ghcr.io/nvidia/k8s-mig-manager

We could add other operands later, but I wanted to start with the core ones.

Schedule: Every Monday at 2 AM UTC

github-advanced-security · 2025-11-10T17:48:16Z

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

Copilot

Pull Request Overview

This PR implements automated weekly forward compatibility testing for the GPU Operator against the latest published component images from NVIDIA repositories. The changes refactor the existing monolithic CI workflow into reusable workflow modules and add forward compatibility testing infrastructure.

Key changes:

Refactored CI workflows into separate, reusable workflow files (variables, golang-checks, config-checks, image-builds, e2e-tests, release)
Added forward-compatibility.yaml workflow with scheduled weekly runs and manual trigger support
Created get-latest-images.sh script to fetch latest commit-based images from component repositories
Extended install-operator.sh to support device-plugin and mig-manager image overrides
Added Slack notifications for test failures

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
.github/workflows/forward-compatibility.yaml	New workflow for weekly forward compatibility testing with Slack notifications
.github/workflows/ci.yaml	Refactored to use reusable workflows instead of monolithic job definitions
.github/workflows/variables.yaml	New reusable workflow for shared CI variables
.github/workflows/e2e-tests.yaml	New reusable e2e test workflow with optional component image inputs
.github/workflows/image-builds.yaml	New reusable workflow for GPU operator image builds
.github/workflows/release.yaml	New reusable workflow for release processes
.github/workflows/golang-checks.yaml	New reusable workflow for Go checks and tests
.github/workflows/config-checks.yaml	New reusable workflow for configuration validation
.github/workflows/code-scanning.yaml	New reusable workflow for CodeQL scanning
.github/scripts/get-latest-images.sh	Script to fetch latest component images from NVIDIA repositories
tests/scripts/install-operator.sh	Extended to support device-plugin and mig-manager image overrides

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

.github/workflows/forward-compatibility.yaml

.github/workflows/ci.yaml

.github/workflows/forward-compatibility.yaml

ArangoGutierrez · 2025-11-26T17:59:08Z

PING @rahulait @rajathagasthya @tariq1890

cdesiniotis · 2025-11-26T23:02:30Z

.github/workflows/forward-compatibility.yaml

+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          # Use workflow_dispatch inputs if provided, otherwise fetch latest


Question: when would someone set these variables as inputs? If the scope of this workflow is to always test the top-of-tree images for all operands, then maybe we can remove the input variables and move the entire run: block into a shell script? Thoughts?

My intention is to allow manual triggering of the workflow in case we want to test a very specific image without having to wait for the "weekly" run. But I'll set it in a way that even during manual triggers, we will simply fetch all the latests images from operands.

cdesiniotis · 2025-11-26T23:13:32Z

.github/workflows/forward-compatibility.yaml

+      toolkit_image: ${{ needs.fetch-latest-images.outputs.toolkit_image }}
+      device_plugin_image: ${{ needs.fetch-latest-images.outputs.device_plugin_image }}
+      mig_manager_image: ${{ needs.fetch-latest-images.outputs.mig_manager_image }}


I understand the current tests/scripts/install-operator.sh has separate input variables for the various operand images, but would it simplify our CI definition if we passed a values override file instead? Just an idea -- the previous step fetch-latest-images can craft the values override file that points to all the latest operand images. This file is passed as input to the run-e2e-tests step which launches our e2e test script with this overrides file. Thoughts? The reason I am suggesting this is because I see a lot of boilerplate in forward-compatibility.yaml and e2e-tests.yaml for declaring all of these input variables.

I took a stab at this comment, please let me know what you think

cdesiniotis · 2025-11-26T23:19:12Z

.github/workflows/forward-compatibility.yaml

+      toolkit_image:
+        description: 'Override container-toolkit image'
+        required: false
+        type: string
+      device_plugin_image:
+        description: 'Override device-plugin image'
+        required: false
+        type: string
+      mig_manager_image:
+        description: 'Override mig-manager image'
+        required: false
+        type: string


Why do these variables need to be inputs? Aren't we always resolving the images in this workflow itself?

cdesiniotis · 2025-11-26T23:22:24Z

.github/workflows/ci.yaml

+    with:
+      operator_image: ${{ needs.variables.outputs.operator_image_base }}
+      operator_version: ${{ needs.variables.outputs.operator_version }}


Question -- could we potentially move this to the variables workflow itself? As in, make operator_image and operator_version as outputs of the variables worflow?

good catch, done!

cdesiniotis · 2025-11-26T23:23:22Z

.github/workflows/ci.yaml

+    with:
+      commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }}
+      operator_version: ${{ needs.variables.outputs.operator_version }}
+      operator_image_base: ${{ needs.variables.outputs.operator_image_base }}


Same comment as above -- is this really needed? Can we move these variables as outputs of the variables workflow itself?

same, done!

cdesiniotis · 2025-11-26T23:27:05Z

.github/workflows/e2e-tests.yaml

+    with:
+      operator_image: ${{ inputs.operator_image }}
+      operator_version: ${{ inputs.operator_version }}
+      toolkit_image: ${{ inputs.toolkit_image }}
+      device_plugin_image: ${{ inputs.device_plugin_image }}
+      mig_manager_image: ${{ inputs.mig_manager_image }}


What is the rationale for passing these variables as input to the variables workflow?

ArangoGutierrez · 2026-02-04T12:14:09Z

@cdesiniotis @rahulait @rahulait @tariq1890 PTAL

rajathagasthya · 2026-02-04T18:15:22Z

.github/scripts/generate-values-overrides.sh

@@ -0,0 +1,74 @@
+#!/usr/bin/env bash
+
+# Copyright 2024 NVIDIA CORPORATION


nit: remove year.

rajathagasthya · 2026-02-04T18:47:15Z

.github/workflows/e2e-tests.yaml

+          echo "instance_hostname=ubuntu@${{ steps.get_public_dns_name.outputs.result }}" >> $GITHUB_ENV
+          echo "private_key=${{ github.workspace }}/key.pem" >> $GITHUB_ENV
+          if [[ "${{ inputs.use_values_override }}" == "true" ]]; then
+            echo "VALUES_FILE=${{ github.workspace }}/values-overrides.yaml" >> $GITHUB_ENV


I may be missing something, but how does the remote instance know about VALUES_FILE? Do we need to pass that environment variable here?

gpu-operator/tests/local.sh

Line 43 in 6531839

NGC_API_KEY="${NGC_API_KEY}" \

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-05T11:01:11Z

.github/workflows/forward-compatibility.yaml

+            channel: ${{ secrets.SLACK_CHANNEL_ID }}
+            text: |
+              :x: *Forward Compatibility Test Failed for GPU Operator*
+
+              *Workflow:* ${{ github.workflow }}
+              *Repository:* ${{ github.repository }}
+              *Trigger:* ${{ github.event_name }}
+
+              *Tested Components:*
+              Download `values-overrides` artifact to see tested component versions
+
+              *Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>
+              <@S095E7BNGJU> 


The Slack notification payload format is incorrect. The action expects a JSON payload, but this is using YAML format with a nested structure. The correct format should use JSON with the payload structured as: {"channel": "${{ secrets.SLACK_CHANNEL_ID }}", "text": "message"}. The current format will cause the Slack notification to fail.

Suggested change

channel: ${{ secrets.SLACK_CHANNEL_ID }}

text: |

:x: *Forward Compatibility Test Failed for GPU Operator*

*Workflow:* ${{ github.workflow }}

*Repository:* ${{ github.repository }}

*Trigger:* ${{ github.event_name }}

*Tested Components:*

Download `values-overrides` artifact to see tested component versions

*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>

<@S095E7BNGJU>

{"channel": "${{ secrets.SLACK_CHANNEL_ID }}", "text": ":x: *Forward Compatibility Test Failed for GPU Operator*\n\n*Workflow:* ${{ github.workflow }}\n*Repository:* ${{ github.repository }}\n*Trigger:* ${{ github.event_name }}\n\n*Tested Components:*\nDownload `values-overrides` artifact to see tested component versions\n\n*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>\n<@S095E7BNGJU>"}

Copilot · 2026-02-05T11:01:11Z

tests/scripts/install-operator.sh

+	# Clear individual options since we're using values file
+	TOOLKIT_CONTAINER_OPTIONS=""
+	DEVICE_PLUGIN_OPTIONS=""
+	MIG_MANAGER_OPTIONS=""


When USE_VALUES_FILE is true, OPERATOR_OPTIONS is cleared/never set, but it's still being appended to in the vGPU section (lines 81-92). The vGPU driver options added to OPERATOR_OPTIONS won't be applied when the values file approach is used (line 101-104), causing vGPU configurations to be silently ignored. The vGPU options should be added to the values file instead when USE_VALUES_FILE is true.

Suggested change

# Clear individual options since we're using values file

TOOLKIT_CONTAINER_OPTIONS=""

DEVICE_PLUGIN_OPTIONS=""

MIG_MANAGER_OPTIONS=""

# Ensure individual options are defined; avoid clobbering any existing values

: ${TOOLKIT_CONTAINER_OPTIONS:=""}

: ${DEVICE_PLUGIN_OPTIONS:=""}

: ${MIG_MANAGER_OPTIONS:=""}

Copilot · 2026-02-05T11:01:12Z

.github/workflows/e2e-tests.yaml

+  variables:
+    uses: ./.github/workflows/variables.yaml
+    with:
+      operator_image: ${{ inputs.operator_image }}
+      operator_version: ${{ inputs.operator_version }}


When the workflow is triggered by a push event (lines 18-22), the inputs object is undefined, which will cause the variables workflow call to receive empty/undefined values. This will break the e2e tests when run from push events. Push events should either not call the variables workflow with inputs, or the workflow should handle the undefined inputs case properly.

Copilot · 2026-02-05T11:01:12Z

.github/workflows/forward-compatibility.yaml

+  notify-failure:
+    runs-on: ubuntu-latest
+    needs: [fetch-latest-images, run-e2e-tests]
+    if: ${{ failure() }}


The notify-failure job will run only when ANY of the previous jobs fail (failure() checks all needed jobs). However, if fetch-latest-images fails, the run-e2e-tests job will be skipped (not failed). The current condition if: ${{ failure() }} will trigger even if only fetch-latest-images fails, which may not be the intended behavior. Consider using if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') }} for more precise control.

Suggested change

if: ${{ failure() }}

if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') }}

Copilot · 2026-02-05T11:01:12Z

.github/workflows/golang-checks.yaml

+        with:
+          go-version: ${{ env.GOLANG_VERSION }}
+      - name: Lint
+        uses: golangci/golangci-lint-action@v8


The golangci-lint-action version has been downgraded from v9 (in the old CI workflow) to v8 (in the new golang-checks workflow). This may introduce compatibility issues or missing features. Unless there's a specific reason for the downgrade, it should be kept at v9 for consistency.

Suggested change

uses: golangci/golangci-lint-action@v8

uses: golangci/golangci-lint-action@v9

Copilot · 2026-02-05T11:01:13Z

.github/workflows/forward-compatibility.yaml

+              Download `values-overrides` artifact to see tested component versions
+
+              *Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>
+              <@S095E7BNGJU> 


The Slack notification contains a hardcoded user ID <@S095E7BNGJU>. This appears to be a specific user mention that will only notify one person. Consider making this configurable via a secret or input parameter, or documenting who this user is and why they're hardcoded. If this is meant to notify a team or channel, use a more appropriate mechanism.

Suggested change

<@S095E7BNGJU>

${{ secrets.SLACK_MENTION }}

Copilot · 2026-02-05T11:01:13Z

.github/workflows/variables.yaml

+          # Determine if we should push images
+          PUSH_ON_BUILD="false"
+          if [[ "${{ github.actor }}" != "dependabot[bot]" ]]; then
+            if [[ "${{ github.event_name }}" == "pull_request" && "${{ github.event.pull_request.head.repo.full_name }}" == "${{ github.repository }}" ]]; then
+              PUSH_ON_BUILD="true"
+            elif [[ "${{ github.event_name }}" == "push" ]]; then
+              PUSH_ON_BUILD="true"
+            elif [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
+              PUSH_ON_BUILD="true"
+            fi


When the variables workflow is called from the ci.yaml push event (line 30-31 in ci.yaml), the github.event_name will be "push" inside the variables workflow, not "workflow_call". Line 84 checks for workflow_dispatch, but that won't match when called as a reusable workflow. The PUSH_ON_BUILD logic may not work correctly. Consider checking github.event_name == 'workflow_call' or relying solely on the calling workflow's event type.

Copilot · 2026-02-05T11:01:14Z

tests/scripts/install-operator.sh

+		# Create a combined values file
+		COMBINED_VALUES=$(mktemp)
+		cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"


Simply concatenating YAML files with cat does not properly merge them. If both files define the same keys (e.g., both define toolkit:), the resulting YAML will have duplicate keys, which may cause unpredictable behavior or errors when parsed by Helm. Consider using a proper YAML merge tool like yq to merge the values files, or ensure that the files don't have overlapping keys.

Suggested change

# Create a combined values file

COMBINED_VALUES=$(mktemp)

cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"

# Create a combined values file by merging YAML documents

COMBINED_VALUES=$(mktemp)

if command -v yq >/dev/null 2>&1; then

yq ea '. as $item ireduce ({}; . * $item )' "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"

else

echo "Error: yq is required to merge YAML values files but was not found in PATH." >&2

exit 1

fi

Copilot · 2026-02-05T11:01:14Z

.github/workflows/forward-compatibility.yaml

+          TOOLKIT=$(.github/scripts/get-latest-images.sh toolkit)
+
+          echo "::notice::Fetching latest device-plugin image..."
+          DEVICE_PLUGIN=$(.github/scripts/get-latest-images.sh device-plugin)
+
+          echo "::notice::Fetching latest mig-manager image..."
+          MIG_MANAGER=$(.github/scripts/get-latest-images.sh mig-manager)
+
+          # Generate values override file
+          .github/scripts/generate-values-overrides.sh \
+            values-overrides.yaml \
+            "${TOOLKIT}" \
+            "${DEVICE_PLUGIN}" \
+            "${MIG_MANAGER}"


The new shell scripts (get-latest-images.sh, generate-values-overrides.sh) are being executed directly in the workflow without a chmod command. These files must be committed with executable permissions set (chmod +x), otherwise the workflow will fail with a "Permission denied" error. Ensure these files have the executable bit set in git before merging.

Copilot · 2026-02-05T11:01:14Z

.github/workflows/e2e-tests.yaml

+  push:
+    branches:
+      - "pull-request/[0-9]+"
+      - main
+      - release-*


When e2e-tests.yaml is triggered via push event (lines 18-22), secrets are not inherited automatically and the workflow will fail when trying to access AWS_ACCESS_KEY_ID and other required secrets. The workflow expects these secrets to be passed explicitly (as done in ci.yaml line 57 with secrets: inherit), but when triggered directly via push, no secrets context exists. Either remove the push trigger or make secrets available for push events.

Suggested change

push:

branches:

- "pull-request/[0-9]+"

- main

- release-*

Break the monolithic ci.yaml into focused, reusable workflow files: code-scanning, config-checks, golang-checks, image-builds, and release. Centralize shared variables in variables.yaml. Standardize copyright headers across all workflow files. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Add e2e-tests.yaml reusable workflow for end-to-end GPU operator testing. Introduce env-to-values.sh to convert environment variables to Helm values overrides. Update install-operator.sh to use yq-based YAML merging for component image configuration. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Add forward-compatibility.yaml workflow that runs weekly to test the GPU operator against latest upstream component images (toolkit, device-plugin, mig-manager). Includes get-latest-images.sh with retry/backoff for image verification and generate-values-overrides.sh for Helm values generation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

ArangoGutierrez · 2026-02-05T20:21:39Z

@rajathagasthya @cdesiniotis PTAL

Copilot

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-05T20:29:30Z

tests/scripts/install-operator.sh

+			echo "Warning: yq not found, falling back to concatenation (may have issues with duplicate keys)" >&2
+			cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"


When yq is not available, the script falls back to simple concatenation of YAML files which can produce invalid YAML if there are duplicate top-level keys. While this provides a fallback, consider either:

Making yq a required dependency and failing if it's not found

Documenting this limitation more clearly

Implementing a more robust merge strategy that at least detects and warns about duplicate keys

The current fallback might lead to silent failures that are difficult to debug.

Suggested change

echo "Warning: yq not found, falling back to concatenation (may have issues with duplicate keys)" >&2

cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"

echo "Error: 'yq' command not found. It is required to safely merge values files." >&2

echo "Please install yq (https://mikefarah.gitbook.io/yq/) and re-run this script." >&2

exit 1

Copilot · 2026-02-05T20:29:31Z

.github/workflows/e2e-tests.yaml

+      operator_version:
+        description: 'Operator version to test (override)'
+        required: false
+        type: string


The use_values_override input is available for workflow_call but not for workflow_dispatch. This inconsistency means that users who manually trigger the E2E Tests workflow via workflow_dispatch won't be able to use the values override feature, even though the workflow logic supports it. Consider adding this input to the workflow_dispatch section for consistency.

Suggested change

type: string

type: string

use_values_override:

description: 'Use values-overrides artifact for component image configuration'

required: false

type: boolean

default: false

Copilot · 2026-02-05T20:29:31Z

.github/workflows/e2e-tests.yaml

+      - name: Run e2e tests
+        env:
+          OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }}
+          OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }}
+          GPU_PRODUCT_NAME: "Tesla-T4"
+          SKIP_LAUNCH: "true"
+          CONTAINER_RUNTIME: "containerd"
+          TEST_CASE: "./tests/cases/defaults.sh"
+        run: |
+          ./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$?
+          ./tests/scripts/pull.sh /tmp/logs logs
+          exit $rc


The VALUES_FILE environment variable is set in step "Copy values override file to remote" but is not explicitly included in the env block of the "Run e2e tests" step. While variables set with $GITHUB_ENV persist across steps, it's more explicit and maintainable to include VALUES_FILE in the env block of the test step, similar to how other environment variables like OPERATOR_VERSION and OPERATOR_IMAGE are specified. This makes the dependencies clearer and prevents potential issues if the step execution context changes.

Copilot · 2026-02-05T20:29:31Z

.github/workflows/e2e-tests.yaml

+      - name: Run e2e tests
+        env:
+          OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }}
+          OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }}
+          GPU_PRODUCT_NAME: "Tesla-T4"
+          SKIP_LAUNCH: "true"
+          CONTAINER_RUNTIME: "containerd"
+          TEST_CASE: "./tests/cases/nvidia-driver.sh"
+        run: |
+          ./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$?
+          ./tests/scripts/pull.sh /tmp/logs logs
+          exit $rc


The VALUES_FILE environment variable is set in step "Copy values override file to remote" but is not explicitly included in the env block of the "Run e2e tests" step. While variables set with $GITHUB_ENV persist across steps, it's more explicit and maintainable to include VALUES_FILE in the env block of the test step, similar to how other environment variables like OPERATOR_VERSION and OPERATOR_IMAGE are specified. This makes the dependencies clearer and prevents potential issues if the step execution context changes.

Copilot · 2026-02-05T20:29:32Z

.github/workflows/forward-compatibility.yaml

+      - name: Get latest component images and generate values override file
+        env:
+          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          # Fetch latest images from component repositories
+          echo "::notice::Fetching latest container-toolkit image..."
+          TOOLKIT=$(.github/scripts/get-latest-images.sh toolkit)
+
+          echo "::notice::Fetching latest device-plugin image..."
+          DEVICE_PLUGIN=$(.github/scripts/get-latest-images.sh device-plugin)
+
+          echo "::notice::Fetching latest mig-manager image..."
+          MIG_MANAGER=$(.github/scripts/get-latest-images.sh mig-manager)
+
+          # Generate values override file
+          .github/scripts/generate-values-overrides.sh \
+            values-overrides.yaml \
+            "${TOOLKIT}" \
+            "${DEVICE_PLUGIN}" \
+            "${MIG_MANAGER}"


The get-latest-images.sh script requires jq to parse JSON responses from the GitHub API, but there's no installation step for jq in the fetch-latest-images job. While jq is pre-installed on ubuntu-latest runners, this creates an implicit dependency that isn't documented. Consider either:

Adding an explicit jq installation step for clarity and robustness

Adding a comment noting the dependency on the ubuntu-latest pre-installed tools

Adding a check at the beginning of the script similar to the regctl check (lines 25-29)

Copilot · 2026-02-05T20:29:32Z

.github/workflows/e2e-tests.yaml

+          name: values-overrides
+          path: ${{ github.workspace }}
+      - name: Set up Holodeck
+        uses: NVIDIA/holodeck@v0.2.17


The Holodeck version has been downgraded from v0.2.18 (used in the original ci.yaml) to v0.2.17. This should match the version used in the e2e-tests-containerd job for consistency. Unless this is intentional for a specific reason, the version should be updated to v0.2.18 to match the original CI workflow.

Suggested change

uses: NVIDIA/holodeck@v0.2.17

uses: NVIDIA/holodeck@v0.2.18

Copilot · 2026-02-05T20:29:32Z

.github/workflows/forward-compatibility.yaml

+      - name: Send Slack alert notification
+        uses: slackapi/slack-github-action@v2.1.1
+        with:
+          method: chat.postMessage
+          token: ${{ secrets.SLACK_BOT_TOKEN }}
+          payload: |
+            {
+              "channel": "${{ secrets.SLACK_CHANNEL_ID }}",
+              "text": ":x: *Forward Compatibility Test Failed for GPU Operator*\n\n*Workflow:* ${{ github.workflow }}\n*Repository:* ${{ github.repository }}\n*Trigger:* ${{ github.event_name }}\n\n*Tested Components:*\nDownload `values-overrides` artifact to see tested component versions\n\n*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>\n\n${{ secrets.SLACK_MENTION_LIST }}"
+            }


The Slack notification step will fail if SLACK_BOT_TOKEN or SLACK_CHANNEL_ID secrets are not configured in the repository. While the secrets are marked as optional in the e2e-tests.yaml workflow definition (lines 38-41), the Slack action will still fail if they're missing or empty when the notify-failure job runs.

Consider adding a condition to check if the secrets are available before attempting to send the notification, such as:
if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') && secrets.SLACK_BOT_TOKEN != '' }}

Copilot · 2026-02-05T20:29:33Z

.github/workflows/release.yaml

+      operator_version:
+        required: true
+        type: string


The operator_version input is defined in the workflow_call inputs (line 27) but is never used in the workflow. The variables job only uses commit_short_sha and operator_image_base inputs. Either this input should be removed if it's not needed, or it should be used appropriately in the workflow logic.

Suggested change

operator_version:

required: true

type: string

Copilot · 2026-02-05T20:29:33Z

.github/workflows/e2e-tests.yaml

+  e2e-tests-containerd:
+    needs: [variables]
+    runs-on: linux-amd64-cpu4
+    permissions:
+      contents: read
+      id-token: write
+    steps:
+      - uses: actions/checkout@v6
+        name: Check out code
+      - name: Download values override file
+        if: ${{ inputs.use_values_override }}
+        uses: actions/download-artifact@v6
+        with:
+          name: values-overrides
+          path: ${{ github.workspace }}
+      - name: Set up Holodeck
+        uses: NVIDIA/holodeck@v0.2.17
+        with:
+          aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws_ssh_key: ${{ secrets.AWS_SSH_KEY }}
+          holodeck_config: "tests/holodeck.yaml"
+      - name: Get public dns name
+        id: get_public_dns_name
+        uses: mikefarah/yq@master
+        with:
+          cmd: yq '.status.properties[] | select(.name == "public-dns-name") | .value' /github/workspace/.cache/holodeck.yaml
+      - name: Set test environment
+        run: |
+          echo "instance_hostname=ubuntu@${{ steps.get_public_dns_name.outputs.result }}" >> $GITHUB_ENV
+          echo "private_key=${{ github.workspace }}/key.pem" >> $GITHUB_ENV
+      - name: Write SSH key
+        run: |
+          echo "${{ secrets.AWS_SSH_KEY }}" > ${private_key} && chmod 400 ${private_key}
+      - name: Copy values override file to remote
+        if: ${{ inputs.use_values_override }}
+        run: |
+          scp -i ${private_key} -o StrictHostKeyChecking=no \
+            ${{ github.workspace }}/values-overrides.yaml \
+            ${instance_hostname}:/tmp/values-overrides.yaml
+          echo "VALUES_FILE=/tmp/values-overrides.yaml" >> $GITHUB_ENV
+      - name: Run e2e tests
+        env:
+          OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }}
+          OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }}
+          GPU_PRODUCT_NAME: "Tesla-T4"
+          SKIP_LAUNCH: "true"
+          CONTAINER_RUNTIME: "containerd"
+          TEST_CASE: "./tests/cases/defaults.sh"
+        run: |
+          ./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$?
+          ./tests/scripts/pull.sh /tmp/logs logs
+          exit $rc
+      - name: Archive test logs
+        if: ${{ failure() }}
+        uses: actions/upload-artifact@v6
+        with:
+          name: containerd-e2e-test-logs
+          path: ./logs/
+          retention-days: 15
+
+  e2e-tests-nvidiadriver:
+    needs: [variables]
+    runs-on: linux-amd64-cpu4
+    permissions:
+      contents: read
+      id-token: write
+    steps:
+      - uses: actions/checkout@v6
+        name: Check out code
+      - name: Download values override file
+        if: ${{ inputs.use_values_override }}
+        uses: actions/download-artifact@v6
+        with:
+          name: values-overrides
+          path: ${{ github.workspace }}
+      - name: Set up Holodeck
+        uses: NVIDIA/holodeck@v0.2.17
+        with:
+          aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
+          aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
+          aws_ssh_key: ${{ secrets.AWS_SSH_KEY }}
+          holodeck_config: "tests/holodeck.yaml"
+      - name: Get public dns name
+        id: get_public_dns_name
+        uses: mikefarah/yq@master
+        with:
+          cmd: yq '.status.properties[] | select(.name == "public-dns-name") | .value' /github/workspace/.cache/holodeck.yaml
+      - name: Set test environment
+        run: |
+          echo "instance_hostname=ubuntu@${{ steps.get_public_dns_name.outputs.result }}" >> $GITHUB_ENV
+          echo "private_key=${{ github.workspace }}/key.pem" >> $GITHUB_ENV
+      - name: Write SSH key
+        run: |
+          echo "${{ secrets.AWS_SSH_KEY }}" > ${private_key} && chmod 400 ${private_key}
+      - name: Copy values override file to remote
+        if: ${{ inputs.use_values_override }}
+        run: |
+          scp -i ${private_key} -o StrictHostKeyChecking=no \
+            ${{ github.workspace }}/values-overrides.yaml \
+            ${instance_hostname}:/tmp/values-overrides.yaml
+          echo "VALUES_FILE=/tmp/values-overrides.yaml" >> $GITHUB_ENV
+      - name: Run e2e tests
+        env:
+          OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }}
+          OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }}
+          GPU_PRODUCT_NAME: "Tesla-T4"
+          SKIP_LAUNCH: "true"
+          CONTAINER_RUNTIME: "containerd"
+          TEST_CASE: "./tests/cases/nvidia-driver.sh"
+        run: |
+          ./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$?
+          ./tests/scripts/pull.sh /tmp/logs logs
+          exit $rc
+      - name: Archive test logs
+        if: ${{ failure() }}
+        uses: actions/upload-artifact@v6
+        with:
+          name: nvidiadriver-e2e-test-logs
+          path: ./logs/
+          retention-days: 15


The E2E test workflow only runs two test cases: defaults.sh and nvidia-driver.sh. There is also an experimental-runtime.sh test case in the tests/cases directory that is not being executed. Consider whether this test case should also be included in the CI pipeline for more comprehensive test coverage, or document why it's excluded.

Copilot · 2026-02-05T20:29:33Z

.github/workflows/forward-compatibility.yaml

+  run-e2e-tests:
+    needs: [fetch-latest-images]
+    uses: ./.github/workflows/e2e-tests.yaml
+    with:
+      operator_image: ghcr.io/nvidia/gpu-operator
+      operator_version: main-latest
+      use_values_override: true
+    secrets: inherit


Artifacts uploaded in the fetch-latest-images job will not be accessible to the reusable e2e-tests.yaml workflow because artifacts are not shared across workflow boundaries when using workflow_call. The download-artifact action in e2e-tests.yaml will fail to find the values-overrides artifact.

To fix this, you need to either:

Move the artifact generation logic into the e2e-tests.yaml workflow itself

Pass the component images as inputs and regenerate the values file in e2e-tests.yaml

Use outputs to pass the image names instead of artifacts

ArangoGutierrez self-assigned this Nov 10, 2025

ArangoGutierrez requested review from cdesiniotis, elezar, shivamerla and tariq1890 as code owners November 10, 2025 17:43

ArangoGutierrez force-pushed the integration-tests branch 2 times, most recently from d55dd82 to 9dab00e Compare November 10, 2025 17:54

ArangoGutierrez requested a review from Copilot November 10, 2025 19:29

Copilot AI reviewed Nov 10, 2025

View reviewed changes

ArangoGutierrez force-pushed the integration-tests branch 3 times, most recently from 84a0c31 to 54d6fc9 Compare November 11, 2025 13:27

ArangoGutierrez requested review from rahulait and rajathagasthya November 19, 2025 18:10

rajathagasthya requested a review from karthikvetrivel November 26, 2025 19:40

cdesiniotis reviewed Nov 26, 2025

View reviewed changes

ArangoGutierrez force-pushed the integration-tests branch from 54d6fc9 to 203941a Compare December 3, 2025 11:00

ArangoGutierrez requested a review from cdesiniotis December 3, 2025 11:01

rajathagasthya reviewed Feb 4, 2026

View reviewed changes

ArangoGutierrez requested a review from Copilot February 5, 2026 10:50

Copilot started reviewing on behalf of ArangoGutierrez February 5, 2026 10:50 View session

Copilot AI reviewed Feb 5, 2026

View reviewed changes

ArangoGutierrez added 3 commits February 5, 2026 21:17

ArangoGutierrez force-pushed the integration-tests branch from 203941a to 764a66e Compare February 5, 2026 20:19

ArangoGutierrez requested a review from rajathagasthya February 5, 2026 20:21

ArangoGutierrez requested a review from Copilot February 5, 2026 20:21

Copilot started reviewing on behalf of ArangoGutierrez February 5, 2026 20:21 View session

Copilot AI reviewed Feb 5, 2026

View reviewed changes

		@@ -0,0 +1,74 @@
		#!/usr/bin/env bash

		# Copyright 2024 NVIDIA CORPORATION

	if: ${{ failure() }}
	if: ${{ always() && (needs.fetch-latest-images.result == 'failure' \|\| needs.run-e2e-tests.result == 'failure') }}

	uses: golangci/golangci-lint-action@v8
	uses: golangci/golangci-lint-action@v9

		echo "Warning: yq not found, falling back to concatenation (may have issues with duplicate keys)" >&2
		cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"

Conversation

ArangoGutierrez commented Nov 10, 2025

Uh oh!

github-advanced-security bot commented Nov 10, 2025

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ArangoGutierrez commented Nov 26, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ArangoGutierrez commented Feb 4, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

ArangoGutierrez commented Feb 5, 2026

Uh oh!