Skip to content

Add weekly forward compatibility testing#1884

Open
ArangoGutierrez wants to merge 3 commits intoNVIDIA:mainfrom
ArangoGutierrez:integration-tests
Open

Add weekly forward compatibility testing#1884
ArangoGutierrez wants to merge 3 commits intoNVIDIA:mainfrom
ArangoGutierrez:integration-tests

Conversation

@ArangoGutierrez
Copy link
Collaborator

Implement automated forward compatibility tests that validate GPU Operator
against the latest published images from NVIDIA component repositories.

Changes:

  • Add forward-compatibility.yaml workflow (weekly + manual trigger)
  • Create get-latest-images.sh to fetch latest commit-based images
  • Extend e2e-tests.yaml with optional component image inputs
  • Update install-operator.sh to support device-plugin and mig-manager overrides
  • Add Slack notifications for test failures
  • Create variables.yaml as reusable workflow for shared CI variables

Components tested:

  • ghcr.io/nvidia/container-toolkit
  • ghcr.io/nvidia/k8s-device-plugin
  • ghcr.io/nvidia/k8s-mig-manager

We could add other operands later, but I wanted to start with the core ones.

Schedule: Every Monday at 2 AM UTC

@github-advanced-security
Copy link

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@ArangoGutierrez ArangoGutierrez force-pushed the integration-tests branch 2 times, most recently from d55dd82 to 9dab00e Compare November 10, 2025 17:54
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements automated weekly forward compatibility testing for the GPU Operator against the latest published component images from NVIDIA repositories. The changes refactor the existing monolithic CI workflow into reusable workflow modules and add forward compatibility testing infrastructure.

Key changes:

  • Refactored CI workflows into separate, reusable workflow files (variables, golang-checks, config-checks, image-builds, e2e-tests, release)
  • Added forward-compatibility.yaml workflow with scheduled weekly runs and manual trigger support
  • Created get-latest-images.sh script to fetch latest commit-based images from component repositories
  • Extended install-operator.sh to support device-plugin and mig-manager image overrides
  • Added Slack notifications for test failures

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
.github/workflows/forward-compatibility.yaml New workflow for weekly forward compatibility testing with Slack notifications
.github/workflows/ci.yaml Refactored to use reusable workflows instead of monolithic job definitions
.github/workflows/variables.yaml New reusable workflow for shared CI variables
.github/workflows/e2e-tests.yaml New reusable e2e test workflow with optional component image inputs
.github/workflows/image-builds.yaml New reusable workflow for GPU operator image builds
.github/workflows/release.yaml New reusable workflow for release processes
.github/workflows/golang-checks.yaml New reusable workflow for Go checks and tests
.github/workflows/config-checks.yaml New reusable workflow for configuration validation
.github/workflows/code-scanning.yaml New reusable workflow for CodeQL scanning
.github/scripts/get-latest-images.sh Script to fetch latest component images from NVIDIA repositories
tests/scripts/install-operator.sh Extended to support device-plugin and mig-manager image overrides

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@ArangoGutierrez ArangoGutierrez force-pushed the integration-tests branch 3 times, most recently from 84a0c31 to 54d6fc9 Compare November 11, 2025 13:27
@ArangoGutierrez
Copy link
Collaborator Author

PING @rahulait @rajathagasthya @tariq1890

env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
# Use workflow_dispatch inputs if provided, otherwise fetch latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: when would someone set these variables as inputs? If the scope of this workflow is to always test the top-of-tree images for all operands, then maybe we can remove the input variables and move the entire run: block into a shell script? Thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My intention is to allow manual triggering of the workflow in case we want to test a very specific image without having to wait for the "weekly" run. But I'll set it in a way that even during manual triggers, we will simply fetch all the latests images from operands.

Comment on lines 101 to 103
toolkit_image: ${{ needs.fetch-latest-images.outputs.toolkit_image }}
device_plugin_image: ${{ needs.fetch-latest-images.outputs.device_plugin_image }}
mig_manager_image: ${{ needs.fetch-latest-images.outputs.mig_manager_image }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the current tests/scripts/install-operator.sh has separate input variables for the various operand images, but would it simplify our CI definition if we passed a values override file instead? Just an idea -- the previous step fetch-latest-images can craft the values override file that points to all the latest operand images. This file is passed as input to the run-e2e-tests step which launches our e2e test script with this overrides file. Thoughts? The reason I am suggesting this is because I see a lot of boilerplate in forward-compatibility.yaml and e2e-tests.yaml for declaring all of these input variables.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a stab at this comment, please let me know what you think

Comment on lines 22 to 33
toolkit_image:
description: 'Override container-toolkit image'
required: false
type: string
device_plugin_image:
description: 'Override device-plugin image'
required: false
type: string
mig_manager_image:
description: 'Override mig-manager image'
required: false
type: string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do these variables need to be inputs? Aren't we always resolving the images in this workflow itself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment on lines 54 to 56
with:
operator_image: ${{ needs.variables.outputs.operator_image_base }}
operator_version: ${{ needs.variables.outputs.operator_version }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- could we potentially move this to the variables workflow itself? As in, make operator_image and operator_version as outputs of the variables worflow?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, done!

Comment on lines +63 to +66
with:
commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }}
operator_version: ${{ needs.variables.outputs.operator_version }}
operator_image_base: ${{ needs.variables.outputs.operator_image_base }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above -- is this really needed? Can we move these variables as outputs of the variables workflow itself?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, done!

Comment on lines 80 to 85
with:
operator_image: ${{ inputs.operator_image }}
operator_version: ${{ inputs.operator_version }}
toolkit_image: ${{ inputs.toolkit_image }}
device_plugin_image: ${{ inputs.device_plugin_image }}
mig_manager_image: ${{ inputs.mig_manager_image }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for passing these variables as input to the variables workflow?

@ArangoGutierrez
Copy link
Collaborator Author

@@ -0,0 +1,74 @@
#!/usr/bin/env bash

# Copyright 2024 NVIDIA CORPORATION
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove year.

echo "instance_hostname=ubuntu@${{ steps.get_public_dns_name.outputs.result }}" >> $GITHUB_ENV
echo "private_key=${{ github.workspace }}/key.pem" >> $GITHUB_ENV
if [[ "${{ inputs.use_values_override }}" == "true" ]]; then
echo "VALUES_FILE=${{ github.workspace }}/values-overrides.yaml" >> $GITHUB_ENV
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be missing something, but how does the remote instance know about VALUES_FILE? Do we need to pass that environment variable here?

NGC_API_KEY="${NGC_API_KEY}" \

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 88 to 100
channel: ${{ secrets.SLACK_CHANNEL_ID }}
text: |
:x: *Forward Compatibility Test Failed for GPU Operator*

*Workflow:* ${{ github.workflow }}
*Repository:* ${{ github.repository }}
*Trigger:* ${{ github.event_name }}

*Tested Components:*
Download `values-overrides` artifact to see tested component versions

*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>
<@S095E7BNGJU>
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Slack notification payload format is incorrect. The action expects a JSON payload, but this is using YAML format with a nested structure. The correct format should use JSON with the payload structured as: {"channel": "${{ secrets.SLACK_CHANNEL_ID }}", "text": "message"}. The current format will cause the Slack notification to fail.

Suggested change
channel: ${{ secrets.SLACK_CHANNEL_ID }}
text: |
:x: *Forward Compatibility Test Failed for GPU Operator*
*Workflow:* ${{ github.workflow }}
*Repository:* ${{ github.repository }}
*Trigger:* ${{ github.event_name }}
*Tested Components:*
Download `values-overrides` artifact to see tested component versions
*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>
<@S095E7BNGJU>
{"channel": "${{ secrets.SLACK_CHANNEL_ID }}", "text": ":x: *Forward Compatibility Test Failed for GPU Operator*\n\n*Workflow:* ${{ github.workflow }}\n*Repository:* ${{ github.repository }}\n*Trigger:* ${{ github.event_name }}\n\n*Tested Components:*\nDownload `values-overrides` artifact to see tested component versions\n\n*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>\n<@S095E7BNGJU>"}

Copilot uses AI. Check for mistakes.
Comment on lines +55 to +66
# Clear individual options since we're using values file
TOOLKIT_CONTAINER_OPTIONS=""
DEVICE_PLUGIN_OPTIONS=""
MIG_MANAGER_OPTIONS=""
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When USE_VALUES_FILE is true, OPERATOR_OPTIONS is cleared/never set, but it's still being appended to in the vGPU section (lines 81-92). The vGPU driver options added to OPERATOR_OPTIONS won't be applied when the values file approach is used (line 101-104), causing vGPU configurations to be silently ignored. The vGPU options should be added to the values file instead when USE_VALUES_FILE is true.

Suggested change
# Clear individual options since we're using values file
TOOLKIT_CONTAINER_OPTIONS=""
DEVICE_PLUGIN_OPTIONS=""
MIG_MANAGER_OPTIONS=""
# Ensure individual options are defined; avoid clobbering any existing values
: ${TOOLKIT_CONTAINER_OPTIONS:=""}
: ${DEVICE_PLUGIN_OPTIONS:=""}
: ${MIG_MANAGER_OPTIONS:=""}

Copilot uses AI. Check for mistakes.
Comment on lines +59 to +58
variables:
uses: ./.github/workflows/variables.yaml
with:
operator_image: ${{ inputs.operator_image }}
operator_version: ${{ inputs.operator_version }}
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the workflow is triggered by a push event (lines 18-22), the inputs object is undefined, which will cause the variables workflow call to receive empty/undefined values. This will break the e2e tests when run from push events. Push events should either not call the variables workflow with inputs, or the workflow should handle the undefined inputs case properly.

Copilot uses AI. Check for mistakes.
notify-failure:
runs-on: ubuntu-latest
needs: [fetch-latest-images, run-e2e-tests]
if: ${{ failure() }}
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The notify-failure job will run only when ANY of the previous jobs fail (failure() checks all needed jobs). However, if fetch-latest-images fails, the run-e2e-tests job will be skipped (not failed). The current condition if: ${{ failure() }} will trigger even if only fetch-latest-images fails, which may not be the intended behavior. Consider using if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') }} for more precise control.

Suggested change
if: ${{ failure() }}
if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') }}

Copilot uses AI. Check for mistakes.
with:
go-version: ${{ env.GOLANG_VERSION }}
- name: Lint
uses: golangci/golangci-lint-action@v8
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The golangci-lint-action version has been downgraded from v9 (in the old CI workflow) to v8 (in the new golang-checks workflow). This may introduce compatibility issues or missing features. Unless there's a specific reason for the downgrade, it should be kept at v9 for consistency.

Suggested change
uses: golangci/golangci-lint-action@v8
uses: golangci/golangci-lint-action@v9

Copilot uses AI. Check for mistakes.
Download `values-overrides` artifact to see tested component versions

*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>
<@S095E7BNGJU>
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Slack notification contains a hardcoded user ID <@S095E7BNGJU>. This appears to be a specific user mention that will only notify one person. Consider making this configurable via a secret or input parameter, or documenting who this user is and why they're hardcoded. If this is meant to notify a team or channel, use a more appropriate mechanism.

Suggested change
<@S095E7BNGJU>
${{ secrets.SLACK_MENTION }}

Copilot uses AI. Check for mistakes.
Comment on lines +77 to +88
# Determine if we should push images
PUSH_ON_BUILD="false"
if [[ "${{ github.actor }}" != "dependabot[bot]" ]]; then
if [[ "${{ github.event_name }}" == "pull_request" && "${{ github.event.pull_request.head.repo.full_name }}" == "${{ github.repository }}" ]]; then
PUSH_ON_BUILD="true"
elif [[ "${{ github.event_name }}" == "push" ]]; then
PUSH_ON_BUILD="true"
elif [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
PUSH_ON_BUILD="true"
fi
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the variables workflow is called from the ci.yaml push event (line 30-31 in ci.yaml), the github.event_name will be "push" inside the variables workflow, not "workflow_call". Line 84 checks for workflow_dispatch, but that won't match when called as a reusable workflow. The PUSH_ON_BUILD logic may not work correctly. Consider checking github.event_name == 'workflow_call' or relying solely on the calling workflow's event type.

Copilot uses AI. Check for mistakes.
Comment on lines 47 to 49
# Create a combined values file
COMBINED_VALUES=$(mktemp)
cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simply concatenating YAML files with cat does not properly merge them. If both files define the same keys (e.g., both define toolkit:), the resulting YAML will have duplicate keys, which may cause unpredictable behavior or errors when parsed by Helm. Consider using a proper YAML merge tool like yq to merge the values files, or ensure that the files don't have overlapping keys.

Suggested change
# Create a combined values file
COMBINED_VALUES=$(mktemp)
cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"
# Create a combined values file by merging YAML documents
COMBINED_VALUES=$(mktemp)
if command -v yq >/dev/null 2>&1; then
yq ea '. as $item ireduce ({}; . * $item )' "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"
else
echo "Error: yq is required to merge YAML values files but was not found in PATH." >&2
exit 1
fi

Copilot uses AI. Check for mistakes.
Comment on lines +46 to +59
TOOLKIT=$(.github/scripts/get-latest-images.sh toolkit)

echo "::notice::Fetching latest device-plugin image..."
DEVICE_PLUGIN=$(.github/scripts/get-latest-images.sh device-plugin)

echo "::notice::Fetching latest mig-manager image..."
MIG_MANAGER=$(.github/scripts/get-latest-images.sh mig-manager)

# Generate values override file
.github/scripts/generate-values-overrides.sh \
values-overrides.yaml \
"${TOOLKIT}" \
"${DEVICE_PLUGIN}" \
"${MIG_MANAGER}"
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new shell scripts (get-latest-images.sh, generate-values-overrides.sh) are being executed directly in the workflow without a chmod command. These files must be committed with executable permissions set (chmod +x), otherwise the workflow will fail with a "Permission denied" error. Ensure these files have the executable bit set in git before merging.

Copilot uses AI. Check for mistakes.
Comment on lines 18 to 22
push:
branches:
- "pull-request/[0-9]+"
- main
- release-*
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When e2e-tests.yaml is triggered via push event (lines 18-22), secrets are not inherited automatically and the workflow will fail when trying to access AWS_ACCESS_KEY_ID and other required secrets. The workflow expects these secrets to be passed explicitly (as done in ci.yaml line 57 with secrets: inherit), but when triggered directly via push, no secrets context exists. Either remove the push trigger or make secrets available for push events.

Suggested change
push:
branches:
- "pull-request/[0-9]+"
- main
- release-*

Copilot uses AI. Check for mistakes.
Break the monolithic ci.yaml into focused, reusable workflow files:
code-scanning, config-checks, golang-checks, image-builds, and
release. Centralize shared variables in variables.yaml. Standardize
copyright headers across all workflow files.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add e2e-tests.yaml reusable workflow for end-to-end GPU operator
testing. Introduce env-to-values.sh to convert environment variables
to Helm values overrides. Update install-operator.sh to use yq-based
YAML merging for component image configuration.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add forward-compatibility.yaml workflow that runs weekly to test the
GPU operator against latest upstream component images (toolkit,
device-plugin, mig-manager). Includes get-latest-images.sh with
retry/backoff for image verification and generate-values-overrides.sh
for Helm values generation.

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez
Copy link
Collaborator Author

@rajathagasthya @cdesiniotis PTAL

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +55 to +56
echo "Warning: yq not found, falling back to concatenation (may have issues with duplicate keys)" >&2
cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When yq is not available, the script falls back to simple concatenation of YAML files which can produce invalid YAML if there are duplicate top-level keys. While this provides a fallback, consider either:

  1. Making yq a required dependency and failing if it's not found
  2. Documenting this limitation more clearly
  3. Implementing a more robust merge strategy that at least detects and warns about duplicate keys

The current fallback might lead to silent failures that are difficult to debug.

Suggested change
echo "Warning: yq not found, falling back to concatenation (may have issues with duplicate keys)" >&2
cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}"
echo "Error: 'yq' command not found. It is required to safely merge values files." >&2
echo "Please install yq (https://mikefarah.gitbook.io/yq/) and re-run this script." >&2
exit 1

Copilot uses AI. Check for mistakes.
operator_version:
description: 'Operator version to test (override)'
required: false
type: string
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use_values_override input is available for workflow_call but not for workflow_dispatch. This inconsistency means that users who manually trigger the E2E Tests workflow via workflow_dispatch won't be able to use the values override feature, even though the workflow logic supports it. Consider adding this input to the workflow_dispatch section for consistency.

Suggested change
type: string
type: string
use_values_override:
description: 'Use values-overrides artifact for component image configuration'
required: false
type: boolean
default: false

Copilot uses AI. Check for mistakes.
Comment on lines +101 to +112
- name: Run e2e tests
env:
OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }}
OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }}
GPU_PRODUCT_NAME: "Tesla-T4"
SKIP_LAUNCH: "true"
CONTAINER_RUNTIME: "containerd"
TEST_CASE: "./tests/cases/defaults.sh"
run: |
./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$?
./tests/scripts/pull.sh /tmp/logs logs
exit $rc
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VALUES_FILE environment variable is set in step "Copy values override file to remote" but is not explicitly included in the env block of the "Run e2e tests" step. While variables set with $GITHUB_ENV persist across steps, it's more explicit and maintainable to include VALUES_FILE in the env block of the test step, similar to how other environment variables like OPERATOR_VERSION and OPERATOR_IMAGE are specified. This makes the dependencies clearer and prevents potential issues if the step execution context changes.

Copilot uses AI. Check for mistakes.
Comment on lines +162 to +173
- name: Run e2e tests
env:
OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }}
OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }}
GPU_PRODUCT_NAME: "Tesla-T4"
SKIP_LAUNCH: "true"
CONTAINER_RUNTIME: "containerd"
TEST_CASE: "./tests/cases/nvidia-driver.sh"
run: |
./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$?
./tests/scripts/pull.sh /tmp/logs logs
exit $rc
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VALUES_FILE environment variable is set in step "Copy values override file to remote" but is not explicitly included in the env block of the "Run e2e tests" step. While variables set with $GITHUB_ENV persist across steps, it's more explicit and maintainable to include VALUES_FILE in the env block of the test step, similar to how other environment variables like OPERATOR_VERSION and OPERATOR_IMAGE are specified. This makes the dependencies clearer and prevents potential issues if the step execution context changes.

Copilot uses AI. Check for mistakes.
Comment on lines +40 to +59
- name: Get latest component images and generate values override file
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
# Fetch latest images from component repositories
echo "::notice::Fetching latest container-toolkit image..."
TOOLKIT=$(.github/scripts/get-latest-images.sh toolkit)

echo "::notice::Fetching latest device-plugin image..."
DEVICE_PLUGIN=$(.github/scripts/get-latest-images.sh device-plugin)

echo "::notice::Fetching latest mig-manager image..."
MIG_MANAGER=$(.github/scripts/get-latest-images.sh mig-manager)

# Generate values override file
.github/scripts/generate-values-overrides.sh \
values-overrides.yaml \
"${TOOLKIT}" \
"${DEVICE_PLUGIN}" \
"${MIG_MANAGER}"
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get-latest-images.sh script requires jq to parse JSON responses from the GitHub API, but there's no installation step for jq in the fetch-latest-images job. While jq is pre-installed on ubuntu-latest runners, this creates an implicit dependency that isn't documented. Consider either:

  1. Adding an explicit jq installation step for clarity and robustness
  2. Adding a comment noting the dependency on the ubuntu-latest pre-installed tools
  3. Adding a check at the beginning of the script similar to the regctl check (lines 25-29)

Copilot uses AI. Check for mistakes.
name: values-overrides
path: ${{ github.workspace }}
- name: Set up Holodeck
uses: NVIDIA/holodeck@v0.2.17
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Holodeck version has been downgraded from v0.2.18 (used in the original ci.yaml) to v0.2.17. This should match the version used in the e2e-tests-containerd job for consistency. Unless this is intentional for a specific reason, the version should be updated to v0.2.18 to match the original CI workflow.

Suggested change
uses: NVIDIA/holodeck@v0.2.17
uses: NVIDIA/holodeck@v0.2.18

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +91
- name: Send Slack alert notification
uses: slackapi/slack-github-action@v2.1.1
with:
method: chat.postMessage
token: ${{ secrets.SLACK_BOT_TOKEN }}
payload: |
{
"channel": "${{ secrets.SLACK_CHANNEL_ID }}",
"text": ":x: *Forward Compatibility Test Failed for GPU Operator*\n\n*Workflow:* ${{ github.workflow }}\n*Repository:* ${{ github.repository }}\n*Trigger:* ${{ github.event_name }}\n\n*Tested Components:*\nDownload `values-overrides` artifact to see tested component versions\n\n*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>\n\n${{ secrets.SLACK_MENTION_LIST }}"
}
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Slack notification step will fail if SLACK_BOT_TOKEN or SLACK_CHANNEL_ID secrets are not configured in the repository. While the secrets are marked as optional in the e2e-tests.yaml workflow definition (lines 38-41), the Slack action will still fail if they're missing or empty when the notify-failure job runs.

Consider adding a condition to check if the secrets are available before attempting to send the notification, such as:
if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') && secrets.SLACK_BOT_TOKEN != '' }}

Copilot uses AI. Check for mistakes.
Comment on lines +26 to +28
operator_version:
required: true
type: string
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operator_version input is defined in the workflow_call inputs (line 27) but is never used in the workflow. The variables job only uses commit_short_sha and operator_image_base inputs. Either this input should be removed if it's not needed, or it should be used appropriately in the workflow logic.

Suggested change
operator_version:
required: true
type: string

Copilot uses AI. Check for mistakes.
Comment on lines +60 to +180
e2e-tests-containerd:
needs: [variables]
runs-on: linux-amd64-cpu4
permissions:
contents: read
id-token: write
steps:
- uses: actions/checkout@v6
name: Check out code
- name: Download values override file
if: ${{ inputs.use_values_override }}
uses: actions/download-artifact@v6
with:
name: values-overrides
path: ${{ github.workspace }}
- name: Set up Holodeck
uses: NVIDIA/holodeck@v0.2.17
with:
aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws_ssh_key: ${{ secrets.AWS_SSH_KEY }}
holodeck_config: "tests/holodeck.yaml"
- name: Get public dns name
id: get_public_dns_name
uses: mikefarah/yq@master
with:
cmd: yq '.status.properties[] | select(.name == "public-dns-name") | .value' /github/workspace/.cache/holodeck.yaml
- name: Set test environment
run: |
echo "instance_hostname=ubuntu@${{ steps.get_public_dns_name.outputs.result }}" >> $GITHUB_ENV
echo "private_key=${{ github.workspace }}/key.pem" >> $GITHUB_ENV
- name: Write SSH key
run: |
echo "${{ secrets.AWS_SSH_KEY }}" > ${private_key} && chmod 400 ${private_key}
- name: Copy values override file to remote
if: ${{ inputs.use_values_override }}
run: |
scp -i ${private_key} -o StrictHostKeyChecking=no \
${{ github.workspace }}/values-overrides.yaml \
${instance_hostname}:/tmp/values-overrides.yaml
echo "VALUES_FILE=/tmp/values-overrides.yaml" >> $GITHUB_ENV
- name: Run e2e tests
env:
OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }}
OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }}
GPU_PRODUCT_NAME: "Tesla-T4"
SKIP_LAUNCH: "true"
CONTAINER_RUNTIME: "containerd"
TEST_CASE: "./tests/cases/defaults.sh"
run: |
./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$?
./tests/scripts/pull.sh /tmp/logs logs
exit $rc
- name: Archive test logs
if: ${{ failure() }}
uses: actions/upload-artifact@v6
with:
name: containerd-e2e-test-logs
path: ./logs/
retention-days: 15

e2e-tests-nvidiadriver:
needs: [variables]
runs-on: linux-amd64-cpu4
permissions:
contents: read
id-token: write
steps:
- uses: actions/checkout@v6
name: Check out code
- name: Download values override file
if: ${{ inputs.use_values_override }}
uses: actions/download-artifact@v6
with:
name: values-overrides
path: ${{ github.workspace }}
- name: Set up Holodeck
uses: NVIDIA/holodeck@v0.2.17
with:
aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws_ssh_key: ${{ secrets.AWS_SSH_KEY }}
holodeck_config: "tests/holodeck.yaml"
- name: Get public dns name
id: get_public_dns_name
uses: mikefarah/yq@master
with:
cmd: yq '.status.properties[] | select(.name == "public-dns-name") | .value' /github/workspace/.cache/holodeck.yaml
- name: Set test environment
run: |
echo "instance_hostname=ubuntu@${{ steps.get_public_dns_name.outputs.result }}" >> $GITHUB_ENV
echo "private_key=${{ github.workspace }}/key.pem" >> $GITHUB_ENV
- name: Write SSH key
run: |
echo "${{ secrets.AWS_SSH_KEY }}" > ${private_key} && chmod 400 ${private_key}
- name: Copy values override file to remote
if: ${{ inputs.use_values_override }}
run: |
scp -i ${private_key} -o StrictHostKeyChecking=no \
${{ github.workspace }}/values-overrides.yaml \
${instance_hostname}:/tmp/values-overrides.yaml
echo "VALUES_FILE=/tmp/values-overrides.yaml" >> $GITHUB_ENV
- name: Run e2e tests
env:
OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }}
OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }}
GPU_PRODUCT_NAME: "Tesla-T4"
SKIP_LAUNCH: "true"
CONTAINER_RUNTIME: "containerd"
TEST_CASE: "./tests/cases/nvidia-driver.sh"
run: |
./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$?
./tests/scripts/pull.sh /tmp/logs logs
exit $rc
- name: Archive test logs
if: ${{ failure() }}
uses: actions/upload-artifact@v6
with:
name: nvidiadriver-e2e-test-logs
path: ./logs/
retention-days: 15
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The E2E test workflow only runs two test cases: defaults.sh and nvidia-driver.sh. There is also an experimental-runtime.sh test case in the tests/cases directory that is not being executed. Consider whether this test case should also be included in the CI pipeline for more comprehensive test coverage, or document why it's excluded.

Copilot uses AI. Check for mistakes.
Comment on lines +68 to +75
run-e2e-tests:
needs: [fetch-latest-images]
uses: ./.github/workflows/e2e-tests.yaml
with:
operator_image: ghcr.io/nvidia/gpu-operator
operator_version: main-latest
use_values_override: true
secrets: inherit
Copy link

Copilot AI Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Artifacts uploaded in the fetch-latest-images job will not be accessible to the reusable e2e-tests.yaml workflow because artifacts are not shared across workflow boundaries when using workflow_call. The download-artifact action in e2e-tests.yaml will fail to find the values-overrides artifact.

To fix this, you need to either:

  1. Move the artifact generation logic into the e2e-tests.yaml workflow itself
  2. Pass the component images as inputs and regenerate the values file in e2e-tests.yaml
  3. Use outputs to pass the image names instead of artifacts

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants