Add weekly forward compatibility testing#1884
Add weekly forward compatibility testing#1884ArangoGutierrez wants to merge 3 commits intoNVIDIA:mainfrom
Conversation
|
This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation. |
d55dd82 to
9dab00e
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR implements automated weekly forward compatibility testing for the GPU Operator against the latest published component images from NVIDIA repositories. The changes refactor the existing monolithic CI workflow into reusable workflow modules and add forward compatibility testing infrastructure.
Key changes:
- Refactored CI workflows into separate, reusable workflow files (variables, golang-checks, config-checks, image-builds, e2e-tests, release)
- Added forward-compatibility.yaml workflow with scheduled weekly runs and manual trigger support
- Created get-latest-images.sh script to fetch latest commit-based images from component repositories
- Extended install-operator.sh to support device-plugin and mig-manager image overrides
- Added Slack notifications for test failures
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/forward-compatibility.yaml | New workflow for weekly forward compatibility testing with Slack notifications |
| .github/workflows/ci.yaml | Refactored to use reusable workflows instead of monolithic job definitions |
| .github/workflows/variables.yaml | New reusable workflow for shared CI variables |
| .github/workflows/e2e-tests.yaml | New reusable e2e test workflow with optional component image inputs |
| .github/workflows/image-builds.yaml | New reusable workflow for GPU operator image builds |
| .github/workflows/release.yaml | New reusable workflow for release processes |
| .github/workflows/golang-checks.yaml | New reusable workflow for Go checks and tests |
| .github/workflows/config-checks.yaml | New reusable workflow for configuration validation |
| .github/workflows/code-scanning.yaml | New reusable workflow for CodeQL scanning |
| .github/scripts/get-latest-images.sh | Script to fetch latest component images from NVIDIA repositories |
| tests/scripts/install-operator.sh | Extended to support device-plugin and mig-manager image overrides |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
84a0c31 to
54d6fc9
Compare
| env: | ||
| GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| run: | | ||
| # Use workflow_dispatch inputs if provided, otherwise fetch latest |
There was a problem hiding this comment.
Question: when would someone set these variables as inputs? If the scope of this workflow is to always test the top-of-tree images for all operands, then maybe we can remove the input variables and move the entire run: block into a shell script? Thoughts?
There was a problem hiding this comment.
My intention is to allow manual triggering of the workflow in case we want to test a very specific image without having to wait for the "weekly" run. But I'll set it in a way that even during manual triggers, we will simply fetch all the latests images from operands.
| toolkit_image: ${{ needs.fetch-latest-images.outputs.toolkit_image }} | ||
| device_plugin_image: ${{ needs.fetch-latest-images.outputs.device_plugin_image }} | ||
| mig_manager_image: ${{ needs.fetch-latest-images.outputs.mig_manager_image }} |
There was a problem hiding this comment.
I understand the current tests/scripts/install-operator.sh has separate input variables for the various operand images, but would it simplify our CI definition if we passed a values override file instead? Just an idea -- the previous step fetch-latest-images can craft the values override file that points to all the latest operand images. This file is passed as input to the run-e2e-tests step which launches our e2e test script with this overrides file. Thoughts? The reason I am suggesting this is because I see a lot of boilerplate in forward-compatibility.yaml and e2e-tests.yaml for declaring all of these input variables.
There was a problem hiding this comment.
I took a stab at this comment, please let me know what you think
| toolkit_image: | ||
| description: 'Override container-toolkit image' | ||
| required: false | ||
| type: string | ||
| device_plugin_image: | ||
| description: 'Override device-plugin image' | ||
| required: false | ||
| type: string | ||
| mig_manager_image: | ||
| description: 'Override mig-manager image' | ||
| required: false | ||
| type: string |
There was a problem hiding this comment.
Why do these variables need to be inputs? Aren't we always resolving the images in this workflow itself?
| with: | ||
| operator_image: ${{ needs.variables.outputs.operator_image_base }} | ||
| operator_version: ${{ needs.variables.outputs.operator_version }} |
There was a problem hiding this comment.
Question -- could we potentially move this to the variables workflow itself? As in, make operator_image and operator_version as outputs of the variables worflow?
There was a problem hiding this comment.
good catch, done!
| with: | ||
| commit_short_sha: ${{ needs.variables.outputs.commit_short_sha }} | ||
| operator_version: ${{ needs.variables.outputs.operator_version }} | ||
| operator_image_base: ${{ needs.variables.outputs.operator_image_base }} |
There was a problem hiding this comment.
Same comment as above -- is this really needed? Can we move these variables as outputs of the variables workflow itself?
There was a problem hiding this comment.
same, done!
.github/workflows/e2e-tests.yaml
Outdated
| with: | ||
| operator_image: ${{ inputs.operator_image }} | ||
| operator_version: ${{ inputs.operator_version }} | ||
| toolkit_image: ${{ inputs.toolkit_image }} | ||
| device_plugin_image: ${{ inputs.device_plugin_image }} | ||
| mig_manager_image: ${{ inputs.mig_manager_image }} |
There was a problem hiding this comment.
What is the rationale for passing these variables as input to the variables workflow?
54d6fc9 to
203941a
Compare
| @@ -0,0 +1,74 @@ | |||
| #!/usr/bin/env bash | |||
|
|
|||
| # Copyright 2024 NVIDIA CORPORATION | |||
.github/workflows/e2e-tests.yaml
Outdated
| echo "instance_hostname=ubuntu@${{ steps.get_public_dns_name.outputs.result }}" >> $GITHUB_ENV | ||
| echo "private_key=${{ github.workspace }}/key.pem" >> $GITHUB_ENV | ||
| if [[ "${{ inputs.use_values_override }}" == "true" ]]; then | ||
| echo "VALUES_FILE=${{ github.workspace }}/values-overrides.yaml" >> $GITHUB_ENV |
There was a problem hiding this comment.
I may be missing something, but how does the remote instance know about VALUES_FILE? Do we need to pass that environment variable here?
Line 43 in 6531839
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 11 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| channel: ${{ secrets.SLACK_CHANNEL_ID }} | ||
| text: | | ||
| :x: *Forward Compatibility Test Failed for GPU Operator* | ||
|
|
||
| *Workflow:* ${{ github.workflow }} | ||
| *Repository:* ${{ github.repository }} | ||
| *Trigger:* ${{ github.event_name }} | ||
|
|
||
| *Tested Components:* | ||
| Download `values-overrides` artifact to see tested component versions | ||
|
|
||
| *Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run> | ||
| <@S095E7BNGJU> |
There was a problem hiding this comment.
The Slack notification payload format is incorrect. The action expects a JSON payload, but this is using YAML format with a nested structure. The correct format should use JSON with the payload structured as: {"channel": "${{ secrets.SLACK_CHANNEL_ID }}", "text": "message"}. The current format will cause the Slack notification to fail.
| channel: ${{ secrets.SLACK_CHANNEL_ID }} | |
| text: | | |
| :x: *Forward Compatibility Test Failed for GPU Operator* | |
| *Workflow:* ${{ github.workflow }} | |
| *Repository:* ${{ github.repository }} | |
| *Trigger:* ${{ github.event_name }} | |
| *Tested Components:* | |
| Download `values-overrides` artifact to see tested component versions | |
| *Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run> | |
| <@S095E7BNGJU> | |
| {"channel": "${{ secrets.SLACK_CHANNEL_ID }}", "text": ":x: *Forward Compatibility Test Failed for GPU Operator*\n\n*Workflow:* ${{ github.workflow }}\n*Repository:* ${{ github.repository }}\n*Trigger:* ${{ github.event_name }}\n\n*Tested Components:*\nDownload `values-overrides` artifact to see tested component versions\n\n*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>\n<@S095E7BNGJU>"} |
| # Clear individual options since we're using values file | ||
| TOOLKIT_CONTAINER_OPTIONS="" | ||
| DEVICE_PLUGIN_OPTIONS="" | ||
| MIG_MANAGER_OPTIONS="" |
There was a problem hiding this comment.
When USE_VALUES_FILE is true, OPERATOR_OPTIONS is cleared/never set, but it's still being appended to in the vGPU section (lines 81-92). The vGPU driver options added to OPERATOR_OPTIONS won't be applied when the values file approach is used (line 101-104), causing vGPU configurations to be silently ignored. The vGPU options should be added to the values file instead when USE_VALUES_FILE is true.
| # Clear individual options since we're using values file | |
| TOOLKIT_CONTAINER_OPTIONS="" | |
| DEVICE_PLUGIN_OPTIONS="" | |
| MIG_MANAGER_OPTIONS="" | |
| # Ensure individual options are defined; avoid clobbering any existing values | |
| : ${TOOLKIT_CONTAINER_OPTIONS:=""} | |
| : ${DEVICE_PLUGIN_OPTIONS:=""} | |
| : ${MIG_MANAGER_OPTIONS:=""} |
| variables: | ||
| uses: ./.github/workflows/variables.yaml | ||
| with: | ||
| operator_image: ${{ inputs.operator_image }} | ||
| operator_version: ${{ inputs.operator_version }} |
There was a problem hiding this comment.
When the workflow is triggered by a push event (lines 18-22), the inputs object is undefined, which will cause the variables workflow call to receive empty/undefined values. This will break the e2e tests when run from push events. Push events should either not call the variables workflow with inputs, or the workflow should handle the undefined inputs case properly.
| notify-failure: | ||
| runs-on: ubuntu-latest | ||
| needs: [fetch-latest-images, run-e2e-tests] | ||
| if: ${{ failure() }} |
There was a problem hiding this comment.
The notify-failure job will run only when ANY of the previous jobs fail (failure() checks all needed jobs). However, if fetch-latest-images fails, the run-e2e-tests job will be skipped (not failed). The current condition if: ${{ failure() }} will trigger even if only fetch-latest-images fails, which may not be the intended behavior. Consider using if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') }} for more precise control.
| if: ${{ failure() }} | |
| if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') }} |
.github/workflows/golang-checks.yaml
Outdated
| with: | ||
| go-version: ${{ env.GOLANG_VERSION }} | ||
| - name: Lint | ||
| uses: golangci/golangci-lint-action@v8 |
There was a problem hiding this comment.
The golangci-lint-action version has been downgraded from v9 (in the old CI workflow) to v8 (in the new golang-checks workflow). This may introduce compatibility issues or missing features. Unless there's a specific reason for the downgrade, it should be kept at v9 for consistency.
| uses: golangci/golangci-lint-action@v8 | |
| uses: golangci/golangci-lint-action@v9 |
| Download `values-overrides` artifact to see tested component versions | ||
|
|
||
| *Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run> | ||
| <@S095E7BNGJU> |
There was a problem hiding this comment.
The Slack notification contains a hardcoded user ID <@S095E7BNGJU>. This appears to be a specific user mention that will only notify one person. Consider making this configurable via a secret or input parameter, or documenting who this user is and why they're hardcoded. If this is meant to notify a team or channel, use a more appropriate mechanism.
| <@S095E7BNGJU> | |
| ${{ secrets.SLACK_MENTION }} |
| # Determine if we should push images | ||
| PUSH_ON_BUILD="false" | ||
| if [[ "${{ github.actor }}" != "dependabot[bot]" ]]; then | ||
| if [[ "${{ github.event_name }}" == "pull_request" && "${{ github.event.pull_request.head.repo.full_name }}" == "${{ github.repository }}" ]]; then | ||
| PUSH_ON_BUILD="true" | ||
| elif [[ "${{ github.event_name }}" == "push" ]]; then | ||
| PUSH_ON_BUILD="true" | ||
| elif [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then | ||
| PUSH_ON_BUILD="true" | ||
| fi |
There was a problem hiding this comment.
When the variables workflow is called from the ci.yaml push event (line 30-31 in ci.yaml), the github.event_name will be "push" inside the variables workflow, not "workflow_call". Line 84 checks for workflow_dispatch, but that won't match when called as a reusable workflow. The PUSH_ON_BUILD logic may not work correctly. Consider checking github.event_name == 'workflow_call' or relying solely on the calling workflow's event type.
tests/scripts/install-operator.sh
Outdated
| # Create a combined values file | ||
| COMBINED_VALUES=$(mktemp) | ||
| cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}" |
There was a problem hiding this comment.
Simply concatenating YAML files with cat does not properly merge them. If both files define the same keys (e.g., both define toolkit:), the resulting YAML will have duplicate keys, which may cause unpredictable behavior or errors when parsed by Helm. Consider using a proper YAML merge tool like yq to merge the values files, or ensure that the files don't have overlapping keys.
| # Create a combined values file | |
| COMBINED_VALUES=$(mktemp) | |
| cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}" | |
| # Create a combined values file by merging YAML documents | |
| COMBINED_VALUES=$(mktemp) | |
| if command -v yq >/dev/null 2>&1; then | |
| yq ea '. as $item ireduce ({}; . * $item )' "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}" | |
| else | |
| echo "Error: yq is required to merge YAML values files but was not found in PATH." >&2 | |
| exit 1 | |
| fi |
| TOOLKIT=$(.github/scripts/get-latest-images.sh toolkit) | ||
|
|
||
| echo "::notice::Fetching latest device-plugin image..." | ||
| DEVICE_PLUGIN=$(.github/scripts/get-latest-images.sh device-plugin) | ||
|
|
||
| echo "::notice::Fetching latest mig-manager image..." | ||
| MIG_MANAGER=$(.github/scripts/get-latest-images.sh mig-manager) | ||
|
|
||
| # Generate values override file | ||
| .github/scripts/generate-values-overrides.sh \ | ||
| values-overrides.yaml \ | ||
| "${TOOLKIT}" \ | ||
| "${DEVICE_PLUGIN}" \ | ||
| "${MIG_MANAGER}" |
There was a problem hiding this comment.
The new shell scripts (get-latest-images.sh, generate-values-overrides.sh) are being executed directly in the workflow without a chmod command. These files must be committed with executable permissions set (chmod +x), otherwise the workflow will fail with a "Permission denied" error. Ensure these files have the executable bit set in git before merging.
.github/workflows/e2e-tests.yaml
Outdated
| push: | ||
| branches: | ||
| - "pull-request/[0-9]+" | ||
| - main | ||
| - release-* |
There was a problem hiding this comment.
When e2e-tests.yaml is triggered via push event (lines 18-22), secrets are not inherited automatically and the workflow will fail when trying to access AWS_ACCESS_KEY_ID and other required secrets. The workflow expects these secrets to be passed explicitly (as done in ci.yaml line 57 with secrets: inherit), but when triggered directly via push, no secrets context exists. Either remove the push trigger or make secrets available for push events.
| push: | |
| branches: | |
| - "pull-request/[0-9]+" | |
| - main | |
| - release-* |
Break the monolithic ci.yaml into focused, reusable workflow files: code-scanning, config-checks, golang-checks, image-builds, and release. Centralize shared variables in variables.yaml. Standardize copyright headers across all workflow files. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add e2e-tests.yaml reusable workflow for end-to-end GPU operator testing. Introduce env-to-values.sh to convert environment variables to Helm values overrides. Update install-operator.sh to use yq-based YAML merging for component image configuration. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add forward-compatibility.yaml workflow that runs weekly to test the GPU operator against latest upstream component images (toolkit, device-plugin, mig-manager). Includes get-latest-images.sh with retry/backoff for image verification and generate-values-overrides.sh for Helm values generation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
203941a to
764a66e
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 11 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| echo "Warning: yq not found, falling back to concatenation (may have issues with duplicate keys)" >&2 | ||
| cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}" |
There was a problem hiding this comment.
When yq is not available, the script falls back to simple concatenation of YAML files which can produce invalid YAML if there are duplicate top-level keys. While this provides a fallback, consider either:
- Making yq a required dependency and failing if it's not found
- Documenting this limitation more clearly
- Implementing a more robust merge strategy that at least detects and warns about duplicate keys
The current fallback might lead to silent failures that are difficult to debug.
| echo "Warning: yq not found, falling back to concatenation (may have issues with duplicate keys)" >&2 | |
| cat "${VALUES_FILE}" "${TEMP_ENV_VALUES}" > "${COMBINED_VALUES}" | |
| echo "Error: 'yq' command not found. It is required to safely merge values files." >&2 | |
| echo "Please install yq (https://mikefarah.gitbook.io/yq/) and re-run this script." >&2 | |
| exit 1 |
| operator_version: | ||
| description: 'Operator version to test (override)' | ||
| required: false | ||
| type: string |
There was a problem hiding this comment.
The use_values_override input is available for workflow_call but not for workflow_dispatch. This inconsistency means that users who manually trigger the E2E Tests workflow via workflow_dispatch won't be able to use the values override feature, even though the workflow logic supports it. Consider adding this input to the workflow_dispatch section for consistency.
| type: string | |
| type: string | |
| use_values_override: | |
| description: 'Use values-overrides artifact for component image configuration' | |
| required: false | |
| type: boolean | |
| default: false |
| - name: Run e2e tests | ||
| env: | ||
| OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }} | ||
| OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }} | ||
| GPU_PRODUCT_NAME: "Tesla-T4" | ||
| SKIP_LAUNCH: "true" | ||
| CONTAINER_RUNTIME: "containerd" | ||
| TEST_CASE: "./tests/cases/defaults.sh" | ||
| run: | | ||
| ./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$? | ||
| ./tests/scripts/pull.sh /tmp/logs logs | ||
| exit $rc |
There was a problem hiding this comment.
The VALUES_FILE environment variable is set in step "Copy values override file to remote" but is not explicitly included in the env block of the "Run e2e tests" step. While variables set with $GITHUB_ENV persist across steps, it's more explicit and maintainable to include VALUES_FILE in the env block of the test step, similar to how other environment variables like OPERATOR_VERSION and OPERATOR_IMAGE are specified. This makes the dependencies clearer and prevents potential issues if the step execution context changes.
| - name: Run e2e tests | ||
| env: | ||
| OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }} | ||
| OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }} | ||
| GPU_PRODUCT_NAME: "Tesla-T4" | ||
| SKIP_LAUNCH: "true" | ||
| CONTAINER_RUNTIME: "containerd" | ||
| TEST_CASE: "./tests/cases/nvidia-driver.sh" | ||
| run: | | ||
| ./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$? | ||
| ./tests/scripts/pull.sh /tmp/logs logs | ||
| exit $rc |
There was a problem hiding this comment.
The VALUES_FILE environment variable is set in step "Copy values override file to remote" but is not explicitly included in the env block of the "Run e2e tests" step. While variables set with $GITHUB_ENV persist across steps, it's more explicit and maintainable to include VALUES_FILE in the env block of the test step, similar to how other environment variables like OPERATOR_VERSION and OPERATOR_IMAGE are specified. This makes the dependencies clearer and prevents potential issues if the step execution context changes.
| - name: Get latest component images and generate values override file | ||
| env: | ||
| GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} | ||
| run: | | ||
| # Fetch latest images from component repositories | ||
| echo "::notice::Fetching latest container-toolkit image..." | ||
| TOOLKIT=$(.github/scripts/get-latest-images.sh toolkit) | ||
|
|
||
| echo "::notice::Fetching latest device-plugin image..." | ||
| DEVICE_PLUGIN=$(.github/scripts/get-latest-images.sh device-plugin) | ||
|
|
||
| echo "::notice::Fetching latest mig-manager image..." | ||
| MIG_MANAGER=$(.github/scripts/get-latest-images.sh mig-manager) | ||
|
|
||
| # Generate values override file | ||
| .github/scripts/generate-values-overrides.sh \ | ||
| values-overrides.yaml \ | ||
| "${TOOLKIT}" \ | ||
| "${DEVICE_PLUGIN}" \ | ||
| "${MIG_MANAGER}" |
There was a problem hiding this comment.
The get-latest-images.sh script requires jq to parse JSON responses from the GitHub API, but there's no installation step for jq in the fetch-latest-images job. While jq is pre-installed on ubuntu-latest runners, this creates an implicit dependency that isn't documented. Consider either:
- Adding an explicit jq installation step for clarity and robustness
- Adding a comment noting the dependency on the ubuntu-latest pre-installed tools
- Adding a check at the beginning of the script similar to the regctl check (lines 25-29)
| name: values-overrides | ||
| path: ${{ github.workspace }} | ||
| - name: Set up Holodeck | ||
| uses: NVIDIA/holodeck@v0.2.17 |
There was a problem hiding this comment.
The Holodeck version has been downgraded from v0.2.18 (used in the original ci.yaml) to v0.2.17. This should match the version used in the e2e-tests-containerd job for consistency. Unless this is intentional for a specific reason, the version should be updated to v0.2.18 to match the original CI workflow.
| uses: NVIDIA/holodeck@v0.2.17 | |
| uses: NVIDIA/holodeck@v0.2.18 |
| - name: Send Slack alert notification | ||
| uses: slackapi/slack-github-action@v2.1.1 | ||
| with: | ||
| method: chat.postMessage | ||
| token: ${{ secrets.SLACK_BOT_TOKEN }} | ||
| payload: | | ||
| { | ||
| "channel": "${{ secrets.SLACK_CHANNEL_ID }}", | ||
| "text": ":x: *Forward Compatibility Test Failed for GPU Operator*\n\n*Workflow:* ${{ github.workflow }}\n*Repository:* ${{ github.repository }}\n*Trigger:* ${{ github.event_name }}\n\n*Tested Components:*\nDownload `values-overrides` artifact to see tested component versions\n\n*Details:* <https://github.com/${{ github.repository }}/actions/runs/${{ github.run_id }}|View Failed Run>\n\n${{ secrets.SLACK_MENTION_LIST }}" | ||
| } |
There was a problem hiding this comment.
The Slack notification step will fail if SLACK_BOT_TOKEN or SLACK_CHANNEL_ID secrets are not configured in the repository. While the secrets are marked as optional in the e2e-tests.yaml workflow definition (lines 38-41), the Slack action will still fail if they're missing or empty when the notify-failure job runs.
Consider adding a condition to check if the secrets are available before attempting to send the notification, such as:
if: ${{ always() && (needs.fetch-latest-images.result == 'failure' || needs.run-e2e-tests.result == 'failure') && secrets.SLACK_BOT_TOKEN != '' }}
| operator_version: | ||
| required: true | ||
| type: string |
There was a problem hiding this comment.
The operator_version input is defined in the workflow_call inputs (line 27) but is never used in the workflow. The variables job only uses commit_short_sha and operator_image_base inputs. Either this input should be removed if it's not needed, or it should be used appropriately in the workflow logic.
| operator_version: | |
| required: true | |
| type: string |
| e2e-tests-containerd: | ||
| needs: [variables] | ||
| runs-on: linux-amd64-cpu4 | ||
| permissions: | ||
| contents: read | ||
| id-token: write | ||
| steps: | ||
| - uses: actions/checkout@v6 | ||
| name: Check out code | ||
| - name: Download values override file | ||
| if: ${{ inputs.use_values_override }} | ||
| uses: actions/download-artifact@v6 | ||
| with: | ||
| name: values-overrides | ||
| path: ${{ github.workspace }} | ||
| - name: Set up Holodeck | ||
| uses: NVIDIA/holodeck@v0.2.17 | ||
| with: | ||
| aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
| aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
| aws_ssh_key: ${{ secrets.AWS_SSH_KEY }} | ||
| holodeck_config: "tests/holodeck.yaml" | ||
| - name: Get public dns name | ||
| id: get_public_dns_name | ||
| uses: mikefarah/yq@master | ||
| with: | ||
| cmd: yq '.status.properties[] | select(.name == "public-dns-name") | .value' /github/workspace/.cache/holodeck.yaml | ||
| - name: Set test environment | ||
| run: | | ||
| echo "instance_hostname=ubuntu@${{ steps.get_public_dns_name.outputs.result }}" >> $GITHUB_ENV | ||
| echo "private_key=${{ github.workspace }}/key.pem" >> $GITHUB_ENV | ||
| - name: Write SSH key | ||
| run: | | ||
| echo "${{ secrets.AWS_SSH_KEY }}" > ${private_key} && chmod 400 ${private_key} | ||
| - name: Copy values override file to remote | ||
| if: ${{ inputs.use_values_override }} | ||
| run: | | ||
| scp -i ${private_key} -o StrictHostKeyChecking=no \ | ||
| ${{ github.workspace }}/values-overrides.yaml \ | ||
| ${instance_hostname}:/tmp/values-overrides.yaml | ||
| echo "VALUES_FILE=/tmp/values-overrides.yaml" >> $GITHUB_ENV | ||
| - name: Run e2e tests | ||
| env: | ||
| OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }} | ||
| OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }} | ||
| GPU_PRODUCT_NAME: "Tesla-T4" | ||
| SKIP_LAUNCH: "true" | ||
| CONTAINER_RUNTIME: "containerd" | ||
| TEST_CASE: "./tests/cases/defaults.sh" | ||
| run: | | ||
| ./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$? | ||
| ./tests/scripts/pull.sh /tmp/logs logs | ||
| exit $rc | ||
| - name: Archive test logs | ||
| if: ${{ failure() }} | ||
| uses: actions/upload-artifact@v6 | ||
| with: | ||
| name: containerd-e2e-test-logs | ||
| path: ./logs/ | ||
| retention-days: 15 | ||
|
|
||
| e2e-tests-nvidiadriver: | ||
| needs: [variables] | ||
| runs-on: linux-amd64-cpu4 | ||
| permissions: | ||
| contents: read | ||
| id-token: write | ||
| steps: | ||
| - uses: actions/checkout@v6 | ||
| name: Check out code | ||
| - name: Download values override file | ||
| if: ${{ inputs.use_values_override }} | ||
| uses: actions/download-artifact@v6 | ||
| with: | ||
| name: values-overrides | ||
| path: ${{ github.workspace }} | ||
| - name: Set up Holodeck | ||
| uses: NVIDIA/holodeck@v0.2.17 | ||
| with: | ||
| aws_access_key_id: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
| aws_secret_access_key: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
| aws_ssh_key: ${{ secrets.AWS_SSH_KEY }} | ||
| holodeck_config: "tests/holodeck.yaml" | ||
| - name: Get public dns name | ||
| id: get_public_dns_name | ||
| uses: mikefarah/yq@master | ||
| with: | ||
| cmd: yq '.status.properties[] | select(.name == "public-dns-name") | .value' /github/workspace/.cache/holodeck.yaml | ||
| - name: Set test environment | ||
| run: | | ||
| echo "instance_hostname=ubuntu@${{ steps.get_public_dns_name.outputs.result }}" >> $GITHUB_ENV | ||
| echo "private_key=${{ github.workspace }}/key.pem" >> $GITHUB_ENV | ||
| - name: Write SSH key | ||
| run: | | ||
| echo "${{ secrets.AWS_SSH_KEY }}" > ${private_key} && chmod 400 ${private_key} | ||
| - name: Copy values override file to remote | ||
| if: ${{ inputs.use_values_override }} | ||
| run: | | ||
| scp -i ${private_key} -o StrictHostKeyChecking=no \ | ||
| ${{ github.workspace }}/values-overrides.yaml \ | ||
| ${instance_hostname}:/tmp/values-overrides.yaml | ||
| echo "VALUES_FILE=/tmp/values-overrides.yaml" >> $GITHUB_ENV | ||
| - name: Run e2e tests | ||
| env: | ||
| OPERATOR_VERSION: ${{ needs.variables.outputs.operator_version }} | ||
| OPERATOR_IMAGE: ${{ needs.variables.outputs.operator_image }} | ||
| GPU_PRODUCT_NAME: "Tesla-T4" | ||
| SKIP_LAUNCH: "true" | ||
| CONTAINER_RUNTIME: "containerd" | ||
| TEST_CASE: "./tests/cases/nvidia-driver.sh" | ||
| run: | | ||
| ./tests/ci-run-e2e.sh ${OPERATOR_IMAGE} ${OPERATOR_VERSION} ${GPU_PRODUCT_NAME} ${TEST_CASE} || rc=$? | ||
| ./tests/scripts/pull.sh /tmp/logs logs | ||
| exit $rc | ||
| - name: Archive test logs | ||
| if: ${{ failure() }} | ||
| uses: actions/upload-artifact@v6 | ||
| with: | ||
| name: nvidiadriver-e2e-test-logs | ||
| path: ./logs/ | ||
| retention-days: 15 |
There was a problem hiding this comment.
The E2E test workflow only runs two test cases: defaults.sh and nvidia-driver.sh. There is also an experimental-runtime.sh test case in the tests/cases directory that is not being executed. Consider whether this test case should also be included in the CI pipeline for more comprehensive test coverage, or document why it's excluded.
| run-e2e-tests: | ||
| needs: [fetch-latest-images] | ||
| uses: ./.github/workflows/e2e-tests.yaml | ||
| with: | ||
| operator_image: ghcr.io/nvidia/gpu-operator | ||
| operator_version: main-latest | ||
| use_values_override: true | ||
| secrets: inherit |
There was a problem hiding this comment.
Artifacts uploaded in the fetch-latest-images job will not be accessible to the reusable e2e-tests.yaml workflow because artifacts are not shared across workflow boundaries when using workflow_call. The download-artifact action in e2e-tests.yaml will fail to find the values-overrides artifact.
To fix this, you need to either:
- Move the artifact generation logic into the e2e-tests.yaml workflow itself
- Pass the component images as inputs and regenerate the values file in e2e-tests.yaml
- Use outputs to pass the image names instead of artifacts
Implement automated forward compatibility tests that validate GPU Operator
against the latest published images from NVIDIA component repositories.
Changes:
Components tested:
We could add other operands later, but I wanted to start with the core ones.
Schedule: Every Monday at 2 AM UTC