[improvement] : Capture nvidiadriver CRs as well #2097

Merged
rahulait merged 1 commit into NVIDIA:main from rahulait:update-must-gather
Feb 5, 2026

rahulait commented Feb 4, 2026

Changes to the must-gather script:

  1. Add a step to capture nvidiadriver CRs.
  2. Capture pod logs with timestamps. Without timestamps it is difficult to tell, for driver pods for example, when the pod restarted and when each log line was written.
  3. Add labels to filter driver pods when the driver daemonsets are provisioned through nvidiadriver. The nvidia-bug-report script now also runs on nodes where the driver pod is provisioned by an nvidiadriver CR.
  4. Add logic to capture which processes, if any, are using the GPU. This helps identify cases where no process is expected to be using the GPU.
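
As an illustration of items 1 to 3, here is a minimal sketch of how such capture steps might look. It is not the code from this PR: the function names, the `OUT` directory, the default namespace, and the `app.kubernetes.io/component=nvidia-driver` selector are all assumptions chosen for the example. Only the `kubectl` flags themselves (`-l`, `-o`, `--all-containers`, `--timestamps`) are standard.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names and defaults are assumptions, not the real must-gather.sh.
K="${K:-kubectl}"                               # CLI to use (kubectl or oc)
NS="${OPERATOR_NAMESPACE:-gpu-operator}"        # assumed operator namespace
OUT="${ARTIFACT_DIR:-/tmp/must-gather-sketch}"  # assumed output directory
# Assumed label shared by driver pods created via ClusterPolicy or NVIDIADriver.
DRIVER_SELECTOR="${DRIVER_SELECTOR:-app.kubernetes.io/component=nvidia-driver}"

capture_nvidiadriver_crs() {
    mkdir -p "$OUT"
    # Dump all NVIDIADriver custom resources. Errors (for example, CRD not
    # installed) are written to the same file instead of aborting the run.
    $K get nvidiadrivers.nvidia.com -o yaml > "$OUT/nvidiadrivers.yaml" 2>&1 || true
}

capture_driver_pod_logs() {
    local pod
    mkdir -p "$OUT/logs"
    for pod in $($K get pods -n "$NS" -l "$DRIVER_SELECTOR" -o name); do
        # --timestamps prefixes every log line with the time it was written,
        # which makes restarts visible in the captured output.
        $K logs -n "$NS" "$pod" --all-containers --timestamps \
            > "$OUT/logs/${pod##*/}.log" 2>&1 || true
    done
}
```

Calling `capture_nvidiadriver_crs` and then `capture_driver_pod_logs` writes one YAML file and one log file per driver pod under `$OUT`. Using a single selector for both provisioning paths avoids maintaining two separate pod queries.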

Description

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing


rahulait commented Feb 4, 2026

@cdesiniotis @rajathagasthya re-requesting review because I added one more commit. It adds a label selector for nvidiadriver pods so that we also capture nvidia-bug-report output from nodes when the nvidiadriver CR is used. I also added logic to capture GPU usage on nodes, which makes it easier to troubleshoot scenarios where something outside of gpu-operator took access to the GPU and causes issues with driver installs or upgrades.

It can be tested using:

curl -fsSL https://raw.githubusercontent.com/rahulait/gpu-operator/refs/heads/update-must-gather/hack/must-gather.sh | bash
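
For readers who want a feel for the GPU-usage capture mentioned in this comment, the following is a rough sketch under stated assumptions, not the logic merged in this PR. The function name, the `OUT` directory, the default namespace, and the label selector are hypothetical. It also assumes `nvidia-smi` is available inside the driver pod, and it lists only compute processes.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names and defaults are assumptions, not the real must-gather.sh.
K="${K:-kubectl}"                               # CLI to use (kubectl or oc)
NS="${OPERATOR_NAMESPACE:-gpu-operator}"        # assumed operator namespace
OUT="${ARTIFACT_DIR:-/tmp/must-gather-sketch}"  # assumed output directory
# Assumed label shared by driver pods created via ClusterPolicy or NVIDIADriver.
DRIVER_SELECTOR="${DRIVER_SELECTOR:-app.kubernetes.io/component=nvidia-driver}"

capture_gpu_processes() {
    local pod node
    mkdir -p "$OUT/gpu-usage"
    for pod in $($K get pods -n "$NS" -l "$DRIVER_SELECTOR" -o name); do
        # Name the output file after the node the driver pod runs on.
        node=$($K get -n "$NS" "$pod" -o jsonpath='{.spec.nodeName}')
        # List compute processes currently holding the GPU on that node.
        $K exec -n "$NS" "$pod" -- nvidia-smi \
            --query-compute-apps=pid,process_name,used_memory --format=csv \
            > "$OUT/gpu-usage/${node}.csv" 2>&1 || true
    done
}
```

A file containing only the CSV header would suggest that no compute process was using the GPU on that node at capture time.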

tariq1890 commented

Can you amend the commit message to "[improvement] capture NVIDIADriver CRs in must-gather"?

Changes include:
1. step to capture nvidiadriver CRs as well
2. capture pod logs with timestamps
3. use common labelselector which works for both clusterpolicy and nvidiadriver
4. add logic to capture which processes are using GPU if any. This will help in identifying cases when we don't expect anyone to be using GPU

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>

rahulait commented Feb 4, 2026

> Can you amend the commit message to "[improvement] capture NVIDIADriver CRs in must-gather"?

Yup, updated.

karthikvetrivel left a comment

Thanks for taking a look at this, Rahul. Looks great.

rahulait merged commit 3555cc1 into NVIDIA:main Feb 5, 2026
16 checks passed
5 participants