WIP: Updated NodeFeatureRules for newer chips #1973
zvonkok wants to merge 1 commit into NVIDIA:main
Pull request overview
This PR consolidates GPU node labeling rules by replacing specific H100/H800 GPU model rules with broader family-based rules for Hopper and Blackwell architectures. The changes use PCI ID range matching via regex patterns instead of individual device IDs, and extend Confidential Computing (CC) capability support to include both Hopper and Blackwell GPU families.
Key changes:
- Consolidated 5 specific Hopper-based rules (H100, H100 PCIe, H100 80GB HBM3, H800, H800 PCIE) into a single "NVIDIA Hopper GPU" rule that uses a regex pattern covering the PCI ID range 0x2300-0x23ff
- Added new "NVIDIA Blackwell GPU" rule covering PCI ID range 0x2b00-0x33ff
- Updated CC capability rules to recognize both "hopper" and "blackwell" GPU families with TDX/SEV-SNP support
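As a rough illustration of the range-matching idea in the key changes above, the sketch below lists which 16-bit PCI device IDs a family regex accepts. The Hopper pattern `^23[0-9a-f]{2}$` is an assumption derived from the stated range 0x2300-0x23ff, not copied from the PR, and the helper name `matched_ids` is hypothetical. It also assumes device IDs are compared as four-digit lowercase hex strings.

```python
import re

# Assumed pattern for the Hopper range 0x2300-0x23ff (not taken from the PR).
HOPPER_PATTERN = re.compile(r"^23[0-9a-f]{2}$")


def matched_ids(pattern):
    """Return every 16-bit device ID whose 4-digit lowercase hex form matches."""
    return [dev for dev in range(0x10000) if pattern.match(f"{dev:04x}")]


hopper_ids = matched_ids(HOPPER_PATTERN)
# Print the lowest and highest matching ID and the number of IDs covered.
print(f"{min(hopper_ids):#06x}-{max(hopper_ids):#06x}, {len(hopper_ids)} IDs")
```

A single two-digit prefix covers exactly 256 device IDs, which is why one family rule can stand in for several per-model rules.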
matchExpressions:
  vendor: {op: In, value: ["10de"]}
  device: {op: In, value: ["2322"]}
# GB202-GB207, GB110-GB120 (0x2b00-0x33ff) from https://admin.pci-ids.ucw.cz/read/PC/10de
The comment mentions specific chip models "GB202-GB207, GB110-GB120", but the regex pattern `^(2[b-f]|3[0-3])[0-9a-f]{2}$` matches a much broader range (0x2b00-0x33ff). This includes values that don't correspond to the listed chip models. Consider either:
- Making the comment more generic (e.g., "Blackwell family (0x2b00-0x33ff)")
- Narrowing the regex pattern to match only the specific chip ranges if that's the intent
The current discrepancy between the comment and the pattern could be confusing for future maintainers.
Suggested change:
- # GB202-GB207, GB110-GB120 (0x2b00-0x33ff) from https://admin.pci-ids.ucw.cz/read/PC/10de
+ # Blackwell family (0x2b00-0x33ff) from https://admin.pci-ids.ucw.cz/read/PC/10de
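To make the breadth of the pattern in this review comment concrete, here is a small sketch that enumerates every 16-bit device ID accepted by the quoted regex and groups the results by high byte. It assumes IDs are matched as four-digit lowercase hex strings; the variable names are illustrative only.

```python
import re

# Pattern quoted in the review comment above.
BLACKWELL_PATTERN = re.compile(r"^(2[b-f]|3[0-3])[0-9a-f]{2}$")

# Every 16-bit device ID whose 4-digit lowercase hex form matches the pattern.
matched = [dev for dev in range(0x10000) if BLACKWELL_PATTERN.match(f"{dev:04x}")]

# The distinct high bytes (first two hex digits) among the matches.
high_bytes = sorted({dev >> 8 for dev in matched})

print([f"{hb:02x}" for hb in high_bytes])
print(len(matched))
```

The pattern accepts nine full high-byte blocks (2b through 33), so it covers 2304 device IDs in total, far more than the handful of chip models named in the code comment.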
We recently tweaked the NodeFeatureRules to work with a B200 (should be GB100, IIRC) cluster we received. Its device ID is 0x2901, which the config in this PR would not match. Any chance this can be changed?
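The gap described here can be checked directly. The sketch below tests the device ID 0x2901 against the pattern from this PR and against a widened pattern. The widened pattern is purely hypothetical: it simply adds the 0x29xx block for illustration, and whether that block is the right range for this GPU family is not established in this thread.

```python
import re

# Pattern from this PR (Blackwell rule).
PR_PATTERN = re.compile(r"^(2[b-f]|3[0-3])[0-9a-f]{2}$")

# Hypothetical widening, for illustration only: also accept 0x2900-0x29ff.
WIDENED_PATTERN = re.compile(r"^(29|2[b-f]|3[0-3])[0-9a-f]{2}$")

device_id = "2901"  # device ID reported in the comment above
pr_matches = PR_PATTERN.match(device_id) is not None
widened_matches = WIDENED_PATTERN.match(device_id) is not None

print(pr_matches, widened_matches)
```

The PR pattern rejects `2901` because its second hex digit must be in `b`-`f` when the first is `2`; the widened variant accepts it.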
We will likely want to adjust the nvidia-cc-manager too: https://github.com/NVIDIA/k8s-cc-manager/blob/45f968d36e9c4bc39e497bd87aa6f5265648d0dd/cmd/main.go#L113, see the
Is there an alternative solution that does not involve maintaining an allowlist of PCI device IDs? Is there a way to detect whether a GPU is CC-capable (when running on the node)?
We need to rework the whole CC detection ... on my list. |
Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
7eb6da9 to ef8d6da
All Hopper and Hopper+ architectures support CC.