[DRAFT] node-feature-rules: Add 0x2321 as CC-capable device #1798
manuelh-dev wants to merge 1 commit into NVIDIA:main from
- feature: pci.device
  matchExpressions:
    vendor: {op: In, value: ["10de"]}
    device: {op: In, value: ["2321"]}
Ideally we wouldn't add devices piecemeal. Can we "whitelist" by pattern instead: 0x23**, 0x2b**, GBXXX -- Blackwell, GBXXX -- Hopper? Some devices may then need to be excluded via a "blacklist", e.g. exclude 2b00 TA1090SA [THOR].
One thing to note is that matchExpressions don't allow wildcards (as far as I am aware). Is there another component that could or should create these labels instead of a node feature rule directly?
By the way, we could do something like the following:
- name: "NVIDIA Hopper GPU Family"
  labels:
    "nvidia.com/gpu.family": "hopper"
  matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}
        device: {op: InRegexp, value: ["^23[0-9a-f]{2}$"]}
- name: "NVIDIA Blackwell GPU Family"
  labels:
    "nvidia.com/gpu.family": "blackwell"
  matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}
        device: {op: InRegexp, value: ["^2b[0-9a-f]{2}$"]}
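To sanity-check the two patterns above against the IDs mentioned in this thread, here is a minimal Python sketch. It assumes device IDs are compared as lowercase hex without the `0x` prefix, and it treats `2b00` as an excluded ID, following the earlier "blacklist" suggestion. `gpu_family` and `EXCLUDED_IDS` are hypothetical names for illustration only, not part of NFD or this repository.

```python
import re
from typing import Optional

# Patterns copied from the rules proposed above. NFD is written in Go, but
# expressions this simple should behave the same under Python's re module.
FAMILY_PATTERNS = {
    "hopper": re.compile(r"^23[0-9a-f]{2}$"),
    "blackwell": re.compile(r"^2b[0-9a-f]{2}$"),
}

# Assumption: IDs that match a family pattern but should still be excluded
# (2b00 is the example named earlier in this thread).
EXCLUDED_IDS = {"2b00"}


def gpu_family(device_id: str) -> Optional[str]:
    """Return the family label for a PCI device ID, or None if unmatched."""
    normalized = device_id.strip().lower()
    if normalized.startswith("0x"):
        normalized = normalized[2:]
    if normalized in EXCLUDED_IDS:
        return None
    for family, pattern in FAMILY_PATTERNS.items():
        if pattern.match(normalized):
            return family
    return None


if __name__ == "__main__":
    for dev in ["0x2321", "0x2339", "0x233d", "2b00"]:
        print(dev, "->", gpu_family(dev))
```

Note that without the exclusion set, `2b00` would match the Blackwell pattern, which is why a pure regex rule may need a separate exclusion step.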
@@ -34,7 +34,9 @@ spec:
    fieldRef:
      fieldPath: spec.nodeName
- name: CC_CAPABLE_DEVICE_IDS
I wonder where this CC_CAPABLE_DEVICE_IDS variable is being referenced.
Thank you! It looks like we may need a change in the k8s-cc-manager as well then. If we want to allow all 0x23** and 0x2b** (Hopper/Blackwell) GPUs, we would rather not pass a list of specific device IDs.
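One possible shape for such a change, sketched in Python purely for illustration: keep the exact-ID list and additionally accept ID prefixes. This is not how k8s-cc-manager is implemented. The `CC_CAPABLE_DEVICE_ID_PREFIXES` variable and the `is_cc_capable` helper are hypothetical names invented here; only `CC_CAPABLE_DEVICE_IDS` appears in this PR.

```python
import os

# Hypothetical variable name; only CC_CAPABLE_DEVICE_IDS appears in this PR.
PREFIX_ENV = "CC_CAPABLE_DEVICE_ID_PREFIXES"


def _normalize(value):
    """Lowercase a hex ID and drop an optional 0x prefix."""
    value = value.strip().lower()
    return value[2:] if value.startswith("0x") else value


def _parse_list(raw):
    """Split a comma-separated string into normalized, non-empty entries."""
    return [_normalize(item) for item in raw.split(",") if item.strip()]


def is_cc_capable(device_id, env=None):
    """Check a device ID against the exact-ID list and the prefix list."""
    env = os.environ if env is None else env
    dev = _normalize(device_id)
    exact_ids = _parse_list(env.get("CC_CAPABLE_DEVICE_IDS", ""))
    prefixes = _parse_list(env.get(PREFIX_ENV, ""))
    return dev in exact_ids or any(dev.startswith(p) for p in prefixes)


if __name__ == "__main__":
    env = {
        "CC_CAPABLE_DEVICE_IDS": "0x2339,0x2331,0x2330,0x2324,0x2322,0x233d",
        PREFIX_ENV: "0x23,0x2b",
    }
    print(is_cc_capable("0x2321", env))
```

With only the exact-ID list set, `0x2321` is rejected, which is the situation this PR works around by appending one more ID.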
imagePullSecrets: []
env:
  - name: CC_CAPABLE_DEVICE_IDS
    value: "0x2339,0x2331,0x2330,0x2324,0x2322,0x233d"
As mentioned offline: The envvars from the values file should probably be removed so that a user can properly override them. The defaults should be specified in the daemonset template instead.
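The intended precedence can be illustrated with a small Python sketch: defaults live with the daemonset template, and any user-supplied entry with the same name wins. This assumes overrides are merged by variable name. `merge_env` and `TEMPLATE_DEFAULTS` are hypothetical names for illustration and do not describe the operator's actual code.

```python
# Assumption: defaults that would live in the daemonset template rather than
# in values.yaml. The value is the list currently shown in this diff.
TEMPLATE_DEFAULTS = [
    {
        "name": "CC_CAPABLE_DEVICE_IDS",
        "value": "0x2339,0x2331,0x2330,0x2324,0x2322,0x233d",
    },
]


def merge_env(defaults, overrides):
    """Merge two env lists by name; entries in overrides take precedence."""
    merged = {item["name"]: item["value"] for item in defaults}
    merged.update({item["name"]: item["value"] for item in overrides})
    return [{"name": name, "value": value} for name, value in merged.items()]


if __name__ == "__main__":
    user_env = [{"name": "CC_CAPABLE_DEVICE_IDS", "value": "0x2321"}]
    print(merge_env(TEMPLATE_DEFAULTS, user_env))
```

With an empty values-file list, the template defaults apply unchanged; with a user entry, only that variable is replaced.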
Thank you! As discussed offline, the change in which some of the dev defaults were removed from values.yaml was #1580; the ccManager envvars may have been missed in that cleanup.
Closing this pull request as we will address this in a more generic way via #1973
No description provided.