Add cluster upgrade support by including MOFED dependency in RDMA scenarios#1942
tginer wants to merge 2 commits into NVIDIA:main
if obj.Spec.Template.Spec.NodeSelector == nil {
    obj.Spec.Template.Spec.NodeSelector = make(map[string]string)
}
obj.Spec.Template.Spec.NodeSelector["network.nvidia.com/operator.mofed.wait"] = "false"
This assumes the Network Operator is deployed (and MOFED is installed via their driver container). Will this always be the case? For example, on Ubuntu the MOFED driver can be preinstalled.
What if we skip applying this label if useHostMofed is set to true?
Yes, that should address my comment.
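For reference, the agreed behavior could look roughly like the sketch below. This is only an illustration of the idea discussed above, not the code in this PR: `rdmaConfig`, its fields, and `applyMOFEDWaitSelector` are hypothetical names, and only the label key and the `"false"` value are taken from the diff.

```go
package main

// mofedWaitLabel is the node label key used in the diff under review.
const mofedWaitLabel = "network.nvidia.com/operator.mofed.wait"

// rdmaConfig is a hypothetical stand-in for the operator's GPUDirect RDMA
// settings; the real spec type and field names may differ.
type rdmaConfig struct {
	Enabled      bool // GPUDirect RDMA is enabled
	UseHostMofed bool // MOFED is preinstalled on the host, not deployed by the Network Operator
}

// applyMOFEDWaitSelector adds the MOFED wait nodeSelector only when RDMA is
// enabled and MOFED is expected to come from the Network Operator. With host
// MOFED the label may never be set on the node, so the selector is left as-is.
func applyMOFEDWaitSelector(nodeSelector map[string]string, cfg rdmaConfig) map[string]string {
	if !cfg.Enabled || cfg.UseHostMofed {
		return nodeSelector
	}
	if nodeSelector == nil {
		nodeSelector = make(map[string]string)
	}
	nodeSelector[mofedWaitLabel] = "false"
	return nodeSelector
}
```

Returning the map keeps the nil check in one place, so a caller can assign the result straight back to the pod template's node selector.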
if obj.Spec.Template.Spec.NodeSelector == nil {
    obj.Spec.Template.Spec.NodeSelector = make(map[string]string)
}
obj.Spec.Template.Spec.NodeSelector["network.nvidia.com/operator.mofed.wait"] = "false"
Does this have any implications on what versions of Network Operator we support / work with? e.g. starting with what version of Network Operator do they add this label to nodes?
This was the commit that introduced this label, so it has been around for a long time.
…DMA scenarios.
Signed-off-by: Teresa Giner <tginer@redhat.com>
Ensure Network Operator label is read only when it is installed; excluding those cases where MOFED is independently installed
Co-authored-by: Tariq <tariq181290@gmail.com>
914e8e6 to 98a9b52
    setContainerEnv(driverToolkitContainer, "DRIVER_CONFIG_DIGEST", configDigest)
}

// add nodeSelector for MOFED wait label when GPUDirect RDMA is enabled
Let's expand this comment to describe why this change is needed / what problem it solves.
This commit makes the GPU driver wait for the MOFED driver to be ready, so that the RDMA APIs are available when the GPU driver is recompiled. This ensures the operator supports cluster upgrades, i.e., kernel version upgrades.