Add cluster upgrade support by including MOFED dependency in RDMA scenarios #1942

Open
tginer wants to merge 2 commits into NVIDIA:main from tginer:rdma-nodeselector-mofedwait

tginer commented Nov 26, 2025

This commit makes the GPU driver wait until the MOFED driver is ready, so that the RDMA APIs are available when the GPU driver is recompiled. This ensures the operator supports cluster upgrades, i.e., kernel version upgrades.
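For readers unfamiliar with the mechanism: a Kubernetes `nodeSelector` only allows a pod onto a node whose labels contain every key/value pair in the selector, so adding the MOFED wait label to the driver pod's selector holds the pod back until the node carries that label with the expected value. The following is a minimal sketch of that matching rule, not code from this PR; `matchesNodeSelector` is a hypothetical helper written for illustration.

```go
package main

import "fmt"

// matchesNodeSelector is a hypothetical helper that mirrors the nodeSelector
// rule: every key/value pair in the selector must be present on the node.
func matchesNodeSelector(selector, nodeLabels map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Label key and value taken from the diff in this PR.
	selector := map[string]string{"network.nvidia.com/operator.mofed.wait": "false"}

	// Assumed node states, for illustration only.
	stillWaiting := map[string]string{"network.nvidia.com/operator.mofed.wait": "true"}
	ready := map[string]string{"network.nvidia.com/operator.mofed.wait": "false"}

	fmt.Println(matchesNodeSelector(selector, stillWaiting))
	fmt.Println(matchesNodeSelector(selector, ready))
}
```

Note that a node without the label at all also fails to match, which is the case raised in the review discussion about hosts with a preinstalled MOFED driver.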

copy-pr-bot bot commented Nov 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

if obj.Spec.Template.Spec.NodeSelector == nil {
	obj.Spec.Template.Spec.NodeSelector = make(map[string]string)
}
obj.Spec.Template.Spec.NodeSelector["network.nvidia.com/operator.mofed.wait"] = "false"
Contributor

This assumes the Network Operator is deployed (and that MOFED is installed via its driver container). Will this always be the case? For example, on Ubuntu the MOFED driver can be preinstalled on the host.

Contributor

What if we skip applying this label if useHostMofed is set to true?

Contributor

Yes, that should address my comment.
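The agreed change in this thread can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: `rdmaConfig` and `applyMofedWaitSelector` are hypothetical names, and only the label key, the `"false"` value, and the `useHostMofed` option come from the discussion above.

```go
package main

import "fmt"

// mofedWaitLabel is the node label used in the diff under review.
const mofedWaitLabel = "network.nvidia.com/operator.mofed.wait"

// rdmaConfig is a hypothetical stand-in for the operator's GPUDirect RDMA settings.
type rdmaConfig struct {
	Enabled      bool
	UseHostMofed bool
}

// applyMofedWaitSelector (hypothetical) adds the MOFED wait label to the
// node selector only when RDMA is enabled and MOFED is not host-installed.
// With a host-installed MOFED the Network Operator may not be present to set
// the label, so the selector is left untouched in that case.
func applyMofedWaitSelector(nodeSelector map[string]string, cfg rdmaConfig) map[string]string {
	if !cfg.Enabled || cfg.UseHostMofed {
		return nodeSelector
	}
	if nodeSelector == nil {
		nodeSelector = make(map[string]string)
	}
	nodeSelector[mofedWaitLabel] = "false"
	return nodeSelector
}

func main() {
	withOperator := applyMofedWaitSelector(nil, rdmaConfig{Enabled: true})
	hostMofed := applyMofedWaitSelector(nil, rdmaConfig{Enabled: true, UseHostMofed: true})
	fmt.Println(withOperator[mofedWaitLabel], len(hostMofed))
}
```

Returning the map keeps the nil check in one place; in the operator itself the assignment would go to `obj.Spec.Template.Spec.NodeSelector`, as in the diff.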

if obj.Spec.Template.Spec.NodeSelector == nil {
	obj.Spec.Template.Spec.NodeSelector = make(map[string]string)
}
obj.Spec.Template.Spec.NodeSelector["network.nvidia.com/operator.mofed.wait"] = "false"
Contributor

Does this have any implications for which versions of Network Operator we support or work with? For example, starting with which version does Network Operator add this label to nodes?

Contributor

This was the commit that introduced this label, so it has been around for a long time.

tginer and others added 2 commits February 4, 2026 12:31
…DMA scenarios.

Signed-off-by: Teresa Giner <tginer@redhat.com>
Ensure Network Operator label is read only when it is installed; excluding those cases where MOFED is independently installed

Co-authored-by: Tariq <tariq181290@gmail.com>
tginer force-pushed the rdma-nodeselector-mofedwait branch 2 times, most recently from 914e8e6 to 98a9b52, February 4, 2026 15:23
setContainerEnv(driverToolkitContainer, "DRIVER_CONFIG_DIGEST", configDigest)
}

// add nodeSelector for MOFED wait label when GPUDirect RDMA is enabled
Contributor

Let's expand this comment to describe why this change is needed / what problem it solves.

3 participants