feat(gpu): disable NFD/GFD and remove nodeAffinity from device plugin chart #497

Merged
pimlock merged 1 commit into main from
feat/simplify-device-plugin-deployment
Mar 20, 2026

@elezar (Member) commented Mar 20, 2026

Summary

  • Disables GPU Feature Discovery (GFD) and Node Feature Discovery (NFD) DaemonSets in the NVIDIA device plugin HelmChart
  • Overrides the device plugin's default nodeAffinity to null so the DaemonSet schedules unconditionally on the single-node gateway without requiring NFD/GFD labels (feature.node.kubernetes.io/pci-10de.present=true or nvidia.com/gpu.present=true)
  • Updates architecture docs and debug skill to reflect the change
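
To illustrate why the default nodeAffinity blocks scheduling once NFD/GFD stop labelling the node, here is a minimal sketch of how required node affinity terms are evaluated (terms are ORed, expressions inside a term are ANDed). The `DEFAULT_TERMS` structure is an assumption reduced to the two labels named above, and `schedulable` is a hypothetical helper, not code from the chart or the scheduler.

```python
from typing import Optional

# Assumed, simplified form of the chart's default affinity: one term per
# label mentioned in this PR. The real chart may define more terms.
DEFAULT_TERMS = [
    [("feature.node.kubernetes.io/pci-10de.present", "true")],
    [("nvidia.com/gpu.present", "true")],
]


def schedulable(node_labels: dict, terms: Optional[list]) -> bool:
    """Return True if a pod with these required terms may land on the node."""
    if terms is None:
        # affinity: null means the template emits no affinity block at all
        return True
    # Any single term may match, but every expression in that term must hold
    return any(
        all(node_labels.get(key) == value for key, value in term)
        for term in terms
    )


# With NFD/GFD disabled, the gateway node carries neither label.
bare_node = {"kubernetes.io/hostname": "gateway"}
labelled_node = {"nvidia.com/gpu.present": "true"}

print(schedulable(bare_node, DEFAULT_TERMS))      # False
print(schedulable(labelled_node, DEFAULT_TERMS))  # True
print(schedulable(bare_node, None))               # True
```

The first case is the situation this change avoids: no labels, default affinity, DaemonSet never scheduled.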

Related Issue

N/A — pure declarative simplification with no new runtime code paths.

Changes

  • deploy/kube/gpu-manifests/nvidia-device-plugin-helmchart.yaml: disable gfd/nfd, set affinity: null override, update comment block
  • architecture/gateway-single-node.md: update GPU Enablement section to explain NFD/GFD are disabled and why
  • .agents/skills/debug-openshell-cluster/SKILL.md: add troubleshooting entry for lingering NFD/GFD DaemonSets on clusters deployed before this change

Testing

  • openshell gateway start --gpu — device plugin DaemonSet reaches Ready (1/1)
  • kubectl get daemonset -A | grep -E 'nfd|gfd|node-feature' — no output
  • kubectl get node -o jsonpath='{.items[0].status.allocatable}' (nvidia.com/gpu key present)
  • mise run test — no regressions
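
The allocatable check above could be scripted roughly as follows. This is a sketch that assumes kubectl prints the allocatable map as JSON. `gpu_allocatable` is a hypothetical helper, and the sample string is invented for illustration rather than captured from a real cluster.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_allocatable(allocatable_json: str) -> int:
    """Return the allocatable GPU count, or 0 if the key is absent."""
    allocatable = json.loads(allocatable_json)
    return int(allocatable.get(GPU_RESOURCE, "0"))


# Hypothetical sample of the jsonpath output, not real cluster data
sample = '{"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"}'
print(gpu_allocatable(sample))  # 1
```

A result of 0 would mean the device plugin has not registered the resource with the kubelet.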

Checklist

  • Conventional commit message
  • mise run pre-commit passed
  • Architecture docs updated
  • Debug skill updated per cluster infra change instructions in AGENTS.md

drew previously approved these changes Mar 20, 2026
… chart

Disables GPU Feature Discovery and Node Feature Discovery DaemonSets and
overrides the device plugin's default nodeAffinity to null so it schedules
unconditionally on the single-node gateway without requiring NFD/GFD labels.

Setting affinity to an empty map ({}) does not override the chart defaults
because Helm deep-merges user values with chart defaults. Using null explicitly
removes the key, causing the chart template to skip the affinity block entirely.
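
The difference between `{}` and `null` can be sketched with a simplified model of the merge behaviour described above. This is not Helm's actual implementation. `merge_values` is a hypothetical function, and `chart_defaults` uses a placeholder value in place of the real affinity rules.

```python
import copy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Deep-merge overrides onto defaults; a None override deletes the key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if value is None:
            # Mirrors "null removes the key"
            merged.pop(key, None)
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Maps are merged recursively, so {} changes nothing
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder standing in for the chart's default affinity block
chart_defaults = {"affinity": {"nodeAffinity": {"required": "gpu-labels"}}}

print(merge_values(chart_defaults, {"affinity": {}}))
# {'affinity': {'nodeAffinity': {'required': 'gpu-labels'}}}
print(merge_values(chart_defaults, {"affinity": None}))
# {}
```

With the key gone, a template guarded by a check on `affinity` renders nothing for that block.
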
@elezar force-pushed the feat/simplify-device-plugin-deployment branch from 677c6cb to cca911f March 20, 2026 15:24
@elezar marked this pull request as ready for review March 20, 2026 15:37
@elezar requested a review from a team as a code owner March 20, 2026 15:37
@pimlock merged commit dac6cd9 into main Mar 20, 2026
9 checks passed
@pimlock deleted the feat/simplify-device-plugin-deployment branch March 20, 2026 17:45