E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases #1509
Conversation
Extract reusable helper functions (getOvnContainerAffinity, getOvsAffinities, verifyOvsAffinity, verifyOvsMatchesExpected, createGuPod, waitForDeploymentReady, expectedOvsAffinity, getGuPodCPUs, collectGuCPUsFromPodList) to eliminate repeated patterns across OVS dynamic CPU affinity test cases. Add cluster.IsWorkloadPartitioningEnabled() to the shared cluster utility package, querying Infrastructure.Status.CPUPartitioning. Refactor test_id:64100, 64101, 64102, 64103, 75257 to use the new helpers, add WP-aware assertion branching, fix indentation inconsistencies, and reduce overall test noise. AIA Human-AI blend, Content edits, Human-initiated, claude-4.6-opus-high-thinking v1.0 Signed-off-by: Niranjan M.R <mniranja@redhat.com>
Add new test cases for OVS dynamic CPU affinity under workload partitioning and control plane reboot scenarios: - Skip test suite early when workerRTNode is a non-schedulable control plane node - Add control plane reboot test: verifies OVS affinity is not restricted to reserved CPUs before and after rebooting a control plane node with dynamic pinning enabled - Add workload partitioning test context with four tests: - OVN pod restricted to reserved CPUs - OVS affinity wider than OVN pod - Reserved CPUs always included in OVS affinity (with/without GU pods) - OVS falls to reserved-only when all isolated CPUs are exhausted AIA Human-AI blend, Content edits, Human-initiated, claude-4.6-opus-high-thinking v1.0 Signed-off-by: Niranjan M.R <mniranja@redhat.com>
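The new `cluster.IsWorkloadPartitioningEnabled()` helper is described here only by the field it reads (`Infrastructure.Status.CPUPartitioning`). As a rough sketch of that check — with a local `infrastructureStatus` struct standing in for the OpenShift API type, and the `"AllNodes"` literal assumed to mirror the API's partitioning mode — it might look like:

```go
package main

import "fmt"

// infrastructureStatus is a local stand-in for the OpenShift Infrastructure
// object's status; the real helper would fetch the cluster-scoped "cluster"
// Infrastructure resource through the API client.
type infrastructureStatus struct {
	CPUPartitioning string // "None" or "AllNodes"
}

// isWorkloadPartitioningEnabled reports whether workload partitioning is
// active, mirroring the check the PR description attributes to
// cluster.IsWorkloadPartitioningEnabled().
func isWorkloadPartitioningEnabled(st infrastructureStatus) bool {
	return st.CPUPartitioning == "AllNodes"
}

func main() {
	fmt.Println(isWorkloadPartitioningEnabled(infrastructureStatus{CPUPartitioning: "AllNodes"})) // true
	fmt.Println(isWorkloadPartitioningEnabled(infrastructureStatus{CPUPartitioning: "None"}))     // false
}
```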
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Walkthrough

The test suite now detects whether Workload Partitioning (WP) is enabled and adjusts CPU affinity expectations accordingly. A cluster helper to check WP was added, and the cgroups test was refactored to use shared helpers to collect and verify OVN/OVS affinities across dynamic OVS pinning, guaranteed-pod scenarios, and node/reboot lifecycle checks (including control-plane reboot coverage).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test as Test Harness
    participant API as Kubernetes API
    participant Node as Node (kubelet)
    participant OVN as OVN container
    participant OVS as OVS processes
    Test->>API: Query Infrastructure (CPUPartitioning)
    API-->>Test: WP enabled? (true/false)
    alt WP disabled
        Test->>Node: Read onlineCPUSet
        Test->>OVN: Read OVN container cpuset -> expect == onlineCPUSet
        Test->>OVS: Read OVS pids & affinities -> expect == OVN cpuset
    else WP enabled
        Test->>Node: Read reservedCPUSet, onlineCPUSet
        Test->>OVN: Read OVN container cpuset -> expect == reservedCPUSet
        Test->>Test: Create GU pods -> get guPodCPUs (union)
        Test->>OVS: Read OVS pids & affinities -> expect == onlineCPUSet \ guPodCPUs (also include reservedCPUs)
    end
    Test->>Node: Delete GU pods / Reboot node
    Node-->>Test: Node back online
    Test->>OVS: Re-read OVS affinities -> verify exclusions persist
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ❌ Failed checks (3 warnings) | ✅ Passed checks (9 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.1)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/RHsyseng/operator-utils@v1.4.13: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/go-systemd@v0.0.0-20191104093116-d3cd4ed1dbcf: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/ignition@v0.35.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/ignition/v2@v2.26.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/docker/go-units@v0.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-logr/stdr@v1.2.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/google/go-cmp@v0.7.0 ... [truncated 19340 characters] ... 
is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/legacy-cloud-providers: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/metrics: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/mount-utils: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/pod-security-admission: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-apiserver: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tgithub.com/onsi/ginkgo/v2: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mrniranjan
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Actionable comments posted: 2
🧹 Nitpick comments (3)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (3)
562-567: 💤 Low value

`BeforeEach` re-checks a value that's already cluster-stable.

`isWorkloadPartitioningEnabled` is computed once in `BeforeAll`. Using `BeforeEach` here re-evaluates the skip on every `It`, which is harmless but wasteful. Switching to `BeforeAll` makes the intent — "skip the whole context if WP is disabled" — clearer and avoids per-spec churn in test logs.

♻️ Proposed change

```diff
-Context("Workload Partitioning OVS affinity", Label(string(label.Tier2)), func() {
-	BeforeEach(func() {
+Context("Workload Partitioning OVS affinity", Label(string(label.Tier2)), Ordered, func() {
+	BeforeAll(func() {
 		if !isWorkloadPartitioningEnabled {
 			Skip("Workload partitioning is not enabled on this cluster")
 		}
 	})
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` around lines 562 - 567, The context "Workload Partitioning OVS affinity" currently uses BeforeEach to skip when the cluster-wide flag isWorkloadPartitioningEnabled is false; change this to BeforeAll so the check runs once for the whole context instead of before every spec—replace the BeforeEach hook with a BeforeAll hook that calls Skip("Workload partitioning is not enabled on this cluster") when isWorkloadPartitioningEnabled is false to avoid per-test churn in logs.
937-959: 💤 Low value

Extract the 30-second stabilisation delay into a named constant.

The same `time.Sleep(30 * time.Second)` is used in `getOvnContainerAffinity` (Line 944), `getOvsAffinities` (Line 955), and inline in the WP test (Line 597). A named constant near the top of the file documents intent (waiting for OVS dynamic pinning to converge) and avoids drift between call sites.

♻️ Proposed change

```diff
 const minRequiredCPUs = 8
+
+// affinityStabilizationDelay is the time to wait for OVS dynamic pinning
+// to converge after pod creation/deletion before sampling affinities.
+const affinityStabilizationDelay = 30 * time.Second
```

…then replace the magic literal at each site with `affinityStabilizationDelay`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` around lines 937 - 959, Introduce a named constant affinityStabilizationDelay (value 30 * time.Second) near the top of the file and replace the hard-coded time.Sleep(30 * time.Second) calls in getOvnContainerAffinity and getOvsAffinities (and the inline wait in the WP test mentioned in the comment) with time.Sleep(affinityStabilizationDelay) so the intent ("waiting for OVS dynamic pinning to converge") is documented and the delay is centralized for consistency.
588-616: ⚡ Quick win

Duplicates the `getOvnContainerAffinity`/`getOvsAffinities` helpers.

This block re-implements logic that was just refactored into helpers (Lines 937-959), including a bare `time.Sleep(30 * time.Second)`. Reusing the helpers makes intent clearer and keeps the stabilisation delay in one place.

♻️ Proposed simplification

```diff
 It("Verify OVS affinity is wider than OVN pod under workload partitioning", func() {
 	By("Get OVN container affinity")
-	ovnPod, err := ovnCnfNodePod(ctx, workerRTNode)
-	Expect(err).ToNot(HaveOccurred(), "Unable to get ovnPod")
-	ovnContainerids, err := ovnPodContainers(&ovnPod)
-	Expect(err).ToNot(HaveOccurred())
-	containerPid, err := nodes.ContainerPid(ctx, workerRTNode, ovnContainerids[0])
-	Expect(err).ToNot(HaveOccurred())
-
-	time.Sleep(30 * time.Second)
-	ctnCpuset := taskSet(ctx, containerPid, workerRTNode)
-	testlog.Infof("OVN container affinity: %s", ctnCpuset)
+	ctnCpuset := getOvnContainerAffinity(ctx, workerRTNode)
 	Expect(ctnCpuset).To(Equal(reservedCPUSet), "OVN container should be restricted to reserved CPUs under workload partitioning")

 	By("Get OVS process affinity and verify it is wider")
-	pidList, err := ovsPids(ctx, ovsSystemdServices, workerRTNode)
-	Expect(err).ToNot(HaveOccurred())
-	pidToCPUs, err := getCPUMaskForPids(ctx, pidList, workerRTNode)
-	Expect(err).ToNot(HaveOccurred())
+	pidToCPUs := getOvsAffinities(ctx, ovsSystemdServices, workerRTNode)

 	for pid, ovsAffinity := range pidToCPUs {
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` around lines 588 - 616, This test duplicates existing helpers: replace the manual OVN/OVS affinity logic and the explicit time.Sleep with calls to the refactored helpers getOvnContainerAffinity and getOvsAffinities; call getOvnContainerAffinity(ctx, workerRTNode, reservedCPUSet) to get the OVN container cpuset and assert it equals reservedCPUSet, then call getOvsAffinities(ctx, ovsSystemdServices, workerRTNode) to retrieve pid->cpus and loop the results asserting reservedCPUSet.IsSubsetOf(ovsAffinity) and that ovsAffinity.Equals(reservedCPUSet) is false when appropriate—remove the duplicated containerPid/ovnPod/ovnContainerids code and the 30s sleep so the stabilisation logic stays centralized in the helpers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 654-723: The test currently only checks
reservedCPUSet.IsSubsetOf(mask) which allows mask to still equal onlineCPUSet;
update the assertion in the OVS verification loop (using ovsAffinities from
getOvsAffinities) to require exact equality with reservedCPUSet instead of
subset—e.g. replace the IsSubsetOf Expect with an equality check
(mask.Equals(reservedCPUSet) or equivalent) and update the failure message to
state that OVS affinity must equal reserved CPUs after isolated CPUs are
exhausted.
- Around line 402-477: The test uses reservedCPUSet (parsed in BeforeAll for the
workerRTNode) but then may pick a different control-plane node (cpNode) whose
PerformanceProfile and reserved CPUs differ; update the test to obtain the
cpNode's own reserved CPU set before doing affinity checks or skip if none:
after selecting cpNode (the variable determined from workerRTNode or the
control-plane node loop), lookup the PerformanceProfile for cpNode via its
labels (same approach used in BeforeAll), parse Spec.CPU.Reserved into a
cpReservedCPUSet, and replace uses of reservedCPUSet in the mask.Equals(...)
checks and in the expectedOvsAffinity(cpOnlineCPUs, reservedCPUSet, ...) call
with cpReservedCPUSet; if no profile/reserved CPUs can be resolved for cpNode,
call Skip to avoid unsound assertions. Ensure changes reference reservedCPUSet,
workerRTNode, cpNode, expectedOvsAffinity, getOvsAffinities, and BeforeAll so
reviewers can locate the logic.
---
Nitpick comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 562-567: The context "Workload Partitioning OVS affinity"
currently uses BeforeEach to skip when the cluster-wide flag
isWorkloadPartitioningEnabled is false; change this to BeforeAll so the check
runs once for the whole context instead of before every spec—replace the
BeforeEach hook with a BeforeAll hook that calls Skip("Workload partitioning is
not enabled on this cluster") when isWorkloadPartitioningEnabled is false to
avoid per-test churn in logs.
- Around line 937-959: Introduce a named constant affinityStabilizationDelay
(value 30 * time.Second) near the top of the file and replace the hard-coded
time.Sleep(30 * time.Second) calls in getOvnContainerAffinity and
getOvsAffinities (and the inline wait in the WP test mentioned in the comment)
with time.Sleep(affinityStabilizationDelay) so the intent ("waiting for OVS
dynamic pinning to converge") is documented and the delay is centralized for
consistency.
- Around line 588-616: This test duplicates existing helpers: replace the manual
OVN/OVS affinity logic and the explicit time.Sleep with calls to the refactored
helpers getOvnContainerAffinity and getOvsAffinities; call
getOvnContainerAffinity(ctx, workerRTNode, reservedCPUSet) to get the OVN
container cpuset and assert it equals reservedCPUSet, then call
getOvsAffinities(ctx, ovsSystemdServices, workerRTNode) to retrieve pid->cpus
and loop the results asserting reservedCPUSet.IsSubsetOf(ovsAffinity) and that
ovsAffinity.Equals(reservedCPUSet) is false when appropriate—remove the
duplicated containerPid/ovnPod/ovnContainerids code and the 30s sleep so the
stabilisation logic stays centralized in the helpers.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c301f9bd-d453-41ea-a409-7d57d2b1fbfb
📒 Files selected for processing (2)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.gotest/e2e/performanceprofile/functests/utils/cluster/cluster.go
Tal-or
left a comment
Good work, couple of comments inside.
On the u/s the clusters are not configured with WP; are we going to test a cluster with WP configured d/s?
```go
Expect(err).ToNot(HaveOccurred(), "Unable to check if workload partitioning is enabled")
testlog.Infof("Workload partitioning enabled: %v", isWorkloadPartitioningEnabled)
```

```go
Expect(profile.Spec.CPU.Reserved).NotTo(BeNil())
```

```go
Expect(err).ToNot(HaveOccurred())
Expect(containerCpuset).To(Equal(onlineCPUSet), "Burstable pod containers cpuset.cpus do not match total online cpus")
if isWorkloadPartitioningEnabled {
	Expect(containerCpuset).To(Equal(reservedCPUSet),
```

Use the built-in `containerCpuset.Equals()` function

```go
	Expect(containerCpuset).To(Equal(reservedCPUSet),
		"Under workload partitioning, OVN pod cpuset.cpus should match reserved cpus, got %s expected %s", containerCpuset, reservedCPUSet)
} else {
	Expect(containerCpuset).To(Equal(onlineCPUSet),
```
```go
// wait till the ovs process affinity is reverted back
pidList, err := ovsPids(ctx, ovsSystemdServices, workerRTNode)
Expect(err).ToNot(HaveOccurred())
testlog.Infof("Deleting Deployment %s", dp.Name)
```

maybe add NS as well for better logging experience
```go
By("Verifying affinities before reboot")
collectAndVerify("before-reboot")

By("Rebooting the worker node")
```

you may consolidate with the line below.
```go
isolated := onlineCPUs.Difference(reservedCPUs)
return isolated.Difference(guPodCPUs).Union(reservedCPUs)
```

or simply:

```go
return onlineCPUs.Difference(guPodCPUs)
```
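The suggested simplification is an identity, not a behaviour change, as long as guaranteed-pod CPUs never overlap the reserved set (which the kubelet's static CPU manager guarantees). A quick check of the two formulas over a small example, with plain `map[int]bool` sets standing in for `cpuset.CPUSet`:

```go
package main

import "fmt"

type set map[int]bool

// diff returns the elements of a that are not in b.
func diff(a, b set) set {
	out := set{}
	for c := range a {
		if !b[c] {
			out[c] = true
		}
	}
	return out
}

// union returns all elements present in either set.
func union(a, b set) set {
	out := set{}
	for c := range a {
		out[c] = true
	}
	for c := range b {
		out[c] = true
	}
	return out
}

func equal(a, b set) bool {
	if len(a) != len(b) {
		return false
	}
	for c := range a {
		if !b[c] {
			return false
		}
	}
	return true
}

func main() {
	online := set{0: true, 1: true, 2: true, 3: true, 4: true, 5: true}
	reserved := set{0: true, 1: true}
	guPod := set{3: true, 4: true} // disjoint from reserved, as kubelet guarantees

	long := union(diff(diff(online, reserved), guPod), reserved)
	short := diff(online, guPod)
	fmt.Println(equal(long, short)) // both formulas agree when guPod and reserved are disjoint
}
```

If a guaranteed-pod CPU could ever land in the reserved set, the long form would add it back via the union while the short form would drop it, so the equivalence rests on that disjointness invariant.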
```go
// getOvnContainerAffinity returns the CPU affinity of the first OVN container
// on the given node, waiting 30s for affinity to stabilise.
func getOvnContainerAffinity(ctx context.Context, node *corev1.Node) cpuset.CPUSet {
```

add GinkgoHelper() in all test helpers that are throwing assertions outside the It() containers
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (1)
50-73: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Outer `workerRTNodes` is shadowed in `BeforeAll` and is never populated.

Line 68 uses `:=` inside the `BeforeAll` closure, which creates a new local `workerRTNodes` (and `err`) that shadow the outer-scope declarations from line 53/61. The local variable is correctly populated and used on line 73 to set `workerRTNode`, but the outer-scope `workerRTNodes` slice remains `nil` for the lifetime of the suite.

Consequently, the `for _, node := range workerRTNodes` loops at lines 117–122 and 192–196 never execute their bodies — both test assertions are vacuous.

🐛 Proposed fix — use `=` for the already-declared outer variables

```diff
-workerRTNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
+var localErr error
+workerRTNodes, localErr = nodes.GetByLabels(testutils.NodeSelectorLabels)
+Expect(localErr).ToNot(HaveOccurred())

-workerRTNodes, err = nodes.MatchingOptionalSelector(workerRTNodes)
-Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("error looking for the optional selector: %v", err))
+workerRTNodes, localErr = nodes.MatchingOptionalSelector(workerRTNodes)
+Expect(localErr).ToNot(HaveOccurred(), fmt.Sprintf("error looking for the optional selector: %v", localErr))
```

Or, simpler — because `err` is already declared in the outer scope, drop the `:` and use plain assignment:

```diff
-workerRTNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
+workerRTNodes, err = nodes.GetByLabels(testutils.NodeSelectorLabels)
 Expect(err).ToNot(HaveOccurred())
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` around lines 50 - 73, The BeforeAll closure is accidentally shadowing the outer workerRTNodes and err by using :=; change the assignments in the BeforeAll block to use plain = so the already-declared outer-scope variables (workerRTNodes, err) are populated (specifically update the nodes.GetByLabels(...) and nodes.MatchingOptionalSelector(...) lines to use = instead of :=), ensuring the outer workerRTNodes slice is available for the later for _, node := range workerRTNodes loops and workerRTNode assignment.
🧹 Nitpick comments (1)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (1)
942-966: ⚡ Quick win

Hardcoded `time.Sleep(30s)` in `getOvnContainerAffinity` and `getOvsAffinities` is both fragile and slow.

Both helpers unconditionally sleep 30 seconds "for affinity to stabilise." This sleep fires on every call — including the multiple calls inside `collectAndVerify` and the WP tests — adding 60–120+ seconds of artificial delay per test run. If stabilization takes longer than 30 s the tests remain flaky; if it's faster the sleep is wasted.

Replace with an `Eventually` poll on the affinity value itself:

♻️ Proposed refactor for `getOvsAffinities`

```diff
 func getOvsAffinities(ctx context.Context, services []string, node *corev1.Node) map[string]cpuset.CPUSet {
 	GinkgoHelper()
 	pidList, err := ovsPids(ctx, services, node)
 	Expect(err).ToNot(HaveOccurred())
-	time.Sleep(30 * time.Second)
-	affinities, err := getCPUMaskForPids(ctx, pidList, node)
-	Expect(err).ToNot(HaveOccurred())
-	return affinities
+	var affinities map[string]cpuset.CPUSet
+	Eventually(func() error {
+		var e error
+		affinities, e = getCPUMaskForPids(ctx, pidList, node)
+		return e
+	}, 2*time.Minute, 5*time.Second).Should(Succeed())
+	return affinities
 }
```

Apply the same pattern to `getOvnContainerAffinity` and remove the `time.Sleep(30 * time.Second)` at line 591.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` around lines 942 - 966, The fixed code should remove the hardcoded time.Sleep(30 * time.Second) calls in getOvnContainerAffinity and getOvsAffinities and instead poll until the CPU affinity stabilises using Gomega's Eventually: for getOvsAffinities, repeatedly call ovsPids(ctx, services, node) and getCPUMaskForPids(ctx, pidList, node) (or a helper that returns the computed map) inside an Eventually block with a reasonable timeout/interval until the returned affinities stop changing; for getOvnContainerAffinity, poll the sequence ovnCnfNodePod(ctx, node) -> ovnPodContainers(&ovnPod) -> nodes.ContainerPid(ctx, node, containerId) -> taskSet(ctx, pid, node) inside Eventually until the affinity equals a previous value (stable) instead of sleeping; remove the Sleep lines and ensure Expect checks remain outside or inside Eventually as appropriate so tests fail fast on errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 1033-1040: The waitForDeploymentReady function currently only
checks container readiness for podList.Items[0], which misses other replicas;
update the readiness check to iterate over all pods in podList.Items and for
each pod iterate over pod.Status.ContainerStatuses ensuring every container is
Ready before returning true (i.e., only return true when all pods' containers
are ready); this will ensure subsequent calls like getGuPodCPUs observe all
replicas ready.
- Around line 932-938: The function expectedOvsAffinity has an unused parameter
reservedCPUs; remove reservedCPUs from the signature (change
expectedOvsAffinity(onlineCPUs, guPodCPUs cpuset.CPUSet) cpuset.CPUSet) and
update every call site that passes three args to pass only onlineCPUs and
guPodCPUs instead, ensuring the function body remains return
onlineCPUs.Difference(guPodCPUs); also update any references/imports or test
helpers that assert the old signature to match the new two-parameter form.
- Around line 1046-1058: The function collectGuCPUsFromPodList has an unused
parameter namespace (the namespace is provided via listOpts.Namespace); remove
the namespace parameter from its signature and from all call sites (e.g., where
collectGuCPUsFromPodList is invoked around the mentioned call site) so callers
pass only (ctx, listOpts), and update any references to match the new signature;
ensure build and tests pass after the change.
---
Outside diff comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 50-73: The BeforeAll closure is accidentally shadowing the outer
workerRTNodes and err by using :=; change the assignments in the BeforeAll block
to use plain = so the already-declared outer-scope variables (workerRTNodes,
err) are populated (specifically update the nodes.GetByLabels(...) and
nodes.MatchingOptionalSelector(...) lines to use = instead of :=), ensuring the
outer workerRTNodes slice is available for the later for _, node := range
workerRTNodes loops and workerRTNode assignment.
---
Nitpick comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 942-966: The fixed code should remove the hardcoded time.Sleep(30
* time.Second) calls in getOvnContainerAffinity and getOvsAffinities and instead
poll until the CPU affinity stabilises using Gomega's Eventually: for
getOvsAffinities, repeatedly call ovsPids(ctx, services, node) and
getCPUMaskForPids(ctx, pidList, node) (or a helper that returns the computed
map) inside an Eventually block with a reasonable timeout/interval until the
returned affinities stop changing; for getOvnContainerAffinity, poll the
sequence ovnCnfNodePod(ctx, node) -> ovnPodContainers(&ovnPod) ->
nodes.ContainerPid(ctx, node, containerId) -> taskSet(ctx, pid, node) inside
Eventually until the affinity equals a previous value (stable) instead of
sleeping; remove the Sleep lines and ensure Expect checks remain outside or
inside Eventually as appropriate so tests fail fast on errors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b8924fa3-bc0f-46aa-81b9-12917df83050
📒 Files selected for processing (2)
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.gotest/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
```go
if len(podList.Items) == 0 {
	return false
}
for _, s := range podList.Items[0].Status.ContainerStatuses {
	if !s.Ready {
		return false
	}
}
```
waitForDeploymentReady only verifies the first pod's container readiness.
The loop at line 1036 iterates podList.Items[0].Status.ContainerStatuses, so for replicas=2 the second pod's containers are never checked. The function can return while only one of two pods is truly ready, causing subsequent getGuPodCPUs calls on the second pod to read an incorrect or unset affinity.
🐛 Proposed fix — check all pods in the list

```diff
-	for _, s := range podList.Items[0].Status.ContainerStatuses {
-		if !s.Ready {
-			return false
-		}
-	}
+	for i := range podList.Items {
+		for _, s := range podList.Items[i].Status.ContainerStatuses {
+			if !s.Ready {
+				return false
+			}
+		}
+	}
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```go
if len(podList.Items) == 0 {
	return false
}
for i := range podList.Items {
	for _, s := range podList.Items[i].Status.ContainerStatuses {
		if !s.Ready {
			return false
		}
	}
}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 1033 - 1040, The waitForDeploymentReady function currently only
checks container readiness for podList.Items[0], which misses other replicas;
update the readiness check to iterate over all pods in podList.Items and for
each pod iterate over pod.Status.ContainerStatuses ensuring every container is
Ready before returning true (i.e., only return true when all pods' containers
are ready); this will ensure subsequent calls like getGuPodCPUs observe all
replicas ready.
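The review's point can be distilled to a tiny stand-alone check: readiness must be the conjunction over every pod, not just the first. Local structs below stand in for the corev1 types:

```go
package main

import "fmt"

type containerStatus struct{ Ready bool }
type pod struct{ ContainerStatuses []containerStatus }

// allPodsReady returns true only when every container of every pod is Ready,
// the behaviour the review asks waitForDeploymentReady to adopt.
func allPodsReady(pods []pod) bool {
	if len(pods) == 0 {
		return false
	}
	for _, p := range pods {
		for _, s := range p.ContainerStatuses {
			if !s.Ready {
				return false
			}
		}
	}
	return true
}

func main() {
	ready := pod{ContainerStatuses: []containerStatus{{Ready: true}}}
	notReady := pod{ContainerStatuses: []containerStatus{{Ready: false}}}

	// Checking only pods[0] would wrongly report this 2-replica set as ready.
	fmt.Println(allPodsReady([]pod{ready, notReady})) // false
	fmt.Println(allPodsReady([]pod{ready, ready}))    // true
}
```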
/retest
- Add GinkgoHelper() to all test helper functions that perform assertions, so failures report the caller location
- Use cpuset.Equals() for cpuset comparisons instead of struct equality
- Refactor CP reboot test to assert against online CPUs directly, avoiding dependency on worker profile's reservedCPUSet which may not apply to the CP node on MNO clusters
- Fix exhaustion test: rename to reflect it verifies GU pod exclusion (not full exhaustion), compute exact expected affinity using expectedOvsAffinity, and assert equality instead of weak subset check
- Simplify expectedOvsAffinity to onlineCPUs.Difference(guPodCPUs)
- Fix indentation to match surrounding test blocks
df26d1b to cd213c0
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (1)
68-68: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`workerRTNodes` outer variable is never assigned due to `:=` shadowing in the `BeforeAll` closure.

Line 68 uses short variable declaration (`:=`), creating a new local `workerRTNodes` variable scoped to the `BeforeAll` closure. The outer `workerRTNodes` (declared at line 53) remains `nil`, causing the test loops at lines 117 and 192 to iterate zero times, making those assertions vacuous.

Change line 68 to use `=` instead:

🐛 Proposed fix

```diff
-workerRTNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
+workerRTNodes, err = nodes.GetByLabels(testutils.NodeSelectorLabels)
```

The variable `err` is already declared in the outer `var` block (line 22), so no additional declaration is needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` at line 68, The outer variable workerRTNodes is left nil because the BeforeAll closure uses short-declaration (:=) which shadows it; change the declaration in the BeforeAll where nodes.GetByLabels(testutils.NodeSelectorLabels) is called to an assignment (use = not :=) so the outer workerRTNodes is set, and reuse the already-declared err variable instead of redeclaring it; locate the nodes.GetByLabels call inside the BeforeAll closure and replace := with = so subsequent loops over workerRTNodes in the test (e.g., later assertions) operate on the populated slice.
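The `:=` pitfall the review describes is easy to reproduce in isolation: inside a closure, short declaration creates new locals that shadow the outer variables, so the outer slice is never populated. Here `get` is a stand-in for `nodes.GetByLabels`:

```go
package main

import "fmt"

func get() ([]string, error) { return []string{"node-a", "node-b"}, nil }

func main() {
	var workerRTNodes []string
	var err error

	// Mirrors the buggy BeforeAll: `:=` declares NEW locals that shadow
	// the outer variables, which stay nil/zero.
	func() {
		workerRTNodes, err := get()
		_ = err
		fmt.Println("inside closure:", len(workerRTNodes)) // 2
	}()
	fmt.Println("outer after := :", len(workerRTNodes)) // 0 — never populated

	// The fix: plain assignment writes through to the outer variables.
	func() {
		workerRTNodes, err = get()
	}()
	fmt.Println("outer after = :", len(workerRTNodes), err) // 2 <nil>
}
```

This is also the kind of shadowing `go vet -vettool` shadow analysis or golangci-lint's `govet`/`shadow` checker flags, which would have caught the vacuous loops earlier.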
♻️ Duplicate comments (3)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (3)
1046-1058: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

`namespace` parameter in `collectGuCPUsFromPodList` is still unused — not addressed.

The `namespace` argument is never referenced in the function body; the namespace is already carried by `listOpts.Namespace`.

♻️ Proposed fix

```diff
-func collectGuCPUsFromPodList(ctx context.Context, namespace string, listOpts *client.ListOptions) cpuset.CPUSet {
+func collectGuCPUsFromPodList(ctx context.Context, listOpts *client.ListOptions) cpuset.CPUSet {
```

Update the call site at line 378:

```diff
-guCPUs := collectGuCPUsFromPodList(ctx, dp.Namespace, dpListOpts)
+guCPUs := collectGuCPUsFromPodList(ctx, dpListOpts)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` around lines 1046 - 1058, The function collectGuCPUsFromPodList currently ignores its namespace parameter; fix it by ensuring the list options actually use that namespace before calling testclient.DataPlaneClient.List: if listOpts.Namespace is empty or different, create or copy listOpts and set its Namespace to the namespace argument (so List will query the intended namespace), then proceed to list pods and union their CPUs as before; update any callers only if they depend on the previous behavior.
1033-1040:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
`waitForDeploymentReady` still only checks the first pod's container statuses — not addressed.

The loop at line 1036 iterates `podList.Items[0].Status.ContainerStatuses` regardless of `replicas`. For a 2-replica deployment, the second pod's containers are never checked, so the function can return while only one pod is truly Ready.

🐛 Proposed fix

```diff
-	for _, s := range podList.Items[0].Status.ContainerStatuses {
-		if !s.Ready {
-			return false
-		}
-	}
+	for i := range podList.Items {
+		for _, s := range podList.Items[i].Status.ContainerStatuses {
+			if !s.Ready {
+				return false
+			}
+		}
+	}
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` around lines 1033 - 1040, In waitForDeploymentReady, the readiness check only inspects podList.Items[0].Status.ContainerStatuses; change the logic to iterate over every pod in podList.Items and for each pod iterate over pod.Status.ContainerStatuses to ensure all containers in all pods are Ready (still return false if podList.Items is empty or any container is not Ready), keeping the existing early-return behavior and using the existing podList.Items and ContainerStatuses references so all replicas are validated.
936-938: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
`reservedCPUs` parameter in `expectedOvsAffinity` is still unused — not addressed.

The `reservedCPUs` argument is accepted but never referenced in the function body, making every call site (lines 714, 987) silently pass a value that has no effect.

♻️ Proposed fix

```diff
-func expectedOvsAffinity(onlineCPUs, reservedCPUs, guPodCPUs cpuset.CPUSet) cpuset.CPUSet {
-	return onlineCPUs.Difference(guPodCPUs)
+func expectedOvsAffinity(onlineCPUs, guPodCPUs cpuset.CPUSet) cpuset.CPUSet {
+	return onlineCPUs.Difference(guPodCPUs)
 }
```

And update call sites at lines 714 and 987:

```diff
-expected := expectedOvsAffinity(onlineCPUSet, reservedCPUSet, guPodCPUs)
+expected := expectedOvsAffinity(onlineCPUSet, guPodCPUs)
```

```diff
-expected := expectedOvsAffinity(onlineCPUs, reservedCPUs, guCPUs)
+expected := expectedOvsAffinity(onlineCPUs, guCPUs)
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` around lines 936 - 938, The function expectedOvsAffinity currently ignores its reservedCPUs parameter; update its implementation to exclude reservedCPUs from the resulting CPU set (e.g., return onlineCPUs.Difference(guPodCPUs).Difference(reservedCPUs) or return onlineCPUs.Difference(guPodCPUs.Union(reservedCPUs))) so the reserved CPUs are not assigned to OVS, and leave the function signature unchanged so existing callers of expectedOvsAffinity continue to work without changes.
🧹 Nitpick comments (3)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (3)
981-991: ⚡ Quick win

Non-WP path in `verifyOvsMatchesExpected` ignores `guCPUs` — misleading test semantics.

When `isWP=false`, the function asserts `OVS affinity == ovnAffinity`. Under non-WP, `ovnAffinity == onlineCPUSet`, so the assertion reduces to `OVS affinity == onlineCPUSet` regardless of whether GU pods are present. This means `test_id:64101` ("Creating gu pods modifies affinity of ovs") and `test_id:64102` do not actually verify any change in OVS affinity on non-WP clusters — the test passes vacuously in that mode.

If dynamic OVS pinning is not expected to change OVS affinity under non-WP, the test descriptions and the `guCPUs` parameter to `verifyOvsMatchesExpected` are misleading and the unused parameter should be removed from the non-WP branch. If OVS affinity is expected to change under non-WP, the assertion formula should account for `guCPUs` there too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` around lines 981 - 991, The non-WP branch of verifyOvsMatchesExpected currently ignores guCPUs and just compares OVS to ovnAffinity, which makes tests vacuous; change the else branch to compute expected := expectedOvsAffinity(onlineCPUs, reservedCPUs, guCPUs) and call verifyOvsAffinity(ovsAffinities, expected) (instead of verifyOvsAffinity(ovsAffinities, ovnAffinity)) so non-WP behavior accounts for guCPUs; alternatively, if OVS should not change under non-WP, remove/stop passing the guCPUs parameter and update the test description accordingly — but the quick fix is to use expectedOvsAffinity(...) in the else branch of verifyOvsMatchesExpected.
591: ⚡ Quick win

Another hardcoded `time.Sleep(30 * time.Second)` directly in a test body.

Line 591 sleeps unconditionally inside the `It` block before reading `ctnCpuset`. The same concerns apply as for the helper-level sleeps: unreliable on slow clusters, always wasteful on fast ones. This particular call also duplicates the 30-second wait that `getOvnContainerAffinity` would already impose if that helper were used here instead. Consider either using `Eventually` to poll `taskSet` until the affinity stabilises, or unifying the wait with the helper.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` at line 591, Replace the hardcoded time.Sleep(30 * time.Second) inside the It block with a deterministic wait: either call the existing helper getOvnContainerAffinity (which already contains the required wait) instead of the manual sleep, or use Gomega.Eventually to poll the taskSet/ctnCpuset until the container CPU set/affinity stabilizes (e.g. assert ctnCpuset equals the expected value within a timeout). Locate the sleep in the test body where ctnCpuset is read, and modify the code to poll taskSet/ctnCpuset (or delegate to getOvnContainerAffinity) so the test is robust on slow clusters and not wasteful on fast ones.
950-962: ⚡ Quick win

Unconditional 30-second sleeps in shared helpers will compound across every test call.

Both `getOvnContainerAffinity` (line 950) and `getOvsAffinities` (line 962) each `time.Sleep(30 * time.Second)`. Tests that call both helpers spend 60 s sleeping per invocation pair; `test_id:64102` calls them twice (before and after pod deletion) — 120 s of fixed sleep per run regardless of cluster speed. This pattern is also unreliable: fast clusters waste 30 s, slow clusters may still read stale state.

Replace with `Eventually`-based polling on the affinity value, or at minimum centralise the single wait in the call site rather than burying it in each helper.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go` around lines 950 - 962, Both getOvnContainerAffinity and getOvsAffinities contain unconditional time.Sleep(30 * time.Second) calls causing compounded fixed delays; remove those sleeps and replace with polling (Gomega Eventually or a shared WaitForAffinityStable helper) that repeatedly calls the underlying probes (taskSet for getOvnContainerAffinity, ovsPids/affinity reader for getOvsAffinities) until the affinity stabilises or a timeout is reached, or alternatively centralise a single wait in the test caller instead of sleeping inside these helpers so callers control timeout and frequency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 443-449: The loops over the maps returned by getOvsAffinities
(e.g., ovsBeforeReboot and ovsAfterReboot) can vacuously succeed when the map is
empty; add an explicit non-empty guard before each loop by asserting the map
length is > 0 (e.g., Expect(len(ovsBeforeReboot)).To(BeNumerically(">", 0), "no
OVS affinities found for %v on node %v; check %v", cpOvsServices, cpNode,
cpOvsServices)) so the test fails fast when no OVS PIDs were discovered; apply
the same check for all other getOvsAffinities consumers referenced (the blocks
using ovsBeforeReboot, ovsAfterReboot and the other call sites around lines
noted) so the subsequent per-PID Expect(mask.Equals(...)) assertions actually
run.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6f9a886a-e21b-435e-a245-31c24902c58d
📒 Files selected for processing (1)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
Verify ovs affinity map is not empty Signed-off-by: Niranjan M.R <mniranja@redhat.com>
@mrniranjan: all tests passed! Full PR test history. Your PR dashboard.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
This PR has 2 commits
Code was generated using AI-Assisted:
AIA Human-AI blend, Content edits, Human-initiated, claude-4.6-opus-high-thinking