
E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases. #1509

Open

mrniranjan wants to merge 4 commits into openshift:main from mrniranjan:ovs_dynamic_pinning_wp

Conversation

@mrniranjan
Contributor

@mrniranjan mrniranjan commented May 6, 2026

This PR has 2 commits

  1. The first commit refactors some of the existing code for readability and reuse.
  2. The second commit adds new test cases related to workload partitioning.

Code was generated using AI-Assisted:
AIA Human-AI blend, Content edits, Human-initiated, claude-4.6-opus-high-thinking

Summary by CodeRabbit

  • Tests
    • Enhanced performance-profile tests for Workload Partitioning with adjusted CPU-affinity expectations for enabled/disabled states.
    • Refactored and expanded Open vSwitch pinning scenarios using shared helpers, including ovs-vswitchd thread affinity checks and larger multi-pod pinning scenarios.
    • Added persistence checks across pod deletion, node and control-plane reboot.
    • Added cluster utility to detect Workload Partitioning status.

mrniranjan added 2 commits May 6, 2026 18:56
Extract reusable helper functions (getOvnContainerAffinity,
getOvsAffinities, verifyOvsAffinity, verifyOvsMatchesExpected,
createGuPod, waitForDeploymentReady, expectedOvsAffinity,
getGuPodCPUs, collectGuCPUsFromPodList) to eliminate repeated
patterns across OVS dynamic CPU affinity test cases.

Add cluster.IsWorkloadPartitioningEnabled() to the shared cluster
utility package, querying Infrastructure.Status.CPUPartitioning.

Refactor test_id:64100, 64101, 64102, 64103, 75257 to use the new
helpers, add WP-aware assertion branching, fix indentation
inconsistencies, and reduce overall test noise.

AIA Human-AI blend, Content edits, Human-initiated, claude-4.6-opus-high-thinking v1.0

Signed-off-by: Niranjan M.R <mniranja@redhat.com>
Add new test cases for OVS dynamic CPU affinity under workload
partitioning and control plane reboot scenarios:

- Skip test suite early when workerRTNode is a non-schedulable
  control plane node

- Add control plane reboot test: verifies OVS affinity is not
  restricted to reserved CPUs before and after rebooting a control
  plane node with dynamic pinning enabled

- Add workload partitioning test context with four tests:
  - OVN pod restricted to reserved CPUs
  - OVS affinity wider than OVN pod
  - Reserved CPUs always included in OVS affinity (with/without GU pods)
  - OVS falls to reserved-only when all isolated CPUs are exhausted

AIA Human-AI blend, Content edits, Human-initiated, claude-4.6-opus-high-thinking v1.0

Signed-off-by: Niranjan M.R <mniranja@redhat.com>
@coderabbitai

coderabbitai Bot commented May 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f6868581-dde4-4094-b537-a35e426f2731

📥 Commits

Reviewing files that changed from the base of the PR and between cd213c0 and b02b822.

📒 Files selected for processing (1)
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go

Walkthrough

The test suite now detects whether Workload Partitioning (WP) is enabled and adjusts CPU affinity expectations accordingly. A cluster helper to check WP was added, and the cgroups test was refactored to use shared helpers to collect and verify OVN/OVS affinities across dynamic OVS pinning, guaranteed-pod scenarios, and node/reboot lifecycle checks (including control-plane reboot coverage).
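A minimal sketch of the WP-aware assertion branch described here, using the variable names quoted later in this review thread (illustrative, not the exact diff):

if isWorkloadPartitioningEnabled {
	// Under workload partitioning, the OVN container is pinned to the reserved CPUs.
	Expect(containerCpuset).To(Equal(reservedCPUSet),
		"Under workload partitioning, OVN pod cpuset.cpus should match reserved cpus, got %s expected %s", containerCpuset, reservedCPUSet)
} else {
	// Without workload partitioning, the OVN container may run on any online CPU.
	Expect(containerCpuset).To(Equal(onlineCPUSet),
		"OVN pod cpuset.cpus should match online cpus, got %s expected %s", containerCpuset, onlineCPUSet)
}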

Changes

• Workload Partitioning Detection — test/e2e/performanceprofile/functests/utils/cluster/cluster.go
  Added IsWorkloadPartitioningEnabled(ctx) to read the cluster Infrastructure CPUPartitioning setting and report whether WP is enabled.
• CPU Affinity Test Refactoring & Expansion — test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
  Refactored OVN/OVS affinity assertions to branch on WP state (OVN: reservedCPUSet when WP enabled, otherwise onlineCPUSet). Introduced helpers to collect the OVN container cpuset, collect OVS service pids and affinities, compute expected OVS affinities (non-WP: match OVN; WP: onlineCPUSet \ guPodCPUs), and verify affinities. Expanded the dynamic OVS pinning scenarios: create GU pods, compute the union of GU-pinned CPUs, assert OVS excludes GU CPUs, persist checks across pod deletion and node reboot, add control-plane reboot coverage, and update the ovs-vswitchd thread affinity tests to use the shared helpers. A sketch of the new cluster helper follows.
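A self-contained sketch of what the helper can look like. The walkthrough says the PR's version takes only ctx and uses the suite's shared client; the explicit client parameter here is an addition so the snippet compiles on its own:

package cluster

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// IsWorkloadPartitioningEnabled reports whether the cluster was installed
// with CPU partitioning, by reading Status.CPUPartitioning from the
// singleton Infrastructure object named "cluster".
func IsWorkloadPartitioningEnabled(ctx context.Context, c client.Client) (bool, error) {
	infra := &configv1.Infrastructure{}
	if err := c.Get(ctx, client.ObjectKey{Name: "cluster"}, infra); err != nil {
		return false, err
	}
	return infra.Status.CPUPartitioning == configv1.CPUPartitioningAllNodes, nil
}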

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Harness
    participant API as Kubernetes API
    participant Node as Node (kubelet)
    participant OVN as OVN container
    participant OVS as OVS processes

    Test->>API: Query Infrastructure (CPUPartitioning)
    API-->>Test: WP enabled? (true/false)

    alt WP disabled
        Test->>Node: Read onlineCPUSet
        Test->>OVN: Read OVN container cpuset -> expect == onlineCPUSet
        Test->>OVS: Read OVS pids & affinities -> expect == OVN cpuset
    else WP enabled
        Test->>Node: Read reservedCPUSet, onlineCPUSet
        Test->>OVN: Read OVN container cpuset -> expect == reservedCPUSet
        Test->>Test: Create GU pods -> get guPodCPUs (union)
        Test->>OVS: Read OVS pids & affinities -> expect == onlineCPUSet \ guPodCPUs (also include reservedCPUs)
    end

    Test->>Node: Delete GU pods / Reboot node
    Node-->>Test: Node back online
    Test->>OVS: Re-read OVS affinities -> verify exclusions persist

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (3 warnings)

• Test Structure And Quality — ⚠️ Warning. Missing assertion messages (5 instances), hardcoded time.Sleep waits instead of Eventually (3 instances), and test 64102 tests multiple behaviors. Resolution: add messages to the assertions at lines 273, 279, 286, 690, 975; replace the time.Sleep calls at lines 593, 952, 964 with Eventually blocks; split test 64102 into separate tests.
• Microshift Test Compatibility — ⚠️ Warning. IsWorkloadPartitioningEnabled() calls the Infrastructure API (config.openshift.io/v1) in BeforeAll(), which is unavailable on MicroShift and breaks all tests in the file. Resolution: add [apigroup:config.openshift.io] to the outer Describe, guard the call with an exutil.IsMicroShiftCluster() Skip, or move the call into the contexts that need it (see the sketch below).
• Single Node Openshift (Sno) Test Compatibility — ⚠️ Warning. The test at line 407 assumes multi-node topology by iterating control-plane nodes (lines 420-429); no [Skipped:SingleReplicaTopology] label or SNO runtime check is present. Resolution: add the [Skipped:SingleReplicaTopology] label to the test name, or guard with an SNO runtime check such as exutil.IsSingleNode() or an infrastructure topology check with g.Skip() (see the sketch below).
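For illustration, a hedged sketch of the two suggested guards. The helper names (exutil.IsMicroShiftCluster, exutil.IsSingleNode, g.Skip) are taken from the resolutions above; their exact signatures in this repository are an assumption, not verified:

// Hypothetical MicroShift guard, placed before the Infrastructure API call in BeforeAll().
if isMicroShift, err := exutil.IsMicroShiftCluster(kubeClient); err == nil && isMicroShift { // signature assumed
	g.Skip("Infrastructure API (config.openshift.io/v1) is unavailable on MicroShift")
}

// Hypothetical SNO guard for the control-plane reboot test.
if exutil.IsSingleNode() { // signature assumed
	g.Skip("control-plane reboot requires a multi-node topology")
}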
✅ Passed checks (9 passed)
• Description Check — ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
• Title Check — ✅ Passed. The title accurately describes the main changes: refactoring OVS affinity tests into shared helpers and adding workload-partitioning and control-plane reboot test cases, which align with the detailed code changes.
• Docstring Coverage — ✅ Passed. Docstring coverage is 94.12%, which clears the required threshold of 80.00%.
• Linked Issues Check — ✅ Passed. Check skipped; no linked issues were found for this pull request.
• Out of Scope Changes Check — ✅ Passed. Check skipped; no linked issues were found for this pull request.
• Stable And Deterministic Test Names — ✅ Passed. All Ginkgo test names are stable and deterministic; no dynamic values such as pod names, timestamps, UUIDs, node names, IPs, or generated suffixes appear in any test titles.
• Topology-Aware Scheduling Compatibility — ✅ Passed. The PR modifies test code only; no manifests, operator code, or controllers are affected, and test pods have no topology-unaware constraints.
• Ote Binary Stdout Contract — ✅ Passed. The new code in cluster.go and cgroups.go contains no process-level stdout writes, fmt.Print*/log.Print*/klog calls, or OTE stdout contract violations.
• Ipv6 And Disconnected Network Test Compatibility — ✅ Passed. Tests have no IPv4 assumptions or external connectivity needs; the image registry is environment-configurable for disconnected CI.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.1)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/RHsyseng/operator-utils@v1.4.13: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/go-systemd@v0.0.0-20191104093116-d3cd4ed1dbcf: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/ignition@v0.35.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/ignition/v2@v2.26.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/docker/go-units@v0.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-logr/stdr@v1.2.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/google/go-cmp@v0.7.0

... [truncated 19340 characters] ...

is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/legacy-cloud-providers: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/metrics: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/mount-utils: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/pod-security-admission: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-apiserver: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tgithub.com/onsi/ginkgo/v2: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from Tal-or and swatisehgal May 6, 2026 13:33
@mrniranjan mrniranjan changed the title from "AI-Assisted: E2E: Refactor ovs dynamic pinning test case and add new test cases related to schedulable control plane and workload partitioning" to "AI-Assisted: E2E: Refactor OVS affinity e2e tests into shared helpers and add new workload-partitioning and control-plane reboot test cases." on May 6, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mrniranjan
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mrniranjan mrniranjan changed the title from "AI-Assisted: E2E: Refactor OVS affinity e2e tests into shared helpers and add new workload-partitioning and control-plane reboot test cases." to "AI-Assisted: E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases." on May 6, 2026
@mrniranjan mrniranjan changed the title from "AI-Assisted: E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases." to "E2E: Refactor OVS affinity tests into shared helpers and add new workload-partitioning and control-plane reboot test cases." on May 6, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (3)

562-567: 💤 Low value

BeforeEach re-checks a value that's already cluster-stable.

isWorkloadPartitioningEnabled is computed once in BeforeAll. Using BeforeEach here re-evaluates the skip on every It, which is harmless but wasteful. Switching to BeforeAll makes the intent — "skip the whole context if WP is disabled" — clearer and avoids per-spec churn in test logs.

♻️ Proposed change
-		Context("Workload Partitioning OVS affinity", Label(string(label.Tier2)), func() {
-			BeforeEach(func() {
+		Context("Workload Partitioning OVS affinity", Label(string(label.Tier2)), Ordered, func() {
+			BeforeAll(func() {
 				if !isWorkloadPartitioningEnabled {
 					Skip("Workload partitioning is not enabled on this cluster")
 				}
 			})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 562 - 567, The context "Workload Partitioning OVS affinity"
currently uses BeforeEach to skip when the cluster-wide flag
isWorkloadPartitioningEnabled is false; change this to BeforeAll so the check
runs once for the whole context instead of before every spec—replace the
BeforeEach hook with a BeforeAll hook that calls Skip("Workload partitioning is
not enabled on this cluster") when isWorkloadPartitioningEnabled is false to
avoid per-test churn in logs.

937-959: 💤 Low value

Extract the 30-second stabilisation delay into a named constant.

The same time.Sleep(30 * time.Second) is used in getOvnContainerAffinity (Line 944), getOvsAffinities (Line 955), and inline in the WP test (Line 597). A named constant near the top of the file documents intent (waiting for OVS dynamic pinning to converge) and avoids drift between call sites.

♻️ Proposed change
 const minRequiredCPUs = 8
+
+// affinityStabilizationDelay is the time to wait for OVS dynamic pinning
+// to converge after pod creation/deletion before sampling affinities.
+const affinityStabilizationDelay = 30 * time.Second

…then replace the magic literal at each site with affinityStabilizationDelay.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 937 - 959, Introduce a named constant affinityStabilizationDelay
(value 30 * time.Second) near the top of the file and replace the hard-coded
time.Sleep(30 * time.Second) calls in getOvnContainerAffinity and
getOvsAffinities (and the inline wait in the WP test mentioned in the comment)
with time.Sleep(affinityStabilizationDelay) so the intent ("waiting for OVS
dynamic pinning to converge") is documented and the delay is centralized for
consistency.

588-616: ⚡ Quick win

Duplicates the getOvnContainerAffinity / getOvsAffinities helpers.

This block re-implements logic that was just refactored into helpers (Lines 937-959), including a bare time.Sleep(30 * time.Second). Reusing the helpers makes intent clearer and keeps the stabilisation delay in one place.

♻️ Proposed simplification
 		It("Verify OVS affinity is wider than OVN pod under workload partitioning", func() {
 			By("Get OVN container affinity")
-			ovnPod, err := ovnCnfNodePod(ctx, workerRTNode)
-			Expect(err).ToNot(HaveOccurred(), "Unable to get ovnPod")
-			ovnContainerids, err := ovnPodContainers(&ovnPod)
-			Expect(err).ToNot(HaveOccurred())
-			containerPid, err := nodes.ContainerPid(ctx, workerRTNode, ovnContainerids[0])
-			Expect(err).ToNot(HaveOccurred())
-
-			time.Sleep(30 * time.Second)
-			ctnCpuset := taskSet(ctx, containerPid, workerRTNode)
-			testlog.Infof("OVN container affinity: %s", ctnCpuset)
+			ctnCpuset := getOvnContainerAffinity(ctx, workerRTNode)
 			Expect(ctnCpuset).To(Equal(reservedCPUSet),
 				"OVN container should be restricted to reserved CPUs under workload partitioning")
 
 			By("Get OVS process affinity and verify it is wider")
-			pidList, err := ovsPids(ctx, ovsSystemdServices, workerRTNode)
-			Expect(err).ToNot(HaveOccurred())
-			pidToCPUs, err := getCPUMaskForPids(ctx, pidList, workerRTNode)
-			Expect(err).ToNot(HaveOccurred())
+			pidToCPUs := getOvsAffinities(ctx, ovsSystemdServices, workerRTNode)
 
 			for pid, ovsAffinity := range pidToCPUs {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 588 - 616, This test duplicates existing helpers: replace the
manual OVN/OVS affinity logic and the explicit time.Sleep with calls to the
refactored helpers getOvnContainerAffinity and getOvsAffinities; call
getOvnContainerAffinity(ctx, workerRTNode, reservedCPUSet) to get the OVN
container cpuset and assert it equals reservedCPUSet, then call
getOvsAffinities(ctx, ovsSystemdServices, workerRTNode) to retrieve pid->cpus
and loop the results asserting reservedCPUSet.IsSubsetOf(ovsAffinity) and that
ovsAffinity.Equals(reservedCPUSet) is false when appropriate—remove the
duplicated containerPid/ovnPod/ovnContainerids code and the 30s sleep so the
stabilisation logic stays centralized in the helpers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 654-723: The test currently only checks
reservedCPUSet.IsSubsetOf(mask) which allows mask to still equal onlineCPUSet;
update the assertion in the OVS verification loop (using ovsAffinities from
getOvsAffinities) to require exact equality with reservedCPUSet instead of
subset—e.g. replace the IsSubsetOf Expect with an equality check
(mask.Equals(reservedCPUSet) or equivalent) and update the failure message to
state that OVS affinity must equal reserved CPUs after isolated CPUs are
exhausted.
- Around line 402-477: The test uses reservedCPUSet (parsed in BeforeAll for the
workerRTNode) but then may pick a different control-plane node (cpNode) whose
PerformanceProfile and reserved CPUs differ; update the test to obtain the
cpNode's own reserved CPU set before doing affinity checks or skip if none:
after selecting cpNode (the variable determined from workerRTNode or the
control-plane node loop), lookup the PerformanceProfile for cpNode via its
labels (same approach used in BeforeAll), parse Spec.CPU.Reserved into a
cpReservedCPUSet, and replace uses of reservedCPUSet in the mask.Equals(...)
checks and in the expectedOvsAffinity(cpOnlineCPUs, reservedCPUSet, ...) call
with cpReservedCPUSet; if no profile/reserved CPUs can be resolved for cpNode,
call Skip to avoid unsound assertions. Ensure changes reference reservedCPUSet,
workerRTNode, cpNode, expectedOvsAffinity, getOvsAffinities, and BeforeAll so
reviewers can locate the logic.

---

Nitpick comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 562-567: The context "Workload Partitioning OVS affinity"
currently uses BeforeEach to skip when the cluster-wide flag
isWorkloadPartitioningEnabled is false; change this to BeforeAll so the check
runs once for the whole context instead of before every spec—replace the
BeforeEach hook with a BeforeAll hook that calls Skip("Workload partitioning is
not enabled on this cluster") when isWorkloadPartitioningEnabled is false to
avoid per-test churn in logs.
- Around line 937-959: Introduce a named constant affinityStabilizationDelay
(value 30 * time.Second) near the top of the file and replace the hard-coded
time.Sleep(30 * time.Second) calls in getOvnContainerAffinity and
getOvsAffinities (and the inline wait in the WP test mentioned in the comment)
with time.Sleep(affinityStabilizationDelay) so the intent ("waiting for OVS
dynamic pinning to converge") is documented and the delay is centralized for
consistency.
- Around line 588-616: This test duplicates existing helpers: replace the manual
OVN/OVS affinity logic and the explicit time.Sleep with calls to the refactored
helpers getOvnContainerAffinity and getOvsAffinities; call
getOvnContainerAffinity(ctx, workerRTNode, reservedCPUSet) to get the OVN
container cpuset and assert it equals reservedCPUSet, then call
getOvsAffinities(ctx, ovsSystemdServices, workerRTNode) to retrieve pid->cpus
and loop the results asserting reservedCPUSet.IsSubsetOf(ovsAffinity) and that
ovsAffinity.Equals(reservedCPUSet) is false when appropriate—remove the
duplicated containerPid/ovnPod/ovnContainerids code and the 30s sleep so the
stabilisation logic stays centralized in the helpers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c301f9bd-d453-41ea-a409-7d57d2b1fbfb

📥 Commits

Reviewing files that changed from the base of the PR and between fa5b0e1 and bd99aa1.

📒 Files selected for processing (2)
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
  • test/e2e/performanceprofile/functests/utils/cluster/cluster.go

Comment thread on test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (Outdated)
Contributor

@Tal-or Tal-or left a comment


Good work, couple of comments inside.

On the u/s, clusters are not configured with WP; are we going to test clusters with WP configured d/s?

Expect(err).ToNot(HaveOccurred(), "Unable to check if workload partitioning is enabled")
testlog.Infof("Workload partitioning enabled: %v", isWorkloadPartitioningEnabled)

Expect(profile.Spec.CPU.Reserved).NotTo(BeNil())
Contributor

nit: ToNot

Expect(err).ToNot(HaveOccurred())
Expect(containerCpuset).To(Equal(onlineCPUSet), "Burstable pod containers cpuset.cpus do not match total online cpus")
if isWorkloadPartitioningEnabled {
Expect(containerCpuset).To(Equal(reservedCPUSet),
Contributor

Use the built-in containerCpuset.Equals() function
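A minimal form of what that looks like, assuming the cpuset package's built-in Equals method (which returns a bool):

Expect(containerCpuset.Equals(reservedCPUSet)).To(BeTrue(),
	"Under workload partitioning, OVN pod cpuset.cpus should match reserved cpus, got %s expected %s", containerCpuset, reservedCPUSet)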

Expect(containerCpuset).To(Equal(reservedCPUSet),
"Under workload partitioning, OVN pod cpuset.cpus should match reserved cpus, got %s expected %s", containerCpuset, reservedCPUSet)
} else {
Expect(containerCpuset).To(Equal(onlineCPUSet),
Contributor

ditto

// wait till the ovs process affinity is reverted back
pidList, err := ovsPids(ctx, ovsSystemdServices, workerRTNode)
Expect(err).ToNot(HaveOccurred())
testlog.Infof("Deleting Deployment %s", dp.Name)
Contributor

maybe add NS as well for better logging experience

By("Verifying affinities before reboot")
collectAndVerify("before-reboot")

By("Rebooting the worker node")
Contributor

you may consolidate with the line below.

Comment on lines +681 to +682
isolated := onlineCPUs.Difference(reservedCPUs)
return isolated.Difference(guPodCPUs).Union(reservedCPUs)
Contributor

or simply:
return onlineCPUs.Difference(guPodCPUs)


// getOvnContainerAffinity returns the CPU affinity of the first OVN container
// on the given node, waiting 30s for affinity to stabilise.
func getOvnContainerAffinity(ctx context.Context, node *corev1.Node) cpuset.CPUSet {
Contributor

add GinkgoHelper() in all test helpers that are throwing assertions outside the It() containers
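For reference, a minimal sketch of the requested pattern, mirroring the getOvsAffinities helper shown elsewhere in this review (ovsPids and getCPUMaskForPids are the PR's existing helpers; the usual dot-imports of ginkgo/v2 and gomega are assumed):

func getOvsAffinities(ctx context.Context, services []string, node *corev1.Node) map[string]cpuset.CPUSet {
	// GinkgoHelper() marks this function as a test helper so any failed
	// Expect below is reported at the caller's line inside the It().
	GinkgoHelper()
	pidList, err := ovsPids(ctx, services, node)
	Expect(err).ToNot(HaveOccurred())
	affinities, err := getCPUMaskForPids(ctx, pidList, node)
	Expect(err).ToNot(HaveOccurred())
	return affinities
}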


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (1)

50-73: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Outer workerRTNodes is shadowed in BeforeAll and is never populated.

Line 68 uses := inside the BeforeAll closure, which creates a new local workerRTNodes (and err) that shadow the outer-scope declarations from line 53/61. The local variable is correctly populated and used on line 73 to set workerRTNode, but the outer-scope workerRTNodes slice remains nil for the lifetime of the suite.

Consequently, the for _, node := range workerRTNodes loops at lines 117–122 and 192–196 never execute their bodies — both test assertions are vacuous.

🐛 Proposed fix — use `=` for the already-declared outer variables
-		workerRTNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
+		var localErr error
+		workerRTNodes, localErr = nodes.GetByLabels(testutils.NodeSelectorLabels)
+		Expect(localErr).ToNot(HaveOccurred())

-		workerRTNodes, err = nodes.MatchingOptionalSelector(workerRTNodes)
-		Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("error looking for the optional selector: %v", err))
+		workerRTNodes, localErr = nodes.MatchingOptionalSelector(workerRTNodes)
+		Expect(localErr).ToNot(HaveOccurred(), fmt.Sprintf("error looking for the optional selector: %v", localErr))

Or, simpler — because err is already declared in the outer scope, drop the : and use plain assignment:

-		workerRTNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
+		workerRTNodes, err = nodes.GetByLabels(testutils.NodeSelectorLabels)
		Expect(err).ToNot(HaveOccurred())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 50 - 73, The BeforeAll closure is accidentally shadowing the outer
workerRTNodes and err by using :=; change the assignments in the BeforeAll block
to use plain = so the already-declared outer-scope variables (workerRTNodes,
err) are populated (specifically update the nodes.GetByLabels(...) and
nodes.MatchingOptionalSelector(...) lines to use = instead of :=), ensuring the
outer workerRTNodes slice is available for the later for _, node := range
workerRTNodes loops and workerRTNode assignment.
🧹 Nitpick comments (1)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (1)

942-966: ⚡ Quick win

Hardcoded time.Sleep(30s) in getOvnContainerAffinity and getOvsAffinities is both fragile and slow.

Both helpers unconditionally sleep 30 seconds "for affinity to stabilise." This sleep fires on every call — including the multiple calls inside collectAndVerify and the WP tests — adding 60–120+ seconds of artificial delay per test run. If stabilization takes longer than 30 s the tests remain flaky; if it's faster the sleep is wasted.

Replace with an Eventually poll on the affinity value itself:

♻️ Proposed refactor for `getOvsAffinities`
 func getOvsAffinities(ctx context.Context, services []string, node *corev1.Node) map[string]cpuset.CPUSet {
 	GinkgoHelper()
 	pidList, err := ovsPids(ctx, services, node)
 	Expect(err).ToNot(HaveOccurred())
-	time.Sleep(30 * time.Second)
-	affinities, err := getCPUMaskForPids(ctx, pidList, node)
-	Expect(err).ToNot(HaveOccurred())
-	return affinities
+	var affinities map[string]cpuset.CPUSet
+	Eventually(func() error {
+		var e error
+		affinities, e = getCPUMaskForPids(ctx, pidList, node)
+		return e
+	}, 2*time.Minute, 5*time.Second).Should(Succeed())
+	return affinities
 }

Apply the same pattern to getOvnContainerAffinity and remove the time.Sleep(30 * time.Second) at line 591.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 942 - 966, The fixed code should remove the hardcoded time.Sleep(30
* time.Second) calls in getOvnContainerAffinity and getOvsAffinities and instead
poll until the CPU affinity stabilises using Gomega's Eventually: for
getOvsAffinities, repeatedly call ovsPids(ctx, services, node) and
getCPUMaskForPids(ctx, pidList, node) (or a helper that returns the computed
map) inside an Eventually block with a reasonable timeout/interval until the
returned affinities stop changing; for getOvnContainerAffinity, poll the
sequence ovnCnfNodePod(ctx, node) -> ovnPodContainers(&ovnPod) ->
nodes.ContainerPid(ctx, node, containerId) -> taskSet(ctx, pid, node) inside
Eventually until the affinity equals a previous value (stable) instead of
sleeping; remove the Sleep lines and ensure Expect checks remain outside or
inside Eventually as appropriate so tests fail fast on errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 1033-1040: The waitForDeploymentReady function currently only
checks container readiness for podList.Items[0], which misses other replicas;
update the readiness check to iterate over all pods in podList.Items and for
each pod iterate over pod.Status.ContainerStatuses ensuring every container is
Ready before returning true (i.e., only return true when all pods' containers
are ready); this will ensure subsequent calls like getGuPodCPUs observe all
replicas ready.
- Around line 932-938: The function expectedOvsAffinity has an unused parameter
reservedCPUs; remove reservedCPUs from the signature (change
expectedOvsAffinity(onlineCPUs, guPodCPUs cpuset.CPUSet) cpuset.CPUSet) and
update every call site that passes three args to pass only onlineCPUs and
guPodCPUs instead, ensuring the function body remains return
onlineCPUs.Difference(guPodCPUs); also update any references/imports or test
helpers that assert the old signature to match the new two-parameter form.
- Around line 1046-1058: The function collectGuCPUsFromPodList has an unused
parameter namespace (the namespace is provided via listOpts.Namespace); remove
the namespace parameter from its signature and from all call sites (e.g., where
collectGuCPUsFromPodList is invoked around the mentioned call site) so callers
pass only (ctx, listOpts), and update any references to match the new signature;
ensure build and tests pass after the change.

---

Outside diff comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 50-73: The BeforeAll closure is accidentally shadowing the outer
workerRTNodes and err by using :=; change the assignments in the BeforeAll block
to use plain = so the already-declared outer-scope variables (workerRTNodes,
err) are populated (specifically update the nodes.GetByLabels(...) and
nodes.MatchingOptionalSelector(...) lines to use = instead of :=), ensuring the
outer workerRTNodes slice is available for the later for _, node := range
workerRTNodes loops and workerRTNode assignment.

---

Nitpick comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 942-966: The fixed code should remove the hardcoded time.Sleep(30
* time.Second) calls in getOvnContainerAffinity and getOvsAffinities and instead
poll until the CPU affinity stabilises using Gomega's Eventually: for
getOvsAffinities, repeatedly call ovsPids(ctx, services, node) and
getCPUMaskForPids(ctx, pidList, node) (or a helper that returns the computed
map) inside an Eventually block with a reasonable timeout/interval until the
returned affinities stop changing; for getOvnContainerAffinity, poll the
sequence ovnCnfNodePod(ctx, node) -> ovnPodContainers(&ovnPod) ->
nodes.ContainerPid(ctx, node, containerId) -> taskSet(ctx, pid, node) inside
Eventually until the affinity equals a previous value (stable) instead of
sleeping; remove the Sleep lines and ensure Expect checks remain outside or
inside Eventually as appropriate so tests fail fast on errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b8924fa3-bc0f-46aa-81b9-12917df83050

📥 Commits

Reviewing files that changed from the base of the PR and between bd99aa1 and df26d1b.

📒 Files selected for processing (2)
  • test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go

Comment on lines +1033 to +1040
if len(podList.Items) == 0 {
return false
}
for _, s := range podList.Items[0].Status.ContainerStatuses {
if !s.Ready {
return false
}
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

waitForDeploymentReady only verifies the first pod's container readiness.

The loop at line 1036 iterates podList.Items[0].Status.ContainerStatuses, so for replicas=2 the second pod's containers are never checked. The function can return while only one of two pods is truly ready, causing subsequent getGuPodCPUs calls on the second pod to read an incorrect or unset affinity.

🐛 Proposed fix — check all pods in the list
-		for _, s := range podList.Items[0].Status.ContainerStatuses {
-			if !s.Ready {
-				return false
-			}
-		}
+		for i := range podList.Items {
+			for _, s := range podList.Items[i].Status.ContainerStatuses {
+				if !s.Ready {
+					return false
+				}
+			}
+		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if len(podList.Items) == 0 {
return false
}
for _, s := range podList.Items[0].Status.ContainerStatuses {
if !s.Ready {
return false
}
}
if len(podList.Items) == 0 {
return false
}
for i := range podList.Items {
for _, s := range podList.Items[i].Status.ContainerStatuses {
if !s.Ready {
return false
}
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 1033 - 1040, The waitForDeploymentReady function currently only
checks container readiness for podList.Items[0], which misses other replicas;
update the readiness check to iterate over all pods in podList.Items and for
each pod iterate over pod.Status.ContainerStatuses ensuring every container is
Ready before returning true (i.e., only return true when all pods' containers
are ready); this will ensure subsequent calls like getGuPodCPUs observe all
replicas ready.

@mrniranjan
Contributor Author

/retest

- Add GinkgoHelper() to all test helper functions that perform
  assertions, so failures report the caller location
- Use cpuset.Equals() for cpuset comparisons instead of struct equality
- Refactor CP reboot test to assert against online CPUs directly,
  avoiding dependency on worker profile's reservedCPUSet which may
  not apply to the CP node on MNO clusters
- Fix exhaustion test: rename to reflect it verifies GU pod exclusion
  (not full exhaustion), compute exact expected affinity using
  expectedOvsAffinity, and assert equality instead of weak subset check
- Simplify expectedOvsAffinity to onlineCPUs.Difference(guPodCPUs)
- Fix indentation to match surrounding test blocks
@mrniranjan mrniranjan force-pushed the ovs_dynamic_pinning_wp branch from df26d1b to cd213c0 on May 8, 2026 08:17

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (1)

68-68: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

workerRTNodes outer variable is never assigned due to := shadowing in the BeforeAll closure.

Line 68 uses short variable declaration (:=), creating a new local workerRTNodes variable scoped to the BeforeAll closure. The outer workerRTNodes (declared at line 53) remains nil, causing the test loops at lines 117 and 192 to iterate zero times, making those assertions vacuous.

Change line 68 to use = instead:

🐛 Proposed fix
-		workerRTNodes, err := nodes.GetByLabels(testutils.NodeSelectorLabels)
+		workerRTNodes, err = nodes.GetByLabels(testutils.NodeSelectorLabels)

The variable err is already declared in the outer var block (line 22), so no additional declaration is needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
at line 68, The outer variable workerRTNodes is left nil because the BeforeAll
closure uses short-declaration (:=) which shadows it; change the declaration in
the BeforeAll where nodes.GetByLabels(testutils.NodeSelectorLabels) is called to
an assignment (use = not :=) so the outer workerRTNodes is set, and reuse the
already-declared err variable instead of redeclaring it; locate the
nodes.GetByLabels call inside the BeforeAll closure and replace := with = so
subsequent loops over workerRTNodes in the test (e.g., later assertions) operate
on the populated slice.
♻️ Duplicate comments (3)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (3)

1046-1058: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

namespace parameter in collectGuCPUsFromPodList is still unused — not addressed.

The namespace argument is never referenced in the function body; the namespace is already carried by listOpts.Namespace.

♻️ Proposed fix
-func collectGuCPUsFromPodList(ctx context.Context, namespace string, listOpts *client.ListOptions) cpuset.CPUSet {
+func collectGuCPUsFromPodList(ctx context.Context, listOpts *client.ListOptions) cpuset.CPUSet {

Update the call site at line 378:

-guCPUs := collectGuCPUsFromPodList(ctx, dp.Namespace, dpListOpts)
+guCPUs := collectGuCPUsFromPodList(ctx, dpListOpts)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 1046 - 1058, The function collectGuCPUsFromPodList currently
ignores its namespace parameter; fix it by ensuring the list options actually
use that namespace before calling testclient.DataPlaneClient.List: if
listOpts.Namespace is empty or different, create or copy listOpts and set its
Namespace to the namespace argument (so List will query the intended namespace),
then proceed to list pods and union their CPUs as before; update any callers
only if they depend on the previous behavior.

1033-1040: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

waitForDeploymentReady still only checks the first pod's container statuses — not addressed.

The loop at line 1036 iterates podList.Items[0].Status.ContainerStatuses regardless of replicas. For a 2-replica deployment, the second pod's containers are never checked, so the function can return while only one pod is truly Ready.

🐛 Proposed fix
-		for _, s := range podList.Items[0].Status.ContainerStatuses {
-			if !s.Ready {
-				return false
-			}
-		}
+		for i := range podList.Items {
+			for _, s := range podList.Items[i].Status.ContainerStatuses {
+				if !s.Ready {
+					return false
+				}
+			}
+		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 1033 - 1040, In waitForDeploymentReady, the readiness check only
inspects podList.Items[0].Status.ContainerStatuses; change the logic to iterate
over every pod in podList.Items and for each pod iterate over
pod.Status.ContainerStatuses to ensure all containers in all pods are Ready
(still return false if podList.Items is empty or any container is not Ready),
keeping the existing early-return behavior and using the existing podList.Items
and ContainerStatuses references so all replicas are validated.

936-938: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

reservedCPUs parameter in expectedOvsAffinity is still unused — not addressed.

The reservedCPUs argument is accepted but never referenced in the function body, making every call site (lines 714, 987) silently pass a value that has no effect.

♻️ Proposed fix
-func expectedOvsAffinity(onlineCPUs, reservedCPUs, guPodCPUs cpuset.CPUSet) cpuset.CPUSet {
-	return onlineCPUs.Difference(guPodCPUs)
+func expectedOvsAffinity(onlineCPUs, guPodCPUs cpuset.CPUSet) cpuset.CPUSet {
+	return onlineCPUs.Difference(guPodCPUs)
 }

And update call sites at lines 714 and 987:

-expected := expectedOvsAffinity(onlineCPUSet, reservedCPUSet, guPodCPUs)
+expected := expectedOvsAffinity(onlineCPUSet, guPodCPUs)
-expected := expectedOvsAffinity(onlineCPUs, reservedCPUs, guCPUs)
+expected := expectedOvsAffinity(onlineCPUs, guCPUs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 936 - 938, The function expectedOvsAffinity currently ignores its
reservedCPUs parameter; update its implementation to exclude reservedCPUs from
the resulting CPU set (e.g., return
onlineCPUs.Difference(guPodCPUs).Difference(reservedCPUs) or return
onlineCPUs.Difference(guPodCPUs.Union(reservedCPUs))) so the reserved CPUs are
not assigned to OVS, and leave the function signature unchanged so existing
callers of expectedOvsAffinity continue to work without changes.
🧹 Nitpick comments (3)
test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go (3)

981-991: ⚡ Quick win

Non-WP path in verifyOvsMatchesExpected ignores guCPUs — misleading test semantics.

When isWP=false, the function asserts OVS affinity == ovnAffinity. Under non-WP, ovnAffinity == onlineCPUSet, so the assertion reduces to OVS affinity == onlineCPUSet regardless of whether GU pods are present. This means test_id:64101 ("Creating gu pods modifies affinity of ovs") and test_id:64102 do not actually verify any change in OVS affinity on non-WP clusters — the test passes vacuously in that mode.

If dynamic OVS pinning is not expected to change OVS affinity under non-WP, the test descriptions and the guCPUs parameter to verifyOvsMatchesExpected are misleading and the unused parameter should be removed from the non-WP branch. If OVS affinity is expected to change under non-WP, the assertion formula should account for guCPUs there too.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 981 - 991, The non-WP branch of verifyOvsMatchesExpected currently
ignores guCPUs and just compares OVS to ovnAffinity, which makes tests vacuous;
change the else branch to compute expected := expectedOvsAffinity(onlineCPUs,
reservedCPUs, guCPUs) and call verifyOvsAffinity(ovsAffinities, expected)
(instead of verifyOvsAffinity(ovsAffinities, ovnAffinity)) so non-WP behavior
accounts for guCPUs; alternatively, if OVS should not change under non-WP,
remove/stop passing the guCPUs parameter and update the test description
accordingly — but the quick fix is to use expectedOvsAffinity(...) in the else
branch of verifyOvsMatchesExpected.

591-591: ⚡ Quick win

Another hardcoded time.Sleep(30 * time.Second) directly in a test body.

Line 591 sleeps unconditionally inside the It block before reading ctnCpuset. The same concerns apply as for the helper-level sleeps: unreliable on slow clusters, always wasteful on fast ones. This particular call also duplicates the 30-second wait that getOvnContainerAffinity would already impose if that helper were used here instead. Consider either using Eventually to poll taskSet until the affinity stabilises, or unifying the wait with the helper.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
at line 591, Replace the hardcoded time.Sleep(30 * time.Second) inside the It
block with a deterministic wait: either call the existing helper
getOvnContainerAffinity (which already contains the required wait) instead of
the manual sleep, or use Gomega.Eventually to poll the taskSet/ctnCpuset until
the container CPU set/affinity stabilizes (e.g. assert ctnCpuset equals the
expected value within a timeout). Locate the sleep in the test body where
ctnCpuset is read, and modify the code to poll taskSet/ctnCpuset (or delegate to
getOvnContainerAffinity) so the test is robust on slow clusters and not wasteful
on fast ones.

950-962: ⚡ Quick win

Unconditional 30-second sleeps in shared helpers will compound across every test call.

Both getOvnContainerAffinity (line 950) and getOvsAffinities (line 962) each time.Sleep(30 * time.Second). Tests that call both helpers spend 60 s sleeping per invocation pair; test_id:64102 calls them twice (before and after pod deletion) — 120 s of fixed sleep per run regardless of cluster speed. This pattern is also unreliable: fast clusters waste 30 s, slow clusters may still read stale state.

Replace with Eventually-based polling on the affinity value, or at minimum centralise the single wait in the call site rather than burying it in each helper.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`
around lines 950 - 962, Both getOvnContainerAffinity and getOvsAffinities
contain unconditional time.Sleep(30 * time.Second) calls causing compounded
fixed delays; remove those sleeps and replace with polling (Gomega Eventually or
a shared WaitForAffinityStable helper) that repeatedly calls the underlying
probes (taskSet for getOvnContainerAffinity, ovsPids/affinity reader for
getOvsAffinities) until the affinity stabilises or a timeout is reached, or
alternatively centralise a single wait in the test caller instead of sleeping
inside these helpers so callers control timeout and frequency.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 443-449: The loops over the maps returned by getOvsAffinities
(e.g., ovsBeforeReboot and ovsAfterReboot) can vacuously succeed when the map is
empty; add an explicit non-empty guard before each loop by asserting the map
length is > 0 (e.g., Expect(len(ovsBeforeReboot)).To(BeNumerically(">", 0), "no
OVS affinities found for %v on node %v; check %v", cpOvsServices, cpNode,
cpOvsServices)) so the test fails fast when no OVS PIDs were discovered; apply
the same check for all other getOvsAffinities consumers referenced (the blocks
using ovsBeforeReboot, ovsAfterReboot and the other call sites around lines
noted) so the subsequent per-PID Expect(mask.Equals(...)) assertions actually
run.

---

Outside diff comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Line 68: The outer variable workerRTNodes is left nil because the BeforeAll
closure uses short-declaration (:=) which shadows it; change the declaration in
the BeforeAll where nodes.GetByLabels(testutils.NodeSelectorLabels) is called to
an assignment (use = not :=) so the outer workerRTNodes is set, and reuse the
already-declared err variable instead of redeclaring it; locate the
nodes.GetByLabels call inside the BeforeAll closure and replace := with = so
subsequent loops over workerRTNodes in the test (e.g., later assertions) operate
on the populated slice.

---

Duplicate comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 1046-1058: The function collectGuCPUsFromPodList currently ignores
its namespace parameter; fix it by ensuring the list options actually use that
namespace before calling testclient.DataPlaneClient.List: if listOpts.Namespace
is empty or different, create or copy listOpts and set its Namespace to the
namespace argument (so List will query the intended namespace), then proceed to
list pods and union their CPUs as before; update any callers only if they depend
on the previous behavior.
- Around line 1033-1040: In waitForDeploymentReady, the readiness check only
inspects podList.Items[0].Status.ContainerStatuses; change the logic to iterate
over every pod in podList.Items and for each pod iterate over
pod.Status.ContainerStatuses to ensure all containers in all pods are Ready
(still return false if podList.Items is empty or any container is not Ready),
keeping the existing early-return behavior and using the existing podList.Items
and ContainerStatuses references so all replicas are validated.
- Around line 936-938: The function expectedOvsAffinity currently ignores its
reservedCPUs parameter; update its implementation to exclude reservedCPUs from
the resulting CPU set (e.g., return
onlineCPUs.Difference(guPodCPUs).Difference(reservedCPUs) or return
onlineCPUs.Difference(guPodCPUs.Union(reservedCPUs))) so the reserved CPUs are
not assigned to OVS, and leave the function signature unchanged so existing
callers of expectedOvsAffinity continue to work without changes.

---

Nitpick comments:
In `@test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go`:
- Around line 981-991: The non-WP branch of verifyOvsMatchesExpected currently
ignores guCPUs and just compares OVS to ovnAffinity, which makes tests vacuous;
change the else branch to compute expected := expectedOvsAffinity(onlineCPUs,
reservedCPUs, guCPUs) and call verifyOvsAffinity(ovsAffinities, expected)
(instead of verifyOvsAffinity(ovsAffinities, ovnAffinity)) so non-WP behavior
accounts for guCPUs; alternatively, if OVS should not change under non-WP,
remove/stop passing the guCPUs parameter and update the test description
accordingly — but the quick fix is to use expectedOvsAffinity(...) in the else
branch of verifyOvsMatchesExpected.
- Line 591: Replace the hardcoded time.Sleep(30 * time.Second) inside the It
block with a deterministic wait: either call the existing helper
getOvnContainerAffinity (which already contains the required wait) instead of
the manual sleep, or use Gomega.Eventually to poll the taskSet/ctnCpuset until
the container CPU set/affinity stabilizes (e.g. assert ctnCpuset equals the
expected value within a timeout). Locate the sleep in the test body where
ctnCpuset is read, and modify the code to poll taskSet/ctnCpuset (or delegate to
getOvnContainerAffinity) so the test is robust on slow clusters and not wasteful
on fast ones.
- Around line 950-962: Both getOvnContainerAffinity and getOvsAffinities contain
unconditional time.Sleep(30 * time.Second) calls causing compounded fixed
delays; remove those sleeps and replace with polling (Gomega Eventually or a
shared WaitForAffinityStable helper) that repeatedly calls the underlying probes
(taskSet for getOvnContainerAffinity, ovsPids/affinity reader for
getOvsAffinities) until the affinity stabilises or a timeout is reached, or
alternatively centralise a single wait in the test caller instead of sleeping
inside these helpers so callers control timeout and frequency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6f9a886a-e21b-435e-a245-31c24902c58d

📥 Commits

Reviewing files that changed from the base of the PR and between df26d1b and cd213c0.

📒 Files selected for processing (1)
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go

Verify ovs affinity map is not empty

Signed-off-by: Niranjan M.R <mniranja@redhat.com>
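A minimal form of the guard this commit describes, following the review suggestion above (helper and variable names are taken from the review thread; illustrative only):

ovsAffinities := getOvsAffinities(ctx, ovsSystemdServices, workerRTNode)
Expect(ovsAffinities).ToNot(BeEmpty(),
	"no OVS pids/affinities found for services %v on node %s", ovsSystemdServices, workerRTNode.Name)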
@openshift-ci
Contributor

openshift-ci Bot commented May 8, 2026

@mrniranjan: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
