Skip to content

OCPBUGS-62177: verify cert revocation against all KAS pods#8263

Open
sdminonne wants to merge 4 commits intoopenshift:mainfrom
sdminonne:OCPBUGS-62177
Open

OCPBUGS-62177: verify cert revocation against all KAS pods#8263
sdminonne wants to merge 4 commits intoopenshift:mainfrom
sdminonne:OCPBUGS-62177

Conversation

@sdminonne
Copy link
Copy Markdown
Contributor

@sdminonne sdminonne commented Apr 16, 2026

What this PR does / why we need it:

Updates the CertificateRevocationController to verify certificate trust and revocation against every individual KAS pod rather than through the service load balancer. In HA deployments with 3 KAS replicas, the service endpoint load-balances to a single pod, so the check could hit a pod that had loaded the updated trust bundle while others hadn't, causing premature state transitions in the revocation flow.

Changes

control-plane-pki-operator (CertificateRevocationController):

  • Add verifyCertificateAgainstAllKASPods which enumerates ready KAS pods via label selector and connects to each pod IP directly (with SNI set to the KAS service name for TLS validation)
  • Add dual-cert cross-check in ensureOldSignerCertificateRevoked: before declaring a certificate revoked on a pod, verify the new leaf cert is also being served by that pod, preventing false positives when a pod restart temporarily makes the old cert unreachable
  • Wire a KAS pod informer so pod readiness changes trigger immediate reconciliation
  • Extract verifyCertificateTrusted and verifyCertificateRevoked helpers for per-pod SelfSubjectReview checks

control-plane-operator (RBAC):

  • Grant list/watch on pods to the control-plane-pki-operator role so it can enumerate KAS pods (update role asset + 5 test fixtures)

support/podspec:

  • Add IsPodReady and ContainerPort helpers for pod readiness checks and named port lookup

e2e:

  • Add TestCreateClusterHABreakGlassCredentials exercising the break-glass credential flow on a HighlyAvailable control plane (3 KAS replicas)
  • Shorten cluster name from ha-break-glass-credentials to ha-break-glass-creds to stay within Azure's 64-char VNet name limit

Context

This was previously fixed in PRs #7405 + #7744 but reverted in #7784 because the e2e test added alongside it had a ~50% flake rate. The controller fix itself was sound — the flakiness was in the e2e test wiring. This PR re-applies the controller logic with comprehensive unit tests and a different, more targeted HA e2e test.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-62177

Special notes for your reviewer:

The dual-cert cross-check (ensureOldSignerCertificateRevoked verifying the new cert is served before declaring the old one revoked) is the key addition beyond the original fix. Without it, a KAS pod restart during revocation could cause a false positive: the old cert becomes unreachable (connection refused), which looks like revocation, but is actually just the pod being down.

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.
  • This change includes e2e.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Apr 16, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 16, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR changes the certificate revocation controller to verify API server certificates by connecting to each non-terminating, ready KAS pod instead of a single service endpoint. It lists pods labeled app=kube-apiserver from a pod informer, filters by termination/ready/PodIP, resolves the kube-apiserver container port by name with a default fallback, and performs an HTTPS SelfSubjectReview request to https://<PodIP>:<port> with TLS SNI set to the KAS service name. The controller now requires all ready pods to accept the new signer or reject the old signer before progressing. Helpers isPodReady, containerPort, verifyCertificateAgainstAllKASPods, and pod-informer plumbing were added. The RBAC Role was expanded to allow list and watch on core pods. Tests covering pod readiness, port lookup, and verification logic were added.

Sequence Diagram(s)

sequenceDiagram
    participant Controller as CertificateRevocationController
    participant Informer as Pod Informer / Cache
    participant Pod as KAS Pod
    participant K8sAPI as Kubernetes API Server

    Controller->>Informer: list pods (label=app=kube-apiserver)
    Informer-->>Controller: pod list
    Controller->>Controller: filter non-terminating, check isPodReady & PodIP
    alt no ready pods
        Controller->>Controller: requeue
    else ready pods exist
        loop for each ready pod
            Controller->>Controller: resolve container port (named or default)
            Controller->>Pod: HTTPS SSR to https://<PodIP>:<port> (TLS SNI = KAS service)
            Pod->>K8sAPI: forward SSR to kube-apiserver instance
            K8sAPI-->>Pod: SSR response (authorized/unauthorized)
            Pod-->>Controller: SSR result
        end
        Controller->>Controller: require all pods to accept/reject as gate condition
    end
    Controller->>K8sAPI: update certificate revocation status
Loading
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 11.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ❓ Inconclusive The custom check requests evaluation of Ginkgo test code patterns (BeforeEach/AfterEach, Eventually/Consistently), but the test framework type is unclear from the repository context. The referenced test file is inaccessible, making assessment impossible. Clarify whether tests use Ginkgo BDD framework or standard Go testing, then re-evaluate with appropriate framework-specific criteria.
✅ Passed checks (10 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main change: updating the certificate revocation controller to verify against all KAS pods instead of a single endpoint, which is the core architectural improvement in this changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed Test file uses standard Go testing with t.Run() subtests and gomega assertions, not Ginkgo's BDD-style DSL. Check not applicable as no dynamic values embedded in stable test names.
Microshift Test Compatibility ✅ Passed This PR adds only standard Go unit tests for certificaterevocationcontroller, not Ginkgo e2e tests. The custom check targets only Ginkgo e2e tests (It(), Describe(), Context(), When() patterns), which are not present in this PR.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This pull request does not add any Ginkgo e2e tests; it only contains RBAC updates and standard Go unit tests.
Topology-Aware Scheduling Compatibility ✅ Passed The PR introduces no topology-incompatible scheduling constraints. The controller uses per-pod certificate verification with label-based pod listing and direct IP connections, functioning correctly across all OpenShift topologies (SNO, TNF, TNA, HA, HyperShift).
Ote Binary Stdout Contract ✅ Passed PR modifies control-plane-pki-operator code and unit tests using standard Go testing patterns, not OTE Binary Stdout Contract.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR does not add new Ginkgo e2e tests, only RBAC updates, controller logic modifications, and standard Go unit tests.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/control-plane-pki-operator Indicates the PR includes changes for the control plane PKI operator - in an OCP release and removed do-not-merge/needs-area labels Apr 16, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Around line 1320-1349: The test currently only asserts the aggregate boolean
result for verifyCertificateAgainstAllKASPods; update the case so the testFunc
used in the scenario records/counts invocations (e.g., increment a counter
closed over by the test) and assert that the counter equals the number of ready,
non-terminating pods (two in this case) in addition to checking expectedResult;
modify the test case's testFunc and add an assertion after calling
verifyCertificateAgainstAllKASPods to enforce one callback invocation per ready
pod.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`:
- Around line 603-612: The code currently forces pod-level TLS verification off
by setting podCfg.TLSClientConfig.Insecure = true and clearing CAData/CAFile;
instead remove that line and restore CA verification by copying the CA from the
original admin config into the pod config (e.g., set
podCfg.TLSClientConfig.CAData = adminCfg.TLSClientConfig.CAData and
podCfg.TLSClientConfig.CAFile = adminCfg.TLSClientConfig.CAFile) while keeping
podCfg.TLSClientConfig.ServerName = hcpmanifests.KubeAPIServerServiceName so
SNI/hostname verification works for the pod connection in
certificaterevocationcontroller (use the podCfg and adminCfg symbols to locate
the change).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 9b0040c3-97c3-458f-a2ee-f91dc0478343

📥 Commits

Reviewing files that changed from the base of the PR and between 846f2e9 and 7785913.

⛔ Files ignored due to path filters (5)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/AROSwift/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/GCP/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/control-plane-pki-operator/role.yaml
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 16, 2026

Codecov Report

❌ Patch coverage is 80.19802% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.83%. Comparing base (bded456) to head (daf87ad).

Files with missing lines Patch % Lines
...ationcontroller/certificaterevocationcontroller.go 78.49% 30 Missing and 10 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8263      +/-   ##
==========================================
+ Coverage   37.53%   37.83%   +0.30%     
==========================================
  Files         751      399     -352     
  Lines       92026    63851   -28175     
==========================================
- Hits        34544    24161   -10383     
+ Misses      54841    37809   -17032     
+ Partials     2641     1881     -760     
Files with missing lines Coverage Δ
support/podspec/containers.go 42.26% <100.00%> (+6.07%) ⬆️
...ationcontroller/certificaterevocationcontroller.go 55.55% <78.49%> (+8.46%) ⬆️

... and 352 files with indirect coverage changes

Flag Coverage Δ
cmd-support 32.80% <100.00%> (+0.03%) ⬆️
cpo-hostedcontrolplane ?
cpo-other ?
hypershift-operator 47.93% <ø> (ø)
other 28.80% <78.49%> (+1.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go (1)

99-100: Wire the Pods informer into the controller dependency chain.

Pod readiness and IP now gate revocation progress. Calling Pods().Informer() for a side effect (lines 99-100) does not register the informer with the controller factory, so the pod cache won't participate in initial sync and pod transitions won't trigger re-enqueuing of CRR work. Pass the informer to the controller factory to make pod updates drive reconciliation immediately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`
around lines 99 - 100, The Pods informer is only being created for side effects
(kubeInformersForNamespaces.InformersFor(hostedControlPlane.Namespace).Core().V1().Pods().Informer())
but not registered with the controller factory, so pod events won't drive
reconciliation; change the code to capture the informer (e.g., podsInformer :=
kubeInformersForNamespaces.InformersFor(hostedControlPlane.Namespace).Core().V1().Pods().Informer())
and pass that informer into the controller factory/constructor used to build the
CertificateRevocation controller (the controller factory or
NewController/WithInformer call you use to wire controllers) so the pod cache
participates in initial sync and pod updates enqueue CRR work immediately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`:
- Around line 603-617: The per-pod client probes currently inherit adminCfg with
no timeout, so create a bounded timeout for each pod: set podCfg.Timeout =
<reasonable duration> (e.g., 10s) after building podCfg (the struct from
rest.AnonymousClientConfig) to ensure the generated http client has a request
timeout, and also wrap the testFunc call with a child context with the same
timeout (use context.WithTimeout(ctx, <same duration>) and defer cancel) before
calling testFunc(ctxWithTimeout, podClient) to avoid hanging reconcile workers;
update references around podCfg, podClient, and testFunc accordingly.

---

Nitpick comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`:
- Around line 99-100: The Pods informer is only being created for side effects
(kubeInformersForNamespaces.InformersFor(hostedControlPlane.Namespace).Core().V1().Pods().Informer())
but not registered with the controller factory, so pod events won't drive
reconciliation; change the code to capture the informer (e.g., podsInformer :=
kubeInformersForNamespaces.InformersFor(hostedControlPlane.Namespace).Core().V1().Pods().Informer())
and pass that informer into the controller factory/constructor used to build the
CertificateRevocation controller (the controller factory or
NewController/WithInformer call you use to wire controllers) so the pod cache
participates in initial sync and pod updates enqueue CRR work immediately.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: e75e2dca-ee10-4fe8-951e-575e5dcc6fc0

📥 Commits

Reviewing files that changed from the base of the PR and between 7785913 and c345009.

📒 Files selected for processing (1)
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go (2)

1376-1378: Strengthen the listPods mock contract.

Line 1376 currently ignores namespace and selector, so selector/namespace regressions won’t be caught by this test.

Suggested diff
 		t.Run(testCase.name, func(t *testing.T) {
+			expectedSelector := labels.SelectorFromSet(labels.Set{"app": "kube-apiserver"})
 			c := &CertificateRevocationController{
 				listPods: func(namespace string, selector labels.Selector) ([]*corev1.Pod, error) {
+					if namespace != "test-ns" {
+						t.Fatalf("unexpected namespace: %q", namespace)
+					}
+					if selector.String() != expectedSelector.String() {
+						t.Fatalf("unexpected selector: %q", selector.String())
+					}
 					return testCase.pods, nil
 				},
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`
around lines 1376 - 1378, The test's listPods mock currently ignores its
namespace and selector parameters; update the mock used in the test case for
listPods to validate that the incoming namespace and selector match the expected
values for that test (e.g., compare against testCase.expectedNamespace and
testCase.expectedSelector or derive from testCase.namespace/testCase.selector)
and return an error if they differ, otherwise return testCase.pods; this will
ensure regressions in namespace/selector handling are caught by the test.

1246-1373: Add explicit cases for missing PodIP and error propagation branches.

The table currently misses the PodIP == "" requeue path and error-return branches (listPods error and testFunc error). Covering these will harden regression detection in this critical flow.

Suggested additions (table-driven cases)
 	for _, testCase := range []struct {
 		name     string
 		pods     []*corev1.Pod
 		testFunc func(ctx context.Context, client kubernetes.Interface) (bool, error)
+		listPodsErr error
 
 		expectedResult bool
 		expectedErr    bool
 		expectedCalls  int
 	}{
+		{
+			name: "When a ready pod has empty PodIP it should requeue",
+			pods: []*corev1.Pod{{
+				ObjectMeta: metav1.ObjectMeta{Name: "kas-1", Namespace: "test-ns"},
+				Spec:       kasPodSpec(),
+				Status: corev1.PodStatus{
+					PodIP: "",
+					Conditions: []corev1.PodCondition{{
+						Type:   corev1.PodReady,
+						Status: corev1.ConditionTrue,
+					}},
+				},
+			}},
+			expectedResult: false,
+			expectedCalls:  0,
+		},
+		{
+			name:          "When listing pods fails it should return an error",
+			listPodsErr:   assert.AnError,
+			expectedErr:   true,
+			expectedCalls: 0,
+		},
+		{
+			name: "When testFunc returns an error it should return an error",
+			pods: []*corev1.Pod{{
+				ObjectMeta: metav1.ObjectMeta{Name: "kas-1", Namespace: "test-ns"},
+				Spec:       kasPodSpec(),
+				Status: corev1.PodStatus{
+					PodIP: "10.0.0.1",
+					Conditions: []corev1.PodCondition{{Type: corev1.PodReady, Status: corev1.ConditionTrue}},
+				},
+			}},
+			testFunc: func(_ context.Context, _ kubernetes.Interface) (bool, error) {
+				return false, assert.AnError
+			},
+			expectedErr:   true,
+			expectedCalls: 1,
+		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`
around lines 1246 - 1373, Add three table-driven test cases to exercise the
missing branches: (1) a "pod with empty PodIP should requeue" case where pods
contains a Pod whose Status.PodIP == "" and expectedResult is false; (2) a
"listPods returns error" case that simulates listPods failing (set test setup to
return an error when listing pods) and set expectedErr true; and (3) a "testFunc
returns error" case where testFunc returns (false, error) and expectedErr true.
Place these entries alongside the existing cases in the same test table (the
loop over testCase) so the existing test runner uses them; reference the
existing fields testFunc and expectedErr to validate behavior and the listPods
path to simulate the list failure. Ensure the case names are descriptive and
expectedResult/expectedCalls are set appropriately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Around line 1376-1378: The test's listPods mock currently ignores its
namespace and selector parameters; update the mock used in the test case for
listPods to validate that the incoming namespace and selector match the expected
values for that test (e.g., compare against testCase.expectedNamespace and
testCase.expectedSelector or derive from testCase.namespace/testCase.selector)
and return an error if they differ, otherwise return testCase.pods; this will
ensure regressions in namespace/selector handling are caught by the test.
- Around line 1246-1373: Add three table-driven test cases to exercise the
missing branches: (1) a "pod with empty PodIP should requeue" case where pods
contains a Pod whose Status.PodIP == "" and expectedResult is false; (2) a
"listPods returns error" case that simulates listPods failing (set test setup to
return an error when listing pods) and set expectedErr true; and (3) a "testFunc
returns error" case where testFunc returns (false, error) and expectedErr true.
Place these entries alongside the existing cases in the same test table (the
loop over testCase) so the existing test runner uses them; reference the
existing fields testFunc and expectedErr to validate behavior and the listPods
path to simulate the list failure. Ensure the case names are descriptive and
expectedResult/expectedCalls are set appropriately.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 40f42242-fce3-4416-beb2-928fa3c63249

📥 Commits

Reviewing files that changed from the base of the PR and between c345009 and 0e87b3d.

📒 Files selected for processing (1)
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go (1)

601-622: ⚠️ Potential issue | 🟠 Major

Add explicit timeout to prevent hanging on unresponsive pods.

The per-pod client probes inherit Timeout=0 from adminCfg (meaning no timeout). A blackholed or half-open PodIP can hang the reconcile worker indefinitely since rest.Config.Timeout=0 means "no timeout" in client-go.

🛡️ Proposed fix
+	const perPodTimeout = 10 * time.Second
+
 	for _, pod := range readyPods {
 		port := containerPort(pod, "client", config.KASPodDefaultPort)
 		podCfg := rest.AnonymousClientConfig(adminCfg)
+		podCfg.Timeout = perPodTimeout
 		podCfg.TLSClientConfig.CertData = certPEM
 		podCfg.TLSClientConfig.KeyData = keyPEM
 		podCfg.Host = fmt.Sprintf("https://%s", net.JoinHostPort(pod.Status.PodIP, strconv.Itoa(int(port))))
 		// We're connecting to the PodIP, but the serving cert is still issued for the KAS service
 		// name. Keep CA verification enabled and override ServerName for SNI + hostname validation.
 		podCfg.TLSClientConfig.ServerName = hcpmanifests.KubeAPIServerServiceName

 		podClient, err := kubernetes.NewForConfig(podCfg)
 		if err != nil {
 			return false, fmt.Errorf("couldn't create client for KAS pod %s/%s: %w", pod.Namespace, pod.Name, err)
 		}

-		passed, err := testFunc(ctx, podClient)
+		podCtx, cancel := context.WithTimeout(ctx, perPodTimeout)
+		passed, err := testFunc(podCtx, podClient)
+		cancel() // explicit cleanup instead of defer inside loop
 		if err != nil {
 			return false, err
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`
around lines 601 - 622, The pod-specific REST client inherits a zero (no)
timeout from rest.AnonymousClientConfig(adminCfg), which can hang on blackholed
PodIPs; set an explicit timeout on the returned rest.Config (podCfg.Timeout)
before calling kubernetes.NewForConfig — e.g. use a sane default like 15–30s or
copy adminCfg.Timeout when non-zero — and ensure you import time if using a
time.Duration literal; update the code around
rest.AnonymousClientConfig(adminCfg) / podCfg and before kubernetes.NewForConfig
to assign podCfg.Timeout.
🧹 Nitpick comments (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go (1)

584-599: Consider adding debug logging for requeue reasons.

When requeuing because a pod isn't ready or no pods are available, it may be helpful for debugging to log which pod or condition triggered the requeue. This is optional since the behavior is correct.

💡 Optional: Add debug logging
 	var readyPods []*corev1.Pod
 	for _, pod := range pods {
 		if pod.DeletionTimestamp != nil {
 			continue
 		}
 		if !isPodReady(pod) || pod.Status.PodIP == "" {
 			// a non-terminating pod that's not ready: we can't check it yet, requeue
+			klog.V(4).Infof("KAS pod %s/%s not ready for verification (ready=%v, podIP=%q), requeueing", 
+				pod.Namespace, pod.Name, isPodReady(pod), pod.Status.PodIP)
 			return false, nil
 		}
 		readyPods = append(readyPods, pod)
 	}

 	if len(readyPods) == 0 {
 		// no pods to check yet, requeue
+		klog.V(4).Infof("No ready KAS pods found in namespace %s, requeueing", namespace)
 		return false, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`
around lines 584 - 599, Add debug logs before the two early returns to record
why we're requeuing: inside the loop, when skipping a pod because
!isPodReady(pod) or pod.Status.PodIP == "" log the pod's name/namespace and the
specific condition that triggered the requeue (use pod.Name, pod.Namespace,
isPodReady result and pod.Status.PodIP); and after the loop, when len(readyPods)
== 0 log that no non-terminating ready pods were found (include the original
pods list size or selector info if available). Use the controller's existing
logger (e.g., r.Log or reqLogger) to emit these debug messages right before the
respective return false, nil statements.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`:
- Around line 601-622: The pod-specific REST client inherits a zero (no) timeout
from rest.AnonymousClientConfig(adminCfg), which can hang on blackholed PodIPs;
set an explicit timeout on the returned rest.Config (podCfg.Timeout) before
calling kubernetes.NewForConfig — e.g. use a sane default like 15–30s or copy
adminCfg.Timeout when non-zero — and ensure you import time if using a
time.Duration literal; update the code around
rest.AnonymousClientConfig(adminCfg) / podCfg and before kubernetes.NewForConfig
to assign podCfg.Timeout.

---

Nitpick comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`:
- Around line 584-599: Add debug logs before the two early returns to record why
we're requeuing: inside the loop, when skipping a pod because !isPodReady(pod)
or pod.Status.PodIP == "" log the pod's name/namespace and the specific
condition that triggered the requeue (use pod.Name, pod.Namespace, isPodReady
result and pod.Status.PodIP); and after the loop, when len(readyPods) == 0 log
that no non-terminating ready pods were found (include the original pods list
size or selector info if available). Use the controller's existing logger (e.g.,
r.Log or reqLogger) to emit these debug messages right before the respective
return false, nil statements.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: a2b128b1-8046-4168-956c-a837c093fb68

📥 Commits

Reviewing files that changed from the base of the PR and between 0e87b3d and 3ac77d1.

⛔ Files ignored due to path filters (5)
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/AROSwift/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/GCP/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
  • control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/control-plane-pki-operator/role.yaml
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
✅ Files skipped from review due to trivial changes (1)
  • control-plane-operator/controllers/hostedcontrolplane/v2/assets/control-plane-pki-operator/role.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go

@sdminonne sdminonne changed the title fix(OCPBUGS-62177): verify cert revocation against all KAS pods OCPBUGS-62177: verify cert revocation against all KAS pods Apr 22, 2026
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 22, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@sdminonne: This pull request references Jira Issue OCPBUGS-62177, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

  • Add per-pod verification to CertificateRevocationController so that certificate trust/revocation checks connect to each KAS pod individually rather than through the service load balancer, preventing premature advancement of the revocation flow in HA setups
  • Grant list/watch on pods to the control-plane-pki-operator RBAC role (and update all 5 test fixtures)
  • Add unit tests for isPodReady, containerPort, and verifyCertificateAgainstAllKASPods

Context

The controller's ensureNewSignerCertificatePropagated and ensureOldSignerCertificateRevoked previously verified certificate trust/revocation via the KAS service endpoint. In HA setups with multiple KAS pods, the service load-balances to a single pod, so the check could hit a pod that had loaded the updated trust bundle while others hadn't, causing premature state transitions.

This was previously fixed in PRs #7405 + #7744 but reverted in #7784 because the e2e test added alongside it (TestCreateClusterRequestServingIsolation) had a ~50% flake rate. The controller fix itself was sound — the flakiness was in the e2e test wiring. This PR re-applies only the controller logic with unit tests, without the flaky e2e test.

Test plan

  • go test -v -race ./control-plane-pki-operator/certificaterevocationcontroller/... — all pass
  • go test -v -run TestControlPlaneComponents ./control-plane-operator/controllers/hostedcontrolplane/ — passes (validates RBAC fixture changes)
  • Existing integration test TestControlPlanePKIOperatorBreakGlassCredentials covers the full CRR flow
  • Existing e2e test TestCreateCluster covers single-replica CRR flow

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

  • Improved API server certificate propagation and revocation by validating trust/rejection on each active API-server pod.

  • Broadened operator permissions to allow listing and watching pod resources for reliable monitoring.

  • Tests

  • Added unit tests for pod readiness, container-port resolution, and per-pod certificate verification behaviors.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sdminonne sdminonne marked this pull request as ready for review April 23, 2026 05:53
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 23, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and csrwng April 23, 2026 05:53
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go (1)

99-100: Wire pod updates into the queue, not just the cache.

The new gate now depends on pod readiness and PodIP, but this informer is only being initialized for lister access. That means CRRs still converge on those transitions only via SyntheticRequeueError. Hooking pod events into enqueueAll(...) would make rollouts react faster and cut needless requeues while KAS pods are coming up.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`
around lines 99 - 100, The Pod informer is only created for lister access but
not wired to enqueue changes, so Pod readiness/PodIP transitions don't trigger
immediate reconciles; update the code around
kubeInformersForNamespaces.InformersFor(hostedControlPlane.Namespace).Core().V1().Pods().Informer()
to add event handlers that call enqueueAll(...) on Pod Add/Update/Delete (use
cache.ResourceEventHandlerFuncs with AddFunc, UpdateFunc that compares old/new
and always call enqueueAll for relevant changes, and DeleteFunc) so Pod events
drive the controller rather than relying on SyntheticRequeueError.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`:
- Around line 99-100: The Pod informer is only created for lister access but not
wired to enqueue changes, so Pod readiness/PodIP transitions don't trigger
immediate reconciles; update the code around
kubeInformersForNamespaces.InformersFor(hostedControlPlane.Namespace).Core().V1().Pods().Informer()
to add event handlers that call enqueueAll(...) on Pod Add/Update/Delete (use
cache.ResourceEventHandlerFuncs with AddFunc, UpdateFunc that compares old/new
and always call enqueueAll for relevant changes, and DeleteFunc) so Pod events
drive the controller rather than relying on SyntheticRequeueError.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: b0d6c3c1-3ba0-403f-bbf5-2f676ef56184

📥 Commits

Reviewing files that changed from the base of the PR and between 3ac77d1 and 135f5b1.

📒 Files selected for processing (2)
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
  • control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go

@sdminonne sdminonne force-pushed the OCPBUGS-62177 branch 2 times, most recently from 146e31a to 5b7f272 Compare April 23, 2026 06:45
@sdminonne
Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Apr 23, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@sdminonne: This pull request references Jira Issue OCPBUGS-62177, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jiezhao@redhat.com), skipping review request.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Apr 23, 2026
@sdminonne sdminonne marked this pull request as draft April 23, 2026 06:52
@sdminonne
Copy link
Copy Markdown
Contributor Author

/pipeline required

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2052701249999998976 | Cost: $3.862374349999999 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2052701249924501504 | Cost: $4.8982022999999995 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Copy link
Copy Markdown
Contributor Author

Root Cause: The test uses hyperv1.NonePlatform in an AKS environment that requires explicit route hostnames. Since NonePlatform doesn't configure route hostnames in service publishing strategies, the OAuth
route was created without a spec.host field. The control-plane-operator's infrastructure readiness check requires valid hosts on routes, so it blocked indefinitely.

@sdminonne
Copy link
Copy Markdown
Contributor Author

/pipeline required

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2052748290994212864 | Cost: $4.427638500000001 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2052748291099070464 | Cost: $2.945335499999999 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2052802658598653952 | Cost: $4.5939735 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@sdminonne
Copy link
Copy Markdown
Contributor Author

/retest-required

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

sdminonne added 3 commits May 9, 2026 21:07
- Add IsPodReady() to check pod Ready condition status
- Add ContainerPort() to look up named container ports with a default
  fallback
- Add KASReadinessCheckContainer() sidecar that probes KAS /livez for
  PDB-aware readiness gating
- Add EnforceRestrictedSecurityContextToContainers() for restricted pod
  security standards enforcement

These helpers are used by the control-plane-pki-operator to connect
directly to individual KAS pods during certificate revocation
verification.

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
The CertificateRevocationController now connects directly to each KAS
pod IP instead of going through the service load balancer during
certificate verification. This prevents false positives in HA
deployments where the LB might route to a single pod.

- Add verifyCertificateAgainstAllKASPods() that lists KAS pods and
  verifies each individually via direct PodIP connections with SNI
- Add cross-check against KAS Deployment spec.replicas to ensure all
  pods are visible before proceeding with verification
- Wire pod informer to trigger reconciliation on KAS pod transitions
- Add cross-verification in both ensureNewSignerCertificatePropagated
  (old signer still trusted) and ensureOldSignerCertificateRevoked
  (new signer still trusted) to detect mid-reload transient states
- Grant the control-plane-pki-operator RBAC to get deployments and
  list/get pods in the HCP namespace

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
- Add TestCreateClusterHABreakGlassCredentials e2e test that creates a
  3-replica HA hosted cluster and runs the break-glass credential tests
- Replace one-shot SSR revocation check with a polling loop through the
  service LB (1s interval, 3m timeout) to tolerate transient states
  when a KAS pod restarts between the controller's verification and
  the test's check
- Detect worker node count from NodePool replicas for private clusters
  where direct node listing is unavailable
- Use CI platform for the HA break-glass credential test

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aws

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks-4-22

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aws-4-22

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aks | Build: 2053196904266731520 | Cost: $2.682732499999999 | Failed step: hypershift-azure-run-e2e

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@hypershift-jira-solve-ci
Copy link
Copy Markdown

AI Test Failure Analysis

Job: pull-ci-openshift-hypershift-main-e2e-aws | Build: 2053196876466884608 | Cost: $4.686815500000001 | Failed step: hypershift-aws-run-e2e-nested

View full analysis report


Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 10, 2026

@sdminonne: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws d40c1be link true /test e2e-aws
ci/prow/e2e-aks d40c1be link true /test e2e-aks
ci/prow/e2e-aws-4-22 d40c1be link true /test e2e-aws-4-22

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hypershift-jira-solve-ci
Copy link
Copy Markdown

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws-4-22
  • Build ID: 2053216377770086400
  • Target: e2e-aws-4-22
  • Failed Step: hypershift-aws-run-e2e-nested

Test Failure Analysis

Error

TestCreateCluster timed out — process killed after 2h Prow timeout.
Two sub-tests hung indefinitely at control_plane_pki_operator.go:290:
  "polling SSR to confirm that the revoked certificate is rejected"

Summary

TestCreateCluster was the only test that never completed (no PASS/FAIL/SKIP line emitted). All other tests passed or were skipped. The test hung in the break-glass credentials revocation flow for both customer-break-glass and sre-break-glass signers. The CertificateRevocationRequest (CRR) completed successfully in both cases (customer: 2m3s, sre: 2m48s), but the post-CRR validation — which polls a SelfSubjectReview (SSR) through the KAS service load balancer expecting an Unauthorized response — never received the expected error, causing the polling loop to run until the 2h Prow timeout killed the process.

Root Cause — SSR calls with revoked client certificate hang at TLS handshake instead of returning 401 Unauthorized

PR #8263 (OCPBUGS-62177: verify cert revocation against all KAS pods) changed validateRevocation() in test/integration/control_plane_pki_operator.go from a single t.Fatalf-based assertion to a wait.PollUntilContextTimeout(ctx, 1*time.Second, 3*time.Minute, true, ...) polling loop. The loop performs SSR requests using a client configured with the now-revoked certificate, expecting Unauthorized.

The critical evidence: The log line "SSR not yet unauthorized (err=%v), retrying" — which the polling loop emits on each non-Unauthorized response — never appears in the build logs. This means the SSR HTTP requests themselves are blocking and never returning a response. The most likely cause is that the TLS handshake with the revoked client certificate hangs rather than being rejected:

  1. The controller's new verifyCertificateAgainstAllKASPods() connects directly to each KAS pod by PodIP and confirms the certificate is revoked (CRR completes, condition PreviousCertificatesRevokedType=True is set).
  2. The test's validateRevocation() then connects through the KAS service load balancer using previousCertClient (built with the revoked cert).
  3. The Go TLS client attempts a handshake with the revoked cert. KAS, having reloaded its trust bundle to exclude the revoked signer, does not send a clean tls.AlertCertificateRevoked — instead the connection appears to stall (likely a half-closed or reset connection that the Go HTTP client retries silently).
  4. Because the SSR call never returns (no response, no error), the conditionFunc inside PollUntilContextTimeout never executes, the retry log is never printed, and the loop blocks on the HTTP round-trip indefinitely.

Why HA passed but non-HA hung:

  • TestCreateClusterHABreakGlassCredentials (3 KAS replicas, 0 workers) passed in 1838.75s.
  • TestCreateCluster (1 KAS replica, non-HA / SingleReplica) hung.

In the HA case with 3 KAS pods, the service load balancer may route the SSR to a pod that has already fully reloaded its trust bundle and cleanly rejects the cert. In the SingleReplica case, the single KAS pod may be in a transient state during trust bundle reload where it neither accepts nor cleanly rejects the revoked cert, causing the TLS handshake to stall. The controller's per-pod verification (which connected by PodIP and bypassed the LB) succeeded, but the subsequent LB-routed SSR encounters the same pod in a different connection state.

Contributing factor — missing effective timeout: The PollUntilContextTimeout specifies a 3-minute timeout, but the ctx passed is the parent test context whose deadline is the 2h Prow job timeout. If the individual SSR HTTP call hangs (blocks on I/O), the 3-minute poll timeout cannot interrupt it because Go's context.Context cancellation only works if the HTTP client's transport respects context — and a hanging TLS handshake may not check the context until the OS-level TCP timeout (typically 2+ minutes per attempt).

Recommendations
  1. Add a per-request HTTP timeout to previousCertClient: The client used in validateRevocation() should have a short transport-level timeout (e.g., 10 seconds) so that hanging TLS handshakes are interrupted and the polling loop can actually retry. Example:

    previousCertClient.Timeout = 10 * time.Second

    Or configure the TLS dialer with a timeout:

    transport.TLSHandshakeTimeout = 5 * time.Second
  2. Use a dedicated context with a hard deadline for validateRevocation(): Instead of passing the parent ctx, create a child context with a 3–5 minute timeout:

    revocationCtx, cancel := context.WithTimeout(ctx, 5*time.Minute)
    defer cancel()

    This ensures the polling terminates even if individual HTTP calls hang.

  3. Log before and after the SSR call: Add logging around the actual SSR HTTP request (not just inside the condition function) to distinguish between "SSR returned non-401" and "SSR call is blocking":

    t.Logf("attempting SSR with revoked cert...")
    ssr, err := previousCertClient.AuthenticationV1().SelfSubjectReviews().Create(ctx, ...)
    t.Logf("SSR completed: err=%v", err)
  4. Consider matching the controller's verification strategy in the test: Since the controller now verifies per-pod via PodIP (bypassing the LB), the test's validateRevocation() could also verify per-pod instead of through the service LB. This would eliminate the LB routing variable and match the controller's own verification semantics.

  5. Investigate KAS trust bundle reload behavior under SingleReplica: The fact that HA passes but SingleReplica hangs suggests a KAS behavior difference during trust bundle hot-reload. File a follow-up to investigate whether KAS's serving cert reload can cause brief connection stalls.

Evidence

1. Test never completed (build-log.txt tail):

2026-05-09T23:21:17Z  INFO step failed after 2h30m23s: step/hypershift-aws-run-e2e-nested
Process did not finish before 2h0m0s timeout; sending interrupt signal.
Process interrupted, running postExec process.
exit status 127

2. Hang location (hypershift-aws-run-e2e-nested-build-log.txt:1646, 1799):

=== CONT  TestCreateCluster/Main/break-glass-credentials/customer-break-glass/CSR_flow/revocation
    control_plane_pki_operator.go:290: polling SSR to confirm that the revoked certificate is rejected

=== CONT  TestCreateCluster/Main/break-glass-credentials/sre-break-glass/CSR_flow/revocation
    control_plane_pki_operator.go:290: polling SSR to confirm that the revoked certificate is rejected

No further output from these sub-tests. No "SSR not yet unauthorized" retry log ever appears.

3. CRR completed successfully:

control_plane_pki_operator.go:274: customer-break-glass: CertificateRevocationRequest completed in 2m3.003s
control_plane_pki_operator.go:274: sre-break-glass: CertificateRevocationRequest completed in 2m48.005s

4. HA test passed:

--- PASS: TestCreateClusterHABreakGlassCredentials (1838.75s)

5. TestCreateCluster never emitted PASS/FAIL/SKIP — confirmed by scanning all test results in the build log. Every other === RUN test has a corresponding result line except TestCreateCluster.

6. PR #8263 changed validateRevocation() (test/integration/control_plane_pki_operator.go):

  • Before: single t.Fatalf assertion
  • After: wait.PollUntilContextTimeout(ctx, 1*time.Second, 3*time.Minute, true, func(ctx) (bool, error) { ... }) polling loop with "SSR not yet unauthorized" retry log

7. PR #8263 added per-pod verification in the controller (verifyCertificateAgainstAllKASPods()) — connects to each KAS pod by PodIP, bypassing the service LB. This per-pod check succeeds (CRR completes), but the test's LB-routed SSR does not.

Artifacts

  • Test artifacts: .work/prow-job-analyze-test-failure/2053216377770086400/logs/
  • Build log: .work/prow-job-analyze-test-failure/2053216377770086400/logs/build-log.txt
  • Step log: .work/prow-job-analyze-test-failure/2053216377770086400/logs/hypershift-aws-run-e2e-nested-build-log.txt

The validateRevocation polling loop could hang indefinitely when KAS
stalls the TLS handshake during trust bundle reload after certificate
revocation. Without a per-request HTTP timeout, the blocking I/O
prevented the 3-minute PollUntilContextTimeout from interrupting the
call, causing TestCreateCluster to hang until the 2h Prow timeout.

Set a 10-second per-request timeout on the client so stalled TLS
handshakes are interrupted and the poll loop can retry. Also add
logging before the SSR call to distinguish between a blocking call
and a non-401 response in future CI logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aks

@sdminonne
Copy link
Copy Markdown
Contributor Author

/test e2e-aws

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release area/control-plane-pki-operator Indicates the PR includes changes for the control plane PKI operator - in an OCP release area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release area/testing Indicates the PR includes changes for e2e testing jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants