OCPBUGS-62177: verify cert revocation against all KAS pods #8263
sdminonne wants to merge 4 commits into openshift:main
Conversation
Pipeline controller notification: for optional jobs, comment the listed trigger. This repository is configured in: LGTM mode.

Skipping CI for Draft Pull Request.
Note — Reviews paused: this branch is under active development, so CodeRabbit has automatically paused the review to avoid overwhelming you with comments due to the influx of new commits. This behavior is configurable, and commands and quick-action checkboxes are available to manage reviews.
📝 Walkthrough

The PR changes the certificate revocation controller to verify API server certificates by connecting to each non-terminating, ready KAS pod instead of a single service endpoint. It lists pods labeled `app=kube-apiserver`.

Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Controller as CertificateRevocationController
participant Informer as Pod Informer / Cache
participant Pod as KAS Pod
participant K8sAPI as Kubernetes API Server
Controller->>Informer: list pods (label=app=kube-apiserver)
Informer-->>Controller: pod list
Controller->>Controller: filter non-terminating, check isPodReady & PodIP
alt no ready pods
Controller->>Controller: requeue
else ready pods exist
loop for each ready pod
Controller->>Controller: resolve container port (named or default)
Controller->>Pod: HTTPS SSR to https://<PodIP>:<port> (TLS SNI = KAS service)
Pod->>K8sAPI: forward SSR to kube-apiserver instance
K8sAPI-->>Pod: SSR response (authorized/unauthorized)
Pod-->>Controller: SSR result
end
Controller->>Controller: require all pods to accept/reject as gate condition
end
    Controller->>K8sAPI: update certificate revocation status
```
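The walkthrough above maps to a compact gating loop. Below is a minimal Go sketch of that shape, not the PR's literal code: `verifyAgainstAllKASPods` is an illustrative name, the SNI value and default port 6443 are assumptions, and `isPodReady`/`containerPort` stand in for the `support/podspec` helpers this PR adds (sketched later in this thread, after the commit messages).

```go
package certrevocation

import (
	"context"
	"fmt"
	"net"
	"strconv"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// verifyAgainstAllKASPods gates on every non-terminating KAS pod: any pod that
// is not yet ready (or has no IP) triggers a requeue, and testFunc must pass
// against each ready pod before the controller may advance state.
func verifyAgainstAllKASPods(
	ctx context.Context,
	pods []*corev1.Pod, // already filtered by the app=kube-apiserver selector
	adminCfg *rest.Config,
	certPEM, keyPEM []byte,
	testFunc func(context.Context, kubernetes.Interface) (bool, error),
) (bool, error) {
	var ready []*corev1.Pod
	for _, pod := range pods {
		if pod.DeletionTimestamp != nil {
			continue // terminating pods are excluded from the gate
		}
		if !isPodReady(pod) || pod.Status.PodIP == "" {
			return false, nil // not checkable yet: requeue
		}
		ready = append(ready, pod)
	}
	if len(ready) == 0 {
		return false, nil // nothing to check yet: requeue
	}
	for _, pod := range ready {
		// An anonymous copy drops the admin credentials; the probe presents
		// only the certificate under test.
		podCfg := rest.AnonymousClientConfig(adminCfg)
		podCfg.TLSClientConfig.CertData = certPEM
		podCfg.TLSClientConfig.KeyData = keyPEM
		// Connect by PodIP, but validate TLS against the service name the
		// serving certificate was issued for (SNI + hostname check).
		podCfg.TLSClientConfig.ServerName = "kube-apiserver"
		port := containerPort(pod, "client", 6443)
		podCfg.Host = fmt.Sprintf("https://%s", net.JoinHostPort(pod.Status.PodIP, strconv.Itoa(int(port))))
		podClient, err := kubernetes.NewForConfig(podCfg)
		if err != nil {
			return false, fmt.Errorf("client for KAS pod %s/%s: %w", pod.Namespace, pod.Name, err)
		}
		if passed, err := testFunc(ctx, podClient); err != nil || !passed {
			return passed, err
		}
	}
	return true, nil
}
```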
🚥 Pre-merge checks: ✅ 10 passed | ❌ 2 failed (1 warning, 1 inconclusive)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go`:
- Around line 1320-1349: The test currently only asserts the aggregate boolean
result for verifyCertificateAgainstAllKASPods; update the case so the testFunc
used in the scenario records/counts invocations (e.g., increment a counter
closed over by the test) and assert that the counter equals the number of ready,
non-terminating pods (two in this case), in addition to checking expectedResult.
Modify the test case's testFunc and add an assertion after calling
verifyCertificateAgainstAllKASPods to enforce one callback invocation per ready
pod. (A sketch of this assertion appears after these inline comments.)
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`:
- Around line 603-612: The code currently forces pod-level TLS verification off
by setting podCfg.TLSClientConfig.Insecure = true and clearing CAData/CAFile;
instead, remove that line and restore CA verification by copying the CA from the
original admin config into the pod config (e.g., set
podCfg.TLSClientConfig.CAData = adminCfg.TLSClientConfig.CAData and
podCfg.TLSClientConfig.CAFile = adminCfg.TLSClientConfig.CAFile), while keeping
podCfg.TLSClientConfig.ServerName = hcpmanifests.KubeAPIServerServiceName so
SNI/hostname verification works for the pod connection in
certificaterevocationcontroller (use the podCfg and adminCfg symbols to locate
the change). (See the second sketch below.)
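To make the two suggestions concrete, here are hedged Go sketches. First, the invocation-counting assertion; `newTestController`, `twoReadyKASPods`, and the `verifyCertificateAgainstAllKASPods` call shape are illustrative stand-ins for the existing test fixtures, not the PR's exact code:

```go
func TestVerifyInvokesTestFuncPerReadyPod(t *testing.T) {
	calls := 0
	countingTestFunc := func(ctx context.Context, client kubernetes.Interface) (bool, error) {
		calls++ // one increment per pod probed
		return true, nil
	}

	// newTestController and twoReadyKASPods are hypothetical stand-ins for the
	// fixtures the existing table-driven test already builds.
	c := newTestController(twoReadyKASPods())
	result, err := c.verifyCertificateAgainstAllKASPods(context.Background(), countingTestFunc)
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
	if !result {
		t.Error("expected verification to pass")
	}
	if want := 2; calls != want { // two ready, non-terminating pods in the scenario
		t.Errorf("testFunc invoked %d times, want %d", calls, want)
	}
}
```

Second, the CA-restoration fix in code form; `adminCfg`, `certPEM`/`keyPEM`, and `hcpmanifests.KubeAPIServerServiceName` are the symbols named in the comment, and the surrounding loop is elided:

```go
podCfg := rest.AnonymousClientConfig(adminCfg)
podCfg.TLSClientConfig.CertData = certPEM
podCfg.TLSClientConfig.KeyData = keyPEM
// Keep server verification on by reusing the CA the admin config trusts...
podCfg.TLSClientConfig.CAData = adminCfg.TLSClientConfig.CAData
podCfg.TLSClientConfig.CAFile = adminCfg.TLSClientConfig.CAFile
// ...and pin SNI/hostname validation to the name on the serving certificate.
podCfg.TLSClientConfig.ServerName = hcpmanifests.KubeAPIServerServiceName
```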
⛔ Files ignored due to path filters (5) — all excluded by `!**/testdata/**`:
- control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/AROSwift/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml
- control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/GCP/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml
- control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml
- control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml
- control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml
📒 Files selected for processing (3)
- control-plane-operator/controllers/hostedcontrolplane/v2/assets/control-plane-pki-operator/role.yaml
- control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
- control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@ Coverage Diff @@
## main #8263 +/- ##
==========================================
+ Coverage 37.53% 37.83% +0.30%
==========================================
Files 751 399 -352
Lines 92026 63851 -28175
==========================================
- Hits 34544 24161 -10383
+ Misses 54841 37809 -17032
+ Partials 2641 1881 -760
```
... and 352 files with indirect coverage changes
Flags with carried forward coverage won't be shown.
Actionable comments posted: 1
🧹 Nitpick comments (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go (1)
99-100: Wire the Pods informer into the controller dependency chain. Pod readiness and IP now gate revocation progress. Calling `Pods().Informer()` for a side effect (lines 99-100) does not register the informer with the controller factory, so the pod cache won't participate in initial sync and pod transitions won't trigger re-enqueuing of CRR work. Pass the informer to the controller factory to make pod updates drive reconciliation immediately.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go`:
- Around line 603-617: The per-pod client probes currently inherit adminCfg with
no timeout, so create a bounded timeout for each pod: set podCfg.Timeout =
<reasonable duration> (e.g., 10s) after building podCfg (the struct from
rest.AnonymousClientConfig) to ensure the generated http client has a request
timeout, and also wrap the testFunc call with a child context with the same
timeout (use context.WithTimeout(ctx, <same duration>) and defer cancel) before
calling testFunc(ctxWithTimeout, podClient) to avoid hanging reconcile workers;
update references around podCfg, podClient, and testFunc accordingly.
📒 Files selected for processing (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
🧹 Nitpick comments (2)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go (2)
1376-1378: Strengthen the `listPods` mock contract. Line 1376 currently ignores `namespace` and `selector`, so selector/namespace regressions won't be caught by this test. Suggested diff:
```diff
 t.Run(testCase.name, func(t *testing.T) {
+	expectedSelector := labels.SelectorFromSet(labels.Set{"app": "kube-apiserver"})
 	c := &CertificateRevocationController{
 		listPods: func(namespace string, selector labels.Selector) ([]*corev1.Pod, error) {
+			if namespace != "test-ns" {
+				t.Fatalf("unexpected namespace: %q", namespace)
+			}
+			if selector.String() != expectedSelector.String() {
+				t.Fatalf("unexpected selector: %q", selector.String())
+			}
 			return testCase.pods, nil
 		},
 	}
```
1246-1373: Add explicit cases for the missing `PodIP` and error-propagation branches. The table currently misses the `PodIP == ""` requeue path and the error-return branches (`listPods` error and `testFunc` error). Covering these will harden regression detection in this critical flow. Suggested additions (table-driven cases):
```diff
 for _, testCase := range []struct {
 	name           string
 	pods           []*corev1.Pod
 	testFunc       func(ctx context.Context, client kubernetes.Interface) (bool, error)
+	listPodsErr    error
 	expectedResult bool
 	expectedErr    bool
 	expectedCalls  int
 }{
+	{
+		name: "When a ready pod has empty PodIP it should requeue",
+		pods: []*corev1.Pod{{
+			ObjectMeta: metav1.ObjectMeta{Name: "kas-1", Namespace: "test-ns"},
+			Spec:       kasPodSpec(),
+			Status: corev1.PodStatus{
+				PodIP: "",
+				Conditions: []corev1.PodCondition{{
+					Type:   corev1.PodReady,
+					Status: corev1.ConditionTrue,
+				}},
+			},
+		}},
+		expectedResult: false,
+		expectedCalls:  0,
+	},
+	{
+		name:          "When listing pods fails it should return an error",
+		listPodsErr:   assert.AnError,
+		expectedErr:   true,
+		expectedCalls: 0,
+	},
+	{
+		name: "When testFunc returns an error it should return an error",
+		pods: []*corev1.Pod{{
+			ObjectMeta: metav1.ObjectMeta{Name: "kas-1", Namespace: "test-ns"},
+			Spec:       kasPodSpec(),
+			Status: corev1.PodStatus{
+				PodIP:      "10.0.0.1",
+				Conditions: []corev1.PodCondition{{Type: corev1.PodReady, Status: corev1.ConditionTrue}},
+			},
+		}},
+		testFunc: func(_ context.Context, _ kubernetes.Interface) (bool, error) {
+			return false, assert.AnError
+		},
+		expectedErr:   true,
+		expectedCalls: 1,
+	},
```
📒 Files selected for processing (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
Force-pushed 0e87b3d to 3ac77d1 (Compare)
♻️ Duplicate comments (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go (1)
601-622: ⚠️ Potential issue | 🟠 Major — Add an explicit timeout to prevent hanging on unresponsive pods.
The per-pod client probes inherit `Timeout=0` from `adminCfg` (meaning no timeout). A blackholed or half-open PodIP can hang the reconcile worker indefinitely, since `rest.Config.Timeout=0` means "no timeout" in client-go.
🛡️ Proposed fix:
```diff
+	const perPodTimeout = 10 * time.Second
+
 	for _, pod := range readyPods {
 		port := containerPort(pod, "client", config.KASPodDefaultPort)
 		podCfg := rest.AnonymousClientConfig(adminCfg)
+		podCfg.Timeout = perPodTimeout
 		podCfg.TLSClientConfig.CertData = certPEM
 		podCfg.TLSClientConfig.KeyData = keyPEM
 		podCfg.Host = fmt.Sprintf("https://%s", net.JoinHostPort(pod.Status.PodIP, strconv.Itoa(int(port))))
 		// We're connecting to the PodIP, but the serving cert is still issued for the KAS service
 		// name. Keep CA verification enabled and override ServerName for SNI + hostname validation.
 		podCfg.TLSClientConfig.ServerName = hcpmanifests.KubeAPIServerServiceName
 		podClient, err := kubernetes.NewForConfig(podCfg)
 		if err != nil {
 			return false, fmt.Errorf("couldn't create client for KAS pod %s/%s: %w", pod.Namespace, pod.Name, err)
 		}
-		passed, err := testFunc(ctx, podClient)
+		podCtx, cancel := context.WithTimeout(ctx, perPodTimeout)
+		passed, err := testFunc(podCtx, podClient)
+		cancel() // explicit cleanup instead of defer inside loop
 		if err != nil {
 			return false, err
 		}
```
🧹 Nitpick comments (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go (1)
584-599: Consider adding debug logging for requeue reasons. When requeuing because a pod isn't ready or no pods are available, it may be helpful for debugging to log which pod or condition triggered the requeue. This is optional, since the behavior is correct.
💡 Optional: Add debug logging
```diff
 var readyPods []*corev1.Pod
 for _, pod := range pods {
 	if pod.DeletionTimestamp != nil {
 		continue
 	}
 	if !isPodReady(pod) || pod.Status.PodIP == "" {
 		// a non-terminating pod that's not ready: we can't check it yet, requeue
+		klog.V(4).Infof("KAS pod %s/%s not ready for verification (ready=%v, podIP=%q), requeueing",
+			pod.Namespace, pod.Name, isPodReady(pod), pod.Status.PodIP)
 		return false, nil
 	}
 	readyPods = append(readyPods, pod)
 }
 if len(readyPods) == 0 {
 	// no pods to check yet, requeue
+	klog.V(4).Infof("No ready KAS pods found in namespace %s, requeueing", namespace)
 	return false, nil
 }
```
⛔ Files ignored due to path filters (5) — all excluded by `!**/testdata/**`:
- control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/AROSwift/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml
- control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/GCP/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml
- control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/IBMCloud/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml
- control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/TechPreviewNoUpgrade/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml
- control-plane-operator/controllers/hostedcontrolplane/testdata/control-plane-pki-operator/zz_fixture_TestControlPlaneComponents_control_plane_pki_operator_role.yaml
📒 Files selected for processing (3)
- control-plane-operator/controllers/hostedcontrolplane/v2/assets/control-plane-pki-operator/role.yaml
- control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
- control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
✅ Files skipped from review due to trivial changes (1)
- control-plane-operator/controllers/hostedcontrolplane/v2/assets/control-plane-pki-operator/role.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
@sdminonne: This pull request references Jira Issue OCPBUGS-62177, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
🧹 Nitpick comments (1)
control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go (1)
99-100: Wire pod updates into the queue, not just the cache. The new gate now depends on pod readiness and `PodIP`, but this informer is only being initialized for lister access. That means CRRs still converge on those transitions only via `SyntheticRequeueError`. Hooking pod events into `enqueueAll(...)` would make rollouts react faster and cut needless requeues while KAS pods are coming up. (See the sketch below.)
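A minimal sketch of the suggested wiring, under assumed names: `registerPodHandlers` and `enqueueAll` are illustrative, not the controller factory API the repo actually uses.

```go
package certificaterevocationcontroller

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/tools/cache"
)

// registerPodHandlers wires KAS pod events into the controller's work queue so
// readiness/PodIP transitions trigger reconciles immediately instead of
// waiting on SyntheticRequeueError-driven retries.
func registerPodHandlers(factory informers.SharedInformerFactory, enqueueAll func()) {
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { enqueueAll() },
		UpdateFunc: func(oldObj, newObj interface{}) { enqueueAll() }, // readiness and IP changes arrive here
		DeleteFunc: func(obj interface{}) { enqueueAll() },
	})
}
```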
📒 Files selected for processing (2)
- control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller.go
- control-plane-pki-operator/certificaterevocationcontroller/certificaterevocationcontroller_test.go
Force-pushed 146e31a to 5b7f272 (Compare)
/jira refresh

@sdminonne: This pull request references Jira Issue OCPBUGS-62177, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira (jiezhao@redhat.com), skipping review request.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/pipeline required

Scheduling tests matching the …

AI Test Failure Analysis — Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6
AI Test Failure Analysis — Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

Root Cause: The test uses hyperv1.NonePlatform in an AKS environment that requires explicit route hostnames. Since …
/pipeline required

Scheduling tests matching the …

AI Test Failure Analysis — Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6
AI Test Failure Analysis — Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

/test e2e-aws

AI Test Failure Analysis — Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6

/retest-required

Scheduling tests matching the …
- Add IsPodReady() to check pod Ready condition status
- Add ContainerPort() to look up named container ports with a default fallback
- Add KASReadinessCheckContainer() sidecar that probes KAS /livez for PDB-aware readiness gating
- Add EnforceRestrictedSecurityContextToContainers() for restricted pod security standards enforcement

These helpers are used by the control-plane-pki-operator to connect directly to individual KAS pods during certificate revocation verification.

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
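As a rough illustration of the first two helpers this commit names, here is a minimal exported form (the same shapes the earlier loop sketch assumed); the real `support/podspec` implementations may handle more cases:

```go
package podspec

import corev1 "k8s.io/api/core/v1"

// IsPodReady reports whether the pod's Ready condition is True.
func IsPodReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// ContainerPort returns the container port with the given name, falling back
// to def when no container declares a port by that name.
func ContainerPort(pod *corev1.Pod, name string, def int32) int32 {
	for _, container := range pod.Spec.Containers {
		for _, port := range container.Ports {
			if port.Name == name {
				return port.ContainerPort
			}
		}
	}
	return def
}
```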
The CertificateRevocationController now connects directly to each KAS pod IP instead of going through the service load balancer during certificate verification. This prevents false positives in HA deployments where the LB might route to a single pod.

- Add verifyCertificateAgainstAllKASPods() that lists KAS pods and verifies each individually via direct PodIP connections with SNI
- Add cross-check against KAS Deployment spec.replicas to ensure all pods are visible before proceeding with verification
- Wire pod informer to trigger reconciliation on KAS pod transitions
- Add cross-verification in both ensureNewSignerCertificatePropagated (old signer still trusted) and ensureOldSignerCertificateRevoked (new signer still trusted) to detect mid-reload transient states
- Grant the control-plane-pki-operator RBAC to get deployments and list/get pods in the HCP namespace

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
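Sketched below is the Deployment replica cross-check this commit describes, under assumed names (`allKASPodsVisible` is illustrative, not necessarily the function in the PR):

```go
package certrevocation

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// allKASPodsVisible refuses verification until the lister sees at least as
// many non-terminating KAS pods as the Deployment expects, so a pod missing
// from the cache cannot be skipped silently.
func allKASPodsVisible(deployment *appsv1.Deployment, pods []*corev1.Pod) bool {
	expected := int32(1) // Kubernetes defaults nil spec.replicas to 1
	if deployment.Spec.Replicas != nil {
		expected = *deployment.Spec.Replicas
	}
	var visible int32
	for _, pod := range pods {
		if pod.DeletionTimestamp == nil {
			visible++
		}
	}
	return visible >= expected
}
```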
- Add TestCreateClusterHABreakGlassCredentials e2e test that creates a 3-replica HA hosted cluster and runs the break-glass credential tests
- Replace one-shot SSR revocation check with a polling loop through the service LB (1s interval, 3m timeout) to tolerate transient states when a KAS pod restarts between the controller's verification and the test's check
- Detect worker node count from NodePool replicas for private clusters where direct node listing is unavailable
- Use CI platform for the HA break-glass credential test

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
Commit-Message-Assisted-by: Claude (via Claude Code)
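A sketch of the polling loop this commit describes, with the 1s interval and 3m timeout from the message; `waitForRevocation` is an illustrative name and `revokedClient` (a clientset built from the revoked credential) is assumed:

```go
package e2e

import (
	"context"
	"time"

	authenticationv1 "k8s.io/api/authentication/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForRevocation polls a SelfSubjectReview through the service load
// balancer until the revoked credential is cleanly rejected with 401.
func waitForRevocation(ctx context.Context, revokedClient kubernetes.Interface) error {
	return wait.PollUntilContextTimeout(ctx, time.Second, 3*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			_, err := revokedClient.AuthenticationV1().SelfSubjectReviews().Create(
				ctx, &authenticationv1.SelfSubjectReview{}, metav1.CreateOptions{})
			if apierrors.IsUnauthorized(err) {
				return true, nil // rejected: revocation fully propagated
			}
			return false, nil // still accepted or transient failure: retry
		})
}
```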
/test e2e-aws

/test e2e-aks

/test e2e-aks-4-22

/test e2e-aws-4-22

AI Test Failure Analysis — Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6
AI Test Failure Analysis — Job: … Generated by hypershift-analyze-e2e-failure post-step using Claude claude-opus-4-6
@sdminonne: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Test Failure Analysis Complete

Job Information

Test Failure Analysis — Error

Summary

Root Cause — SSR calls with a revoked client certificate hang at the TLS handshake instead of returning 401 Unauthorized

PR #8263 ( …
The critical evidence: the log line …
Why HA passed but non-HA hung:
In the HA case with 3 KAS pods, the service load balancer may route the SSR to a pod that has already fully reloaded its trust bundle and cleanly rejects the cert. In the SingleReplica case, the single KAS pod may be in a transient state during trust bundle reload where it neither accepts nor cleanly rejects the revoked cert, causing the TLS handshake to stall. The controller's per-pod verification (which connected by PodIP and bypassed the LB) succeeded, but the subsequent LB-routed SSR encounters the same pod in a different connection state.

Contributing factor — missing effective timeout: The …

Recommendations
Evidence

1. Test never completed (build-log.txt tail): …
2. Hang location (hypershift-aws-run-e2e-nested-build-log.txt:1646, 1799): no further output from these sub-tests. No …
3. CRR completed successfully: …
4. HA test passed: …
5. TestCreateCluster never emitted PASS/FAIL/SKIP — confirmed by scanning all test results in the build log. Every other …
6. PR #8263 changed …
7. PR #8263 added per-pod verification in the controller ( …

Artifacts
The validateRevocation polling loop could hang indefinitely when KAS stalls the TLS handshake during trust bundle reload after certificate revocation. Without a per-request HTTP timeout, the blocking I/O prevented the 3-minute PollUntilContextTimeout from interrupting the call, causing TestCreateCluster to hang until the 2h Prow timeout.

Set a 10-second per-request timeout on the client so stalled TLS handshakes are interrupted and the poll loop can retry. Also add logging before the SSR call to distinguish between a blocking call and a non-401 response in future CI logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
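In code, the change amounts to something like the following sketch; `restConfig` stands for whatever rest.Config the e2e test builds for the revoked credential:

```go
// Bound each HTTP request so a stalled TLS handshake cannot outlive a single
// poll iteration; the surrounding PollUntilContextTimeout then retries.
restConfig.Timeout = 10 * time.Second
revokedClient, err := kubernetes.NewForConfig(restConfig)
if err != nil {
	t.Fatalf("failed to construct client for revoked credential: %v", err)
}
```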
/test e2e-aks

/test e2e-aws
What this PR does / why we need it:
Updates the `CertificateRevocationController` to verify certificate trust and revocation against every individual KAS pod rather than through the service load balancer. In HA deployments with 3 KAS replicas, the service endpoint load-balances to a single pod, so the check could hit a pod that had loaded the updated trust bundle while others hadn't, causing premature state transitions in the revocation flow.

Changes
control-plane-pki-operator (`CertificateRevocationController`):

- `verifyCertificateAgainstAllKASPods`, which enumerates ready KAS pods via label selector and connects to each pod IP directly (with SNI set to the KAS service name for TLS validation)
- `ensureOldSignerCertificateRevoked`: before declaring a certificate revoked on a pod, verify the new leaf cert is also being served by that pod, preventing false positives when a pod restart temporarily makes the old cert unreachable
- `verifyCertificateTrusted` and `verifyCertificateRevoked` helpers for per-pod SelfSubjectReview checks

control-plane-operator (RBAC):

- `list`/`watch` on pods for the `control-plane-pki-operator` role so it can enumerate KAS pods (update role asset + 5 test fixtures)

support/podspec:

- `IsPodReady` and `ContainerPort` helpers for pod readiness checks and named port lookup

e2e:

- `TestCreateClusterHABreakGlassCredentials` exercising the break-glass credential flow on a HighlyAvailable control plane (3 KAS replicas)
- Rename `ha-break-glass-credentials` to `ha-break-glass-creds` to stay within Azure's 64-char VNet name limit

Context
This was previously fixed in PRs #7405 + #7744 but reverted in #7784 because the e2e test added alongside it had a ~50% flake rate. The controller fix itself was sound — the flakiness was in the e2e test wiring. This PR re-applies the controller logic with comprehensive unit tests and a different, more targeted HA e2e test.
Which issue(s) this PR fixes:
Fixes https://issues.redhat.com/browse/OCPBUGS-62177
Special notes for your reviewer:
The dual-cert cross-check (`ensureOldSignerCertificateRevoked` verifying the new cert is served before declaring the old one revoked) is the key addition beyond the original fix. Without it, a KAS pod restart during revocation could cause a false positive: the old cert becomes unreachable (connection refused), which looks like revocation but is actually just the pod being down.

Checklist: