USHIFT-6401: Patch unbounded KAS context to break pre-hook deadlock#6635

Draft
copejon wants to merge 2 commits into openshift:main from
copejon:fix-USHIFT-6401-alt-fix

Conversation


@copejon copejon commented May 7, 2026

Replace context.TODO() with the hook's cancelable context in the RBAC bootstrap post-start hook helpers (primeAggregatedClusterRoles, primeSplitClusterRoleBindings)

Summary by CodeRabbit

  • Refactor

    • Enhanced RBAC policy initialization and reconciliation with improved context propagation throughout role and role binding operations, enabling better cancellation and timeout handling.
  • Chores

    • Updated test suite to align with context handling improvements.

copejon added 2 commits May 6, 2026 11:24
…tion

On MicroShift restart, the RBAC bootstrap hook can deadlock when etcd
contains existing data. The hook uses context.TODO() for API calls,
which has no timeout. When the loopback client hangs, this creates a
circular dependency where the hook waits for the API server while the
API server waits for the hook to complete.

This change adds a parallel deadlock detector that:
- Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
- Checks if etcd is healthy while the hook is stuck
- Detects deadlock in ~15 seconds instead of waiting 60 seconds
- Restarts microshift-etcd.scope to recover from the deadlock

This breaks the crash loop by detecting the condition early and taking
recovery action at the MicroShift level, without requiring changes to
vendored upstream Kubernetes code.

Related upstream issues: kubernetes/kubernetes#86715, #97119

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
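
The detection rule described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the commit: `probe`, `detectDeadlock`, and `simulate` are hypothetical names, the probes are injected functions rather than real HTTP or etcd checks, and the recovery step (restarting microshift-etcd.scope) is left out. The timings are shortened so the example runs quickly; the message above describes a threshold of about 15 seconds.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// probe is a hypothetical health check. In the scenario described above, one
// probe would query /readyz/poststarthook/rbac/bootstrap-roles and another
// would check etcd health.
type probe func(ctx context.Context) bool

// detectDeadlock polls both probes every interval. It returns true once the
// hook has stayed not-ready while etcd was healthy for at least threshold,
// and false if the hook becomes ready or ctx ends first.
func detectDeadlock(ctx context.Context, hookReady, etcdHealthy probe, interval, threshold time.Duration) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var stuckSince time.Time // zero value means "not currently considered stuck"
	for {
		select {
		case <-ctx.Done():
			return false
		case now := <-ticker.C:
			if hookReady(ctx) {
				return false // the hook finished, so there is no deadlock
			}
			if !etcdHealthy(ctx) {
				// etcd itself is unhealthy: a different failure, reset the timer.
				stuckSince = time.Time{}
				continue
			}
			if stuckSince.IsZero() {
				stuckSince = now
			}
			if now.Sub(stuckSince) >= threshold {
				return true
			}
		}
	}
}

// simulate runs the detector against fixed probe results with short timings.
func simulate(hookIsReady, etcdIsHealthy bool) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()
	hook := func(context.Context) bool { return hookIsReady }
	etcd := func(context.Context) bool { return etcdIsHealthy }
	return detectDeadlock(ctx, hook, etcd, 5*time.Millisecond, 30*time.Millisecond)
}

func main() {
	fmt.Println("hook stuck, etcd healthy:", simulate(false, true))
	fmt.Println("hook ready:", simulate(true, true))
	fmt.Println("hook stuck, etcd unhealthy:", simulate(false, false))
}
```

Requiring etcd to be healthy before declaring a deadlock keeps the detector from restarting etcd when etcd is the component that is actually down.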

The KAS RBAC pre-hook creates an unbounded context that can result in a system deadlock. Replace the unbounded context with the caller-bound context so that KAS can be restarted safely without restarting MicroShift.
  The RBAC bootstrap hook's helper functions primeAggregatedClusterRoles
  and primeSplitClusterRoleBindings use context.TODO(), which has no
  timeout or cancellation. On resource-constrained systems, these calls
  can block indefinitely through the loopback client, causing KAS to
  deadlock on restart with no recovery path.

  Pass the hook's cancelable context through to all API calls so they
  respect shutdown signals and cannot hang forever.

  Upstream: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119
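
To illustrate why the choice of context matters here, the following standard-library-only sketch simulates a call that never receives a response. `hungGet` and `callWithHookContext` are hypothetical stand-ins, not functions from storage_rbac.go, and the shutdown signal is simulated with a timer.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// hungGet is a hypothetical stand-in for a loopback client call that never
// gets a response. It returns only once ctx is done.
func hungGet(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// callWithHookContext simulates a hook whose context is cancelled on
// shutdown and returns the error produced by the blocked call.
func callWithHookContext(shutdownAfter time.Duration) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Simulate the server signalling shutdown after a short delay.
	timer := time.AfterFunc(shutdownAfter, cancel)
	defer timer.Stop()

	return hungGet(ctx)
}

// cancelUnblocksCall reports whether the blocked call returned
// context.Canceled once the simulated hook context was cancelled.
func cancelUnblocksCall() bool {
	return errors.Is(callWithHookContext(10*time.Millisecond), context.Canceled)
}

func main() {
	// context.TODO() returns a nil Done channel, so hungGet(context.TODO())
	// would block forever; it is deliberately not called here.
	fmt.Println("context.TODO().Done() == nil:", context.TODO().Done() == nil)
	fmt.Println("call unblocked by cancel:", cancelUnblocksCall())
}
```

With a cancelable context the blocked call has an exit path; with context.TODO() nothing can ever close the channel it waits on.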
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 7, 2026

openshift-ci-robot commented May 7, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

Details

In response to this:

Replace context.TODO() with the hook's cancelable context in the RBAC bootstrap post-start hook helpers (primeAggregatedClusterRoles, primeSplitClusterRoleBindings)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2026

openshift-ci Bot commented May 7, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all


coderabbitai Bot commented May 7, 2026

Walkthrough

Context is propagated through the RBAC bootstrap flow. EnsureRBACPolicy() delegates to a context-aware ensureRBACPolicy() function that uses the provided context for etcd readiness checks and priming operations. Helper functions primeAggregatedClusterRoles and primeSplitClusterRoleBindings are updated to accept and use the context parameter instead of hardcoded context.TODO().
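
The shape of this change can be sketched with a trimmed-down example. Everything here is an assumption made for illustration: `roleClient`, `fakeClient`, and `primeRole` are hypothetical, and the real client-go interfaces and the helpers in storage_rbac.go take different parameters. The point is only that the context arrives as a parameter and is forwarded to every API call.

```go
package main

import (
	"context"
	"fmt"
)

// roleClient is a hypothetical, simplified stand-in for a typed RBAC client.
type roleClient interface {
	Get(ctx context.Context, name string) (found bool, err error)
	Create(ctx context.Context, name string) error
}

// primeRole takes ctx as its first parameter and forwards it to each call
// instead of creating context.TODO() inline.
func primeRole(ctx context.Context, c roleClient, name string) error {
	found, err := c.Get(ctx, name)
	if err != nil {
		return err
	}
	if found {
		return nil
	}
	return c.Create(ctx, name)
}

// fakeClient refuses to do work once its context is done, roughly the way
// an HTTP-backed client would.
type fakeClient struct{ created []string }

func (f *fakeClient) Get(ctx context.Context, name string) (bool, error) {
	return false, ctx.Err()
}

func (f *fakeClient) Create(ctx context.Context, name string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	f.created = append(f.created, name)
	return nil
}

// primeAfterCancel reports whether priming stops with an error and creates
// nothing when the context is already cancelled.
func primeAfterCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	c := &fakeClient{}
	err := primeRole(ctx, c, "example-role")
	return err != nil && len(c.created) == 0
}

// primeWithLiveContext returns how many roles were created with a live context.
func primeWithLiveContext() int {
	c := &fakeClient{}
	_ = primeRole(context.Background(), c, "example-role")
	return len(c.created)
}

func main() {
	fmt.Println("stops after cancel:", primeAfterCancel())
	fmt.Println("roles created with live context:", primeWithLiveContext())
}
```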

Changes

RBAC Context Propagation

| Layer / File(s) | Summary |
| --- | --- |
| Function Signatures / deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go | ensureRBACPolicy, primeAggregatedClusterRoles, and primeSplitClusterRoleBindings are updated to accept context.Context as a parameter. |
| Context Usage in API Calls / storage_rbac.go | Etcd readiness checks and Get/Create/List API calls now use the provided context instead of context.TODO() throughout initialization and priming operations. |
| Tests / deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go | Benchmark calls ensureRBACPolicy with context.Background() as the first argument. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main change: replacing context.TODO() with cancelable context in RBAC bootstrap helpers to fix a pre-hook deadlock issue on resource-constrained devices. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo test specs found in modified files. Changes are in production code (storage_rbac.go) and a standard Go benchmark test (storage_rbac_test.go). Check not applicable. |
| Test Structure And Quality | ✅ Passed | Test file contains only a benchmark test, not Ginkgo tests. The custom check requires review of Ginkgo test code (It blocks, BeforeEach/AfterEach), which is not applicable here. |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests were added in this PR. The modified test file is a standard Go unit test with a benchmark function, not a Ginkgo e2e test. The check is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests are added in this PR. The modified files contain only a benchmark test. The SNO compatibility check applies only when new e2e tests are added and is not applicable here. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Changes affect only Kubernetes RBAC bootstrap Go code. No deployment manifests, operators, or controllers added. Check not applicable. |
| Ote Binary Stdout Contract | ✅ Passed | PR contains only function signature updates (context parameter additions) with no stdout writes or logging changes at process-level code. OTE Binary Stdout Contract is unviolated. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. Changes are RBAC implementation code context propagation and benchmark test updates. Check applies only to new Ginkgo tests. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.1)

level=warning msg="The linter 'gomodguard' is deprecated (since v2.12.0) due to: new major version. Replaced by gomodguard_v2."
level=warning msg="Suggested new configuration:\nlinters:\n enable:\n - gomodguard_v2\n"
level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/apparentlymart/go-cidr@v1.1.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/coreos/go-systemd@v0.0.0-20190321100706-95778dfbb74e: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/google/go-cmp@v0.7.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/miekg/dns@v1.1.63: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/openshift/api@v0.0.0-20260309155933-45fd88d185dd: is

... [truncated 29740 characters] ...

elet: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/metrics: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/mount-utils: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/pod-security-admission: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-apiserver: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-cli-plugin: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\tk8s.io/sample-controller: is replaced in go.mod, but not marked as replaced in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"

openshift-ci Bot commented May 7, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: copejon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go (1)

166-173: ⚡ Quick win

wait.Poll loop is not context-aware — cancellation won't short-circuit it.

The hook context is now correctly threaded into the inner function, so individual API calls will fail fast when the context is cancelled. However, wait.Poll itself has no awareness of the context: after cancellation it keeps invoking the condition every second, and as long as the condition returns (false, nil) the loop can keep blocking until the 30-second timeout expires. Replacing it with a context-aware variant such as wait.PollUntilContextTimeout (or wait.PollWithContext) lets the loop honor the shutdown signal.

♻️ Proposed refactor
-		err := wait.Poll(1*time.Second, 30*time.Second, func() (done bool, err error) {
+		err := wait.PollUntilContextTimeout(hookContext.Context, 1*time.Second, 30*time.Second, true, func(ctx context.Context) (done bool, err error) {
 			client, err := clientset.NewForConfig(hookContext.LoopbackClientConfig)
 			if err != nil {
 				utilruntime.HandleError(fmt.Errorf("unable to initialize client set: %v", err))
 				return false, nil
 			}
-			return ensureRBACPolicy(hookContext, p, client)
+			return ensureRBACPolicy(ctx, p, client)
 		})

Note: adjust hookContext.Context to hookContext if PostStartHookContext embeds context.Context.
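
For readers unfamiliar with context-aware polling, the behavior the proposed refactor relies on can be approximated with the standard library alone. The sketch below is a simplified stand-in written for illustration, not the apimachinery implementation; `pollUntilContextTimeout` and `pollStopsOnCancel` are hypothetical names, and the parameter order is chosen to resemble the call in the diff above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pollUntilContextTimeout is a simplified, standard-library-only
// approximation of a context-aware poll loop. It stops when the condition
// reports done or an error, when the timeout elapses, or when the parent
// context is cancelled, whichever comes first.
func pollUntilContextTimeout(parent context.Context, interval, timeout time.Duration, immediate bool,
	cond func(ctx context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	if immediate {
		if done, err := cond(ctx); err != nil || done {
			return err
		}
	}

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			// Cancellation or timeout ends the loop without waiting further.
			return ctx.Err()
		case <-ticker.C:
			if done, err := cond(ctx); err != nil || done {
				return err
			}
		}
	}
}

// pollStopsOnCancel cancels the parent context while the condition keeps
// returning (false, nil) and reports whether the loop ended early with
// context.Canceled instead of running until the 5-second timeout.
func pollStopsOnCancel() bool {
	parent, cancel := context.WithCancel(context.Background())
	defer cancel()
	timer := time.AfterFunc(20*time.Millisecond, cancel)
	defer timer.Stop()

	start := time.Now()
	err := pollUntilContextTimeout(parent, 5*time.Millisecond, 5*time.Second, true,
		func(context.Context) (bool, error) { return false, nil })
	return errors.Is(err, context.Canceled) && time.Since(start) < time.Second
}

func main() {
	fmt.Println("poll stopped on cancel:", pollStopsOnCancel())
}
```

Passing the derived context into the condition function means a single cancellation stops both the in-flight call and the loop around it.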

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go`
around lines 166 - 173, The wait.Poll call in the RBAC setup loop is not
context-aware and can block after cancellation; replace the wait.Poll invocation
in storage_rbac.go with a context-aware variant (e.g., wait.PollWithContext or
wait.PollUntilContextTimeout) so the loop short-circuits on hookContext
cancellation; pass the hookContext (or hookContext.Context if
PostStartHookContext embeds context.Context) as the context argument and keep
the same polling interval and timeout while preserving the existing
ensureRBACPolicy(hookContext, p, client) call and error handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9f88f2a4-91ab-409f-bd6c-6d75d87351ef

📥 Commits

Reviewing files that changed from the base of the PR and between e98bbde and d255cf8.

⛔ Files ignored due to path filters (1)
  • vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (2)
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

2 participants