USHIFT-6401: Patch unbounded KAS context to break pre-hook deadlock #6635
copejon wants to merge 2 commits into openshift:main from
Conversation
On MicroShift restart, the RBAC bootstrap hook can deadlock when etcd contains existing data. The hook uses context.TODO() for API calls, which has no timeout. When the loopback client hangs, this creates a circular dependency: the hook waits for the API server while the API server waits for the hook to complete.

This change adds a parallel deadlock detector that:
- Monitors /readyz/poststarthook/rbac/bootstrap-roles specifically
- Checks whether etcd is healthy while the hook is stuck
- Detects the deadlock in ~15 seconds instead of waiting 60 seconds
- Restarts microshift-etcd.scope to recover from the deadlock

This breaks the crash loop by detecting the condition early and taking recovery action at the MicroShift level, without requiring changes to vendored upstream Kubernetes code. Related upstream issues: kubernetes/kubernetes#86715, #97119

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The KAS RBAC pre-hook creates an unbounded context that can result in system deadlock. Replace the unbounded context with a caller-bound context to ensure the KAS can be restarted safely without restarting MicroShift.
The RBAC bootstrap hook's helper functions primeAggregatedClusterRoles and primeSplitClusterRoleBindings use context.TODO(), which has no timeout or cancellation. On resource-constrained systems, these calls can block indefinitely through the loopback client, causing KAS to deadlock on restart with no recovery path. Pass the hook's cancelable context through to all API calls so they respect shutdown signals and cannot hang forever.

Upstream: kubernetes/kubernetes#86715, kubernetes/kubernetes#97119
@copejon: This pull request references USHIFT-6401 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

Details
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
Walkthrough
Context is propagated through the RBAC bootstrap flow.

Changes
RBAC Context Propagation
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 12 passed
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.1)
level=warning msg="The linter 'gomodguard' is deprecated (since v2.12.0) due to: new major version. Replaced by gomodguard_v2."
... [truncated 29740 characters] ...
Several k8s.io modules (kubelet, metrics, mount-utils, pod-security-admission, sample-apiserver, sample-cli-plugin, sample-controller) are replaced in go.mod but not marked as replaced in vendor/modules.txt. To ignore the vendor directory, use -mod=readonly or -mod=mod; to sync it, run: go mod vendor
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: copejon. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
🧹 Nitpick comments (1)
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go (1)
166-173: ⚡ Quick win

`wait.Poll` loop is not context-aware; cancellation won't short-circuit it.

The hook context is now correctly threaded into the inner function, so individual API calls will fail fast when the context is cancelled. However, `wait.Poll` itself has no awareness of the context; if the context is cancelled mid-poll-interval, the loop continues blocking for up to 30 more seconds before the next iteration observes the error. Replacing it with `wait.PollWithContext` (or `wait.PollUntilContextTimeout`) fully honors the shutdown signal.

♻️ Proposed refactor

```diff
-	err := wait.Poll(1*time.Second, 30*time.Second, func() (done bool, err error) {
+	err := wait.PollUntilContextTimeout(hookContext.Context, 1*time.Second, 30*time.Second, true, func(ctx context.Context) (done bool, err error) {
 		client, err := clientset.NewForConfig(hookContext.LoopbackClientConfig)
 		if err != nil {
 			utilruntime.HandleError(fmt.Errorf("unable to initialize client set: %v", err))
 			return false, nil
 		}
-		return ensureRBACPolicy(hookContext, p, client)
+		return ensureRBACPolicy(ctx, p, client)
 	})
```

Note: adjust `hookContext.Context` to `hookContext` if `PostStartHookContext` embeds `context.Context`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go` around lines 166 - 173, The wait.Poll call in the RBAC setup loop is not context-aware and can block after cancellation; replace the wait.Poll invocation in storage_rbac.go with a context-aware variant (e.g., wait.PollWithContext or wait.PollUntilContextTimeout) so the loop short-circuits on hookContext cancellation; pass the hookContext (or hookContext.Context if PostStartHookContext embeds context.Context) as the context argument and keep the same polling interval and timeout while preserving the existing ensureRBACPolicy(hookContext, p, client) call and error handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9f88f2a4-91ab-409f-bd6c-6d75d87351ef
⛔ Files ignored due to path filters (1)
vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (2)
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go
Replace `context.TODO()` with the hook's cancelable context in the RBAC bootstrap post-start hook helpers (`primeAggregatedClusterRoles`, `primeSplitClusterRoleBindings`).

Summary by CodeRabbit
Refactor
Chores