CNTRLPLANE-3167: Auto-detect credential type for HCPEtcdBackup by jparrill · Pull Request #8368 · openshift/hypershift

jparrill · 2026-04-29T11:14:11Z

Summary

Auto-detect credential mode (STS/IRSA for AWS, Workload Identity for Azure) from the credential Secret content in the HCPEtcdBackup controller
Extract credential resolution logic into credentials.go with dedicated unit tests
Add managed services credential documentation covering OADP plugin Secret flow, IAM/federated credential prerequisites, and troubleshooting

Description

ROSA HCP and ARO HCP use short-lived federated credentials instead of long-lived static keys. This PR enables the HCPEtcdBackup controller to auto-detect the credential type from the Secret format:

AWS STS/IRSA: When the credentials key contains a bare IAM role ARN (arn:...), the controller configures a projected SA token volume with sts.amazonaws.com audience and sets AWS_ROLE_ARN/AWS_WEB_IDENTITY_TOKEN_FILE env vars
Azure Workload Identity: When the Secret has a cloud key with a non-empty AZURE_CLIENT_ID=, the controller adds the azure.workload.identity/use=true pod label and annotates the SA with azure.workload.identity/client-id
Static credentials (AWS INI file, Azure client-secret, Azure managed-identity) continue to work as before

No API changes — detection is purely based on Secret content inspection.

Related PRs

Plugin side: openshift/hypershift-oadp-plugin#247 — OADP plugin changes to copy credentials preserving the cloud key for Azure WI detection

Jira

CNTRLPLANE-3167

Test plan

Unit tests for resolveAWSCredentials, resolveAzureCredentials, resolveCredentials, needsCredentialsFile
Unit tests for createBackupJob with STS, Workload Identity, and managed-identity modes
Unit tests for ensureServiceAccount annotation management
E2E: AWS STS backup on ROSA HCP cluster
E2E: Azure Workload Identity backup on ARO HCP cluster

🤖 Generated with Claude Code

openshift-merge-bot · 2026-04-29T11:14:15Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

openshift-ci-robot · 2026-04-29T11:14:17Z

@jparrill: This pull request references CNTRLPLANE-3167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Auto-detect credential mode (STS/IRSA for AWS, Workload Identity for Azure) from the credential Secret content in the HCPEtcdBackup controller

Extract credential resolution logic into credentials.go with dedicated unit tests

Add managed services credential documentation covering OADP plugin Secret flow, IAM/federated credential prerequisites, and troubleshooting

Description

ROSA HCP and ARO HCP use short-lived federated credentials instead of long-lived static keys. This PR enables the HCPEtcdBackup controller to auto-detect the credential type from the Secret format:

AWS STS/IRSA: When the credentials key contains a bare IAM role ARN (arn:...), the controller configures a projected SA token volume with sts.amazonaws.com audience and sets AWS_ROLE_ARN/AWS_WEB_IDENTITY_TOKEN_FILE env vars

Azure Workload Identity: When the Secret has a cloud key, the controller adds the azure.workload.identity/use=true pod label and annotates the SA with azure.workload.identity/client-id

Static credentials (AWS INI file, Azure client-secret, Azure managed-identity) continue to work as before

No API changes — detection is purely based on Secret content inspection.

Jira

CNTRLPLANE-3167

Test plan

Unit tests for resolveAWSCredentials, resolveAzureCredentials, resolveCredentials, needsCredentialsFile

Unit tests for createBackupJob with STS, Workload Identity, and managed-identity modes

Unit tests for ensureServiceAccount annotation management

E2E: AWS STS backup on ROSA HCP cluster

E2E: Azure Workload Identity backup on ARO HCP cluster

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-04-29T11:14:28Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

The pull request implements credential auto-detection and handling for HCPEtcdBackup, adding a credential-resolution module that classifies AWS (static vs STS/IRSA) and Azure (workload-identity, client-secret, managed-identity) modes from Secret contents. The reconciler now derives a resolvedCredentials value and passes it to ServiceAccount and Job creation, which conditionally set annotations, labels, volumes, mounts, env vars, and CLI flags. Unit tests validate parsing and Job/ServiceAccount behavior. Docs and navigation add a Managed Services Credentials tech-preview guide.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Reconciler
    participant APIServer as Kubernetes API
    participant Cloud as Cloud Platform

    User->>APIServer: Create HCPEtcdBackup CR + credentials Secret
    Reconciler->>APIServer: Watch HCPEtcdBackup
    Note over Reconciler: Trigger reconciliation

    Reconciler->>APIServer: Read credentials Secret
    APIServer-->>Reconciler: Secret data (cloud, credentials)

    Note over Reconciler: Resolve credential mode
    alt AWS STS detected
        Reconciler->>APIServer: Create ServiceAccount
        Reconciler->>APIServer: Create Job with projected token volume
        Reconciler->>APIServer: Set AWS_ROLE_ARN & AWS_WEB_IDENTITY_TOKEN_FILE env
        Note over Reconciler: No --credentials-file
    else AWS Static detected
        Reconciler->>APIServer: Create ServiceAccount
        Reconciler->>APIServer: Create Job with credentials secret volume
        Note over Reconciler: Include --credentials-file
    else Azure Workload Identity detected
        Reconciler->>APIServer: Create ServiceAccount with azure.workload.identity/client-id
        Reconciler->>APIServer: Create Job with pod label azure.workload.identity/use=true
        Note over Reconciler: No --credentials-file
    else Azure Client Secret detected
        Reconciler->>APIServer: Create ServiceAccount
        Reconciler->>APIServer: Create Job with credentials secret volume
        Note over Reconciler: Include --credentials-file and --azure-auth-type client-secret
    else Azure Managed Identity detected
        Reconciler->>APIServer: Create ServiceAccount
        Reconciler->>APIServer: Create Job with credentials secret volume
        Note over Reconciler: Include --credentials-file and --azure-auth-type managed-identity
    end

    Reconciler->>APIServer: Job pod scheduled
    APIServer->>Cloud: Pod authenticates using detected credential mode
    Cloud-->>APIServer: Credentials validated
    Note over Cloud: Pod performs etcd snapshot upload

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title clearly and specifically summarizes the main change: auto-detecting credential types for HCPEtcdBackup, which is the primary objective reflected throughout all changeset files (credentials resolution logic, controller refactoring, documentation, and tests).
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	All 21 test functions across test files use standard Go testing with t.Run() subtests containing 54 static, deterministic test names following consistent patterns with no dynamic content.
Test Structure And Quality	✅ Passed	Tests demonstrate high quality with single responsibility, focused subtests, functional helper setup/cleanup, exclusive use of fake clients without timeouts, and consistency with repository conventions.
Microshift Test Compatibility	✅ Passed	This PR adds only standard Go unit tests, not Ginkgo e2e tests. E2e tests are explicitly listed as TODO and not implemented.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	This pull request does not add any Ginkgo e2e tests. The new test files are unit tests using standard Go testing conventions, not Ginkgo patterns. No new e2e test files were added.
Topology-Aware Scheduling Compatibility	✅ Passed	The PR introduces no scheduling constraints that assume a standard HA topology; backup Job uses standard Kubernetes scheduling without affinity rules, node selectors, or topology spread constraints.
Ote Binary Stdout Contract	✅ Passed	PR introduces credential resolution logic and controller modifications without violating OTE Binary Stdout Contract; all logging uses controller-runtime structured logging framework.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	The PR does not add any Ginkgo e2e tests. Test files use standard Go testing package, not Ginkgo patterns.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-04-29T11:14:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [jparrill]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

codecov · 2026-04-29T11:21:23Z

Codecov Report

❌ Patch coverage is 95.52239% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.60%. Comparing base (76d06fa) to head (313e3da).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
...hift-operator/controllers/etcdbackup/reconciler.go	95.03%	4 Missing and 3 partials ⚠️
...ift-operator/controllers/etcdbackup/credentials.go	96.66%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8368      +/-   ##
==========================================
+ Coverage   37.50%   37.60%   +0.09%     
==========================================
  Files         751      752       +1     
  Lines       91992    92137     +145     
==========================================
+ Hits        34505    34645     +140     
- Misses      54844    54848       +4     
- Partials     2643     2644       +1

Files with missing lines	Coverage Δ
...ift-operator/controllers/etcdbackup/credentials.go	`96.66% <96.66%> (ø)`
...hift-operator/controllers/etcdbackup/reconciler.go	`76.69% <95.03%> (+2.35%)`	⬆️

Flag	Coverage Δ
cmd-support	`32.68% <ø> (ø)`
cpo-hostedcontrolplane	`36.77% <ø> (ø)`
cpo-other	`37.73% <ø> (ø)`
hypershift-operator	`48.23% <95.52%> (+0.29%)`	⬆️
other	`27.77% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`:
- Around line 275-289: The unlabeled fenced code blocks under the "AWS STS:
Access Denied" and "Azure WI: No Matching Federated Identity Record" sections
are causing markdownlint MD040; update each triple-backtick fence that contains
the error messages (the blocks shown below those headings) to include a language
tag of text (i.e., ```text) so the snippets remain unchanged visually but
satisfy the linter; locate the blocks by the section headings "AWS STS: Access
Denied" and "Azure WI: No Matching Federated Identity Record" and modify the
opening fences accordingly.
- Around line 15-25: Update the documentation to match actual copy semantics and
resolver precedence: reconcile the Secret name examples (replace or mention both
`cloud-credentials` and `etcd-backup-creds` where CR examples use the latter)
and clarify that the HyperShift-copied Secret may contain either the remapped
`credentials` key, the original `cloud` key, or both depending on copy logic;
explicitly document the resolver behavior that for Azure the resolver checks the
`cloud` key first and will classify any Secret with a `cloud` key as
workload-identity (so client-secret mode must not include `cloud` and must
supply the `credentials` JSON with the client secret), and update the narrative
in the affected sections (including the blocks referencing `cloud-credentials`,
`etcd-backup-creds`, keys `cloud` and `credentials`, and the Azure mode
description) to reflect this precedence and required Secret shapes.
- Around line 88-117: In the "IAM Role Permissions" section update the JSON
policy snippet to remove the invalid IAM actions (`s3:CreateMultipartUpload`,
`s3:UploadPart`, `s3:CompleteMultipartUpload`) and only include the valid
actions `s3:PutObject` and `s3:AbortMultipartUpload` in the "Action" array for
the shown resource ARNs; also adjust the explanatory note below the snippet to
state that multipart upload is authorized by `s3:PutObject` and that
`s3:AbortMultipartUpload` is included for cleanup, removing the claim that the
removed actions are required for the AWS SDK v2 Transfer Manager.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go`:
- Around line 560-575: The current ensureServiceAccount uses a single global
jobServiceAccountName and races when multiple HCP namespaces run backups; change
ensureServiceAccount to create a per-job/per-credential ServiceAccount (e.g.,
include the HCP namespace and a short credential or job identifier derived from
resolvedCredentials and the HCP being reconciled) instead of reusing
jobServiceAccountName, update any RoleBinding/ensureRoleBinding logic to
reference that per-job ServiceAccount subject (adjust symbols
ensureServiceAccount and any ensureRoleBinding or RoleBinding creation), and
ensure cleanup of the per-job ServiceAccount/RoleBinding after job completion or
failure.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3f26774b-d142-433f-a5fe-0126039ec61b

📥 Commits

Reviewing files that changed from the base of the PR and between 60802b1 and c92c851.

⛔ Files ignored due to path filters (1)

docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md

📒 Files selected for processing (7)

docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md
docs/mkdocs.yml
hypershift-operator/controllers/etcdbackup/credentials.go
hypershift-operator/controllers/etcdbackup/credentials_test.go
hypershift-operator/controllers/etcdbackup/reconciler.go
hypershift-operator/controllers/etcdbackup/reconciler_test.go

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

hypershift-operator/controllers/etcdbackup/reconciler_test.go (1)
1076-1092: Minor: Backup name generation may produce unexpected characters for larger test sets.

The expression string(rune('a'+i)) works correctly for i < 26, but if the test set were expanded beyond 26 items, it would produce non-letter characters. This is fine for the current test with only 3 items, but consider using fmt.Sprintf("backup-%d", i) for consistency with other similar tests in this file (e.g., line 1025).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/etcdbackup/reconciler_test.go` around lines
1076 - 1092, The backup name generation using string(rune('a'+i)) inside the
loop that creates HCPEtcdBackup objects can produce unexpected characters for
larger i; update the name to use a numeric format (e.g.,
fmt.Sprintf("backup-%d", i)) when constructing each HCPEtcdBackup in the for i
:= range 3 loop so names are predictable and consistent with other tests, and
add an import for fmt if not already present.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hypershift-operator/controllers/etcdbackup/credentials.go`:
- Around line 64-79: When resolving the "cloud" secret in
resolveAzureCredentials, do not return credentialModeAzureWorkloadIdentity with
an empty ClientID; instead check if clientID == "" and if so log a warning
(e.g., klog.Warningf or processLogger) referencing secret.Name and the missing
AZURE_CLIENT_ID, then return a non-workload mode such as credentialModeUnknown
(or the existing fallback mode) so the ServiceAccount is not annotated with
azure.workload.identity/client-id:""; only set Mode to
credentialModeAzureWorkloadIdentity and populate ClientID when a non-empty
AZURE_CLIENT_ID is found.

---

Nitpick comments:
In `@hypershift-operator/controllers/etcdbackup/reconciler_test.go`:
- Around line 1076-1092: The backup name generation using string(rune('a'+i))
inside the loop that creates HCPEtcdBackup objects can produce unexpected
characters for larger i; update the name to use a numeric format (e.g.,
fmt.Sprintf("backup-%d", i)) when constructing each HCPEtcdBackup in the for i
:= range 3 loop so names are predictable and consistent with other tests, and
add an import for fmt if not already present.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3868b814-bbdb-466d-bad3-fbb6c8ba9d04

📥 Commits

Reviewing files that changed from the base of the PR and between c92c851 and b337ffc.

⛔ Files ignored due to path filters (1)

docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md

📒 Files selected for processing (7)

docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md
docs/mkdocs.yml
hypershift-operator/controllers/etcdbackup/credentials.go
hypershift-operator/controllers/etcdbackup/credentials_test.go
hypershift-operator/controllers/etcdbackup/reconciler.go
hypershift-operator/controllers/etcdbackup/reconciler_test.go

✅ Files skipped from review due to trivial changes (3)

docs/mkdocs.yml
docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
hypershift-operator/controllers/etcdbackup/credentials_test.go

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md (2)
277-289: ⚠️ Potential issue | 🟡 Minor

Add language tags to the troubleshooting code fences.

This issue was flagged in a previous review but remains unaddressed. The fenced code blocks at Lines 277-279 and 287-289 lack language tags, triggering markdownlint MD040. Adding text as the language identifier will resolve the warning without changing rendering.
📝 Proposed fix
 ### AWS STS: Access Denied
 
-```
+```text
 An error occurred (AccessDenied) when calling the <S3_OPERATION> operation
 ```

 ...

 ### Azure WI: No Matching Federated Identity Record
 
-```
+```text
 AADSTS700213: No matching federated identity record found for presented assertion subject
 ```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`
around lines 277 - 289, Two fenced code blocks containing the literal strings
"An error occurred (AccessDenied) when calling the <S3_OPERATION> operation" and
"AADSTS700213: No matching federated identity record found for presented
assertion subject" are missing language tags causing markdownlint MD040; fix by
adding the language identifier `text` immediately after the opening triple
backticks for both code fences (i.e., change the opening ``` to ```text for the
block containing the S3 AccessDenied message and for the block containing the
AADSTS700213 message) and keep the existing closing ``` unchanged.
88-117: ⚠️ Potential issue | 🟠 Major

Remove invalid IAM action names from the policy snippet; only s3:PutObject and s3:AbortMultipartUpload are needed.

This issue was flagged in a previous review but remains unaddressed. According to AWS documentation, s3:CreateMultipartUpload, s3:UploadPart, and s3:CompleteMultipartUpload are API operation names, not IAM policy action names. Multipart upload operations are authorized entirely by s3:PutObject on the target object. The policy should include only:

s3:PutObject (covers CreateMultipartUpload, UploadPart, and CompleteMultipartUpload operations)

s3:AbortMultipartUpload (valid IAM action for cleanup)

Update the policy snippet to remove Lines 100-102 and revise the explanatory note (Lines 114-116) to state that multipart upload is authorized by s3:PutObject and that s3:AbortMultipartUpload is included for cleanup.
🛠️ Proposed fix for the IAM policy
       "Action": [
         "s3:PutObject",
-        "s3:CreateMultipartUpload",
-        "s3:UploadPart",
-        "s3:CompleteMultipartUpload",
         "s3:AbortMultipartUpload"
       ],
And update the note:
 !!! note
 
-    The multipart permissions (`CreateMultipartUpload`, `UploadPart`, `CompleteMultipartUpload`, `AbortMultipartUpload`) are required because the upload uses the AWS SDK v2 Transfer Manager, which automatically splits files larger than 5 MB into multipart uploads. Etcd snapshots typically range from 30-100 MB, so multipart upload is used in practice. `AbortMultipartUpload` is needed to clean up incomplete uploads on failure.
+    The `s3:PutObject` permission authorizes all multipart upload operations (CreateMultipartUpload, UploadPart, CompleteMultipartUpload) used by the AWS SDK v2 Transfer Manager, which automatically splits files larger than 5 MB into multipart uploads. Etcd snapshots typically range from 30-100 MB, so multipart upload is used in practice. `s3:AbortMultipartUpload` is included to clean up incomplete uploads on failure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`
around lines 88 - 117, Update the IAM policy JSON and explanatory note: remove
the invalid IAM actions "s3:CreateMultipartUpload", "s3:UploadPart", and
"s3:CompleteMultipartUpload" from the Action array and keep only "s3:PutObject"
and "s3:AbortMultipartUpload" (so the statement lists s3:PutObject and
s3:AbortMultipartUpload for the bucket and objects), and revise the accompanying
note to state that multipart uploads are authorized by s3:PutObject (which
covers CreateMultipartUpload/UploadPart/CompleteMultipartUpload operations) and
that s3:AbortMultipartUpload is included solely for cleanup of incomplete
uploads.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`:
- Around line 15-27: There is a name mismatch between the destination Secret
shown in the explanation table ("cloud-credentials") and the HCPEtcdBackup CR
examples that reference "etcd-backup-creds"; pick the correct canonical Secret
name used by the OADP HyperShift plugin and make the docs consistent by either
(a) updating the table to show "etcd-backup-creds" if the plugin actually
creates that name, or (b) updating both HCPEtcdBackup CR examples to reference
"cloud-credentials" if that is the source of truth; additionally, add one short
clarifying sentence stating whether the plugin auto-creates and copies the
Secret (so no manual creation is required) or whether users must
create/reference a different Secret name, and apply the same change to both
HCPEtcdBackup examples currently showing the alternate name.
- Around line 275-284: Update the permissions listed in the "AWS STS: Access
Denied" troubleshooting bullet so it matches the corrected IAM policy: replace
the full multipart upload action list with only the two actions allowed by the
policy—s3:PutObject and s3:AbortMultipartUpload—so the bullet under "AWS STS:
Access Denied" contains only `s3:PutObject` and `s3:AbortMultipartUpload` and
remove `s3:CreateMultipartUpload`, `s3:UploadPart`, and
`s3:CompleteMultipartUpload`.

---

Duplicate comments:
In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`:
- Around line 277-289: Two fenced code blocks containing the literal strings "An
error occurred (AccessDenied) when calling the <S3_OPERATION> operation" and
"AADSTS700213: No matching federated identity record found for presented
assertion subject" are missing language tags causing markdownlint MD040; fix by
adding the language identifier `text` immediately after the opening triple
backticks for both code fences (i.e., change the opening ``` to ```text for the
block containing the S3 AccessDenied message and for the block containing the
AADSTS700213 message) and keep the existing closing ``` unchanged.
- Around line 88-117: Update the IAM policy JSON and explanatory note: remove
the invalid IAM actions "s3:CreateMultipartUpload", "s3:UploadPart", and
"s3:CompleteMultipartUpload" from the Action array and keep only "s3:PutObject"
and "s3:AbortMultipartUpload" (so the statement lists s3:PutObject and
s3:AbortMultipartUpload for the bucket and objects), and revise the accompanying
note to state that multipart uploads are authorized by s3:PutObject (which
covers CreateMultipartUpload/UploadPart/CompleteMultipartUpload operations) and
that s3:AbortMultipartUpload is included solely for cleanup of incomplete
uploads.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4351cd03-c94a-4676-b1ee-0ef34059bcec

📥 Commits

Reviewing files that changed from the base of the PR and between b337ffc and f26d30c.

⛔ Files ignored due to path filters (1)

docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md

📒 Files selected for processing (3)

docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md
docs/mkdocs.yml

✅ Files skipped from review due to trivial changes (2)

docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
docs/mkdocs.yml

coderabbitai

♻️ Duplicate comments (2)

hypershift-operator/controllers/etcdbackup/reconciler.go (1)
61-62: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Shared ServiceAccount still creates a cross-namespace Azure WI race.

jobServiceAccountName is global in the operator namespace, but annotation state is mutated per reconcile. Because active-job gating is per HCP namespace, concurrent backups in different HCP namespaces can overwrite/remove azure.workload.identity/client-id on the same SA before pod admission.

Also applies to: 560-575, 625-629, 806-807
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go` around lines 61 -
62, The global jobServiceAccountName leads to cross-namespace races because the
controller mutates the azure.workload.identity/client-id annotation per HCP
during Reconcile; change to use a namespace-scoped ServiceAccount name (e.g.,
derive SA name from jobServiceAccountName plus the HCP namespace or HCP UID) so
each HCP gets its own SA, update all places that reference jobServiceAccountName
(including the blocks around the earlier mentions) to use the derived per-HCP
name, and ensure all annotation reads/writes target that per-HCP ServiceAccount
rather than a single shared SA to avoid concurrent annotation overwrites.
docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md (1)
25-26: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Azure detection description is internally inconsistent with resolver precedence.

Line 25 says Azure falls back from cloud to credentials, but Line 230 says finding cloud triggers WI. In hypershift-operator/controllers/etcdbackup/credentials.go (Lines 65-79), a present cloud key with missing AZURE_CLIENT_ID returns managed-identity immediately (no credentials fallback). Please align this section to the actual precedence to avoid misconfiguration.

Also applies to: 230-230
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`
around lines 25 - 26, Documentation text in managed-services-credentials.md
incorrectly states Azure falls back from `cloud` to `credentials` but the actual
resolver in hypershift-operator/controllers/etcdbackup/credentials.go (function
handling detection around lines 65-79) treats a present `cloud` key with missing
AZURE_CLIENT_ID as immediate managed-identity (no `credentials` fallback).
Update both occurrences (around lines 25 and 230) to reflect the real
precedence: explain that presence of the `cloud` key triggers the
managed-identity flow unless explicit AZURE client credentials
(AZURE_CLIENT_ID/CLIENT_SECRET/TENANT_ID) are present, and clarify there is no
automatic fallback to `credentials` when `cloud` exists but AZURE_* values are
absent; reference the resolver behavior in
hypershift-operator/controllers/etcdbackup/credentials.go to ensure wording
matches code.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`:
- Around line 25-26: Documentation text in managed-services-credentials.md
incorrectly states Azure falls back from `cloud` to `credentials` but the actual
resolver in hypershift-operator/controllers/etcdbackup/credentials.go (function
handling detection around lines 65-79) treats a present `cloud` key with missing
AZURE_CLIENT_ID as immediate managed-identity (no `credentials` fallback).
Update both occurrences (around lines 25 and 230) to reflect the real
precedence: explain that presence of the `cloud` key triggers the
managed-identity flow unless explicit AZURE client credentials
(AZURE_CLIENT_ID/CLIENT_SECRET/TENANT_ID) are present, and clarify there is no
automatic fallback to `credentials` when `cloud` exists but AZURE_* values are
absent; reference the resolver behavior in
hypershift-operator/controllers/etcdbackup/credentials.go to ensure wording
matches code.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go`:
- Around line 61-62: The global jobServiceAccountName leads to cross-namespace
races because the controller mutates the azure.workload.identity/client-id
annotation per HCP during Reconcile; change to use a namespace-scoped
ServiceAccount name (e.g., derive SA name from jobServiceAccountName plus the
HCP namespace or HCP UID) so each HCP gets its own SA, update all places that
reference jobServiceAccountName (including the blocks around the earlier
mentions) to use the derived per-HCP name, and ensure all annotation
reads/writes target that per-HCP ServiceAccount rather than a single shared SA
to avoid concurrent annotation overwrites.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e0f9440f-58d1-4e2a-96e1-eb34dd0d9147

📥 Commits

Reviewing files that changed from the base of the PR and between f26d30c and c3fc037.

⛔ Files ignored due to path filters (1)

docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md

📒 Files selected for processing (7)

docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md
docs/mkdocs.yml
hypershift-operator/controllers/etcdbackup/credentials.go
hypershift-operator/controllers/etcdbackup/credentials_test.go
hypershift-operator/controllers/etcdbackup/reconciler.go
hypershift-operator/controllers/etcdbackup/reconciler_test.go

✅ Files skipped from review due to trivial changes (2)

docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
docs/mkdocs.yml

cblecker

Nice separation of the credential resolution into its own file — the detection logic and test coverage look solid. Two small items inline.

cblecker · 2026-05-01T22:08:14Z


 Step-by-step description of the restore process: how the OADP plugin injects the snapshot URL, how the Control Plane Operator restores etcd, and how the cluster recovers.
+
+### [Managed Services Credentials](managed-services-credentials.md)


This new page also needs an entry in docs/mkdocs.yml under the "Etcd Snapshot Backup (Tech Preview)" nav section, otherwise it won't show up in the sidebar. Something like:

- how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md

after the restore-flow.md entry.

cblecker · 2026-05-01T22:08:21Z

+	case hyperv1.AzureBlobBackupStorage:
+		return resolveAzureCredentials(secret)
+	default:
+		return resolvedCredentials{Mode: credentialModeAWSStatic, SecretName: secret.Name}


nit: this default case silently returns credentialModeAWSStatic for any unknown storage type. The callers (getCredentialSecretName, buildUploadArgs) already error on unsupported types so it's not a bug today, but if a new storage type gets added later this could mask issues. Worth considering returning a distinct unknown mode or logging a warning here.

The HCPEtcdBackup controller now inspects the referenced credential Secret to determine the authentication mode instead of always passing --credentials-file. This enables IRSA/STS on AWS (ROSA HCP) and Workload Identity on Azure (ARO HCP) without API changes. Detection logic: - AWS STS: Secret key "credentials" contains an ARN prefix - AWS static: Secret key "credentials" contains INI-format creds - Azure WI: Secret key "cloud" with AZURE_CLIENT_ID env format - Azure client-secret: Secret key "credentials" with clientSecret JSON - Azure managed-identity: fallback for other credential formats Job PodSpec is built per mode: - AWS STS: projected SA token volume + AWS_ROLE_ARN env var - Azure WI: pod label azure.workload.identity/use=true, SA annotated - Static/client-secret: credentials Secret mounted as volume - Azure modes: --azure-auth-type arg set accordingly Ref: CNTRLPLANE-3167 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>

…ckup Document credential auto-detection for ROSA HCP (AWS STS/IRSA) and ARO HCP (Azure Workload Identity) in the etcd snapshot backup flow. Covers OADP plugin Secret flow with key remapping, IAM/federated credential prerequisites, and troubleshooting. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>

…dBackup For Azure Workload Identity mode, if the etcd-backup-job ServiceAccount already has an azure.workload.identity/client-id annotation (e.g., set by infrastructure or Helm), it is preserved. Otherwise, the annotation is set from the credential secret, matching the original behavior for self-managed HyperShift deployments. This allows managed platforms (ARO-HCP) to use a dedicated managed identity for the backup job while self-managed deployments can continue reusing the Velero identity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ckup Replace the findActiveJob guard in cleanupResources with hasNonTerminalBackup, which checks for any non-terminal HCPEtcdBackup CR in the same namespace. This closes the race window where a new backup CR has been created but its Job has not been spawned yet, preventing premature deletion of shared RBAC and NetworkPolicy resources. Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>

hypershift-jira-solve-ci · 2026-05-07T14:25:47Z

Now I have the complete picture. Here is the report:

Test Failure Analysis Complete

Job Information

Prow Job: codecov/project
Build ID: Check run 74836260351
PR: #8368 — CNTRLPLANE-3167: Auto-detect credential type for HCPEtcdBackup
Commit: 313e3da
Base Commit: 76d06fa (main)

Test Failure Analysis

Error

codecov/project: 33.03% (-4.47%) compared to 76d06fa

Summary

The codecov/project check run 74836260351 failed transiently at 14:22:01 UTC because Codecov computed coverage from an incomplete upload — only a subset of coverage flags had reported at that point, causing a phantom drop from 37.50% to 33.03% (−4.47%, 187 files and 35,887 lines appearing to vanish). Approximately 2 minutes later, at 14:23:55 UTC, a second codecov/project check run (74836651183) was created after all coverage flags finished uploading, correctly reporting 37.60% (+0.09%) and concluding as success. The codecov/project check currently shows passing on the PR. This failure is fully resolved and requires no action.

Root Cause

Codecov uses a flag-based coverage system where different test suites (e.g., cmd-support, cpo-hostedcontrolplane, cpo-other, hypershift-operator, other) upload their coverage profiles independently. When the first upload completed, Codecov prematurely evaluated the codecov/project status using only the partial data:

Partial report (14:22:01): 564 files, 56,105 lines, 18,537 hits → 33.03%
Complete report (14:23:55): 752 files, 92,137 lines, 34,645 hits → 37.60%

The difference (187 missing files, 35,887 missing lines) corresponds exactly to the coverage flags that had not yet uploaded. This is a known Codecov behavior: when wait_for_ci: false is set (as it is in this repo's codecov.yml), Codecov does not wait for all CI jobs to finish before posting results. It creates an initial check run with partial data, then updates (or replaces) it when more data arrives.

The failing check run 74836260351 was superseded by the passing check run 74836651183 — both are on the same commit 313e3da. The PR's current codecov/project status is passing. The PR's own code changes have excellent coverage: 95.52% patch coverage with only 9 lines uncovered across reconciler.go (4 missing + 3 partials) and credentials.go (2 missing), and overall project coverage actually increased by +0.09%.

Recommendations

No action required — the codecov/project check has already self-resolved to success (check run 74836651183, 37.60% / +0.09%). The failure was transient.
Consider setting wait_for_ci: true — the current codecov.yml has wait_for_ci: false, which causes Codecov to evaluate coverage before all flag uploads complete. Changing this to true would prevent transient failures from partial uploads:
```
codecov:
  branch: main
  notify:
    wait_for_ci: true
```
Consider adding a project coverage threshold — the repo has no explicit coverage.status.project configuration, so Codecov uses the auto target (any decrease = failure). Adding a threshold would tolerate minor fluctuations:
```
coverage:
  status:
    project:
      default:
        threshold: 1%
```

Evidence

Evidence	Detail
Failing check run	`74836260351` — `failure` at 14:22:01 UTC — 33.03% (−4.47%)
Passing check run	`74836651183` — `success` at 14:23:55 UTC — 37.60% (+0.09%)
Time between runs	~2 minutes (partial → complete upload)
Partial report stats	564 files, 56,105 lines, 18,537 hits
Complete report stats	752 files, 92,137 lines, 34,645 hits
Missing in partial	187 files, 35,887 lines (exactly the unflagged uploads)
PR patch coverage	95.52% — 9 lines uncovered in 2 files
Current PR status	`codecov/project` = pass, `codecov/patch` = pass
`codecov.yml` config	`wait_for_ci: false` — evaluates before all uploads arrive
Coverage target	None configured — uses Codecov `auto` default (any drop = fail)

openshift-ci · 2026-05-07T14:43:12Z

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

cblecker

A few more things beyond the inline comments:

The mode-specific checks (creds.Mode == credentialModeAWSSTS, creds.Mode == credentialModeAzureWorkloadIdentity) are scattered across reconciler.go in four places. Moving these to behavioral methods on resolvedCredentials (like needsProjectedToken(), needsWorkloadIdentityLabel(), azureAuthType()) would keep all mode knowledge co-located in credentials.go and make the reconciler a pure consumer of queries. Not blocking, but would clean up nicely.

Also, there's no test that verifies the precedence order when both cloud and credentials keys are present in the same secret. The resolver checks cloud first, which is the right behavior, but a test documenting that priority would prevent future regressions.

cblecker · 2026-05-07T15:41:00Z

+		for line := range strings.SplitSeq(string(cloudData), "\n") {
+			line = strings.TrimSpace(line)
+			if v, ok := strings.CutPrefix(line, "AZURE_CLIENT_ID="); ok {
+				clientID = v


resolveAWSCredentials correctly trims whitespace from the credentials value with strings.TrimSpace, but this path doesn't trim the extracted clientID. If the secret data has Windows-style line endings (\r\n) or trailing whitespace, you'd end up with clientID = "client-789\r" which would be set as the SA annotation and silently fail against Azure AD.

Worth adding clientID = strings.TrimSpace(v) here for consistency with the AWS path.

cblecker · 2026-05-07T15:41:08Z

+		var creds struct {
+			ClientSecret string `json:"clientSecret"`
+		}
+		if err := json.Unmarshal(credData, &creds); err == nil && creds.ClientSecret != "" {


If the credentials key contains malformed JSON (truncated secret, wrong encoding, a typo), this silently falls through to managed-identity mode with no logging or feedback. A user who provides a client-secret JSON file with a typo would get managed-identity auth instead, and the root cause would be hard to track down.

Even without changing the function signature, a log line here when err != nil would help operators diagnose misconfigured secrets.

cblecker · 2026-05-07T15:41:16Z

+			if sa.Annotations == nil {
+				sa.Annotations = map[string]string{}
+			}
+			if sa.Annotations["azure.workload.identity/client-id"] == "" {


This preserves an existing annotation, which makes sense for infrastructure-managed SAs. But it also means if the credential secret is rotated to a different Azure client ID, the SA annotation won't be updated — the backup job would continue using the old client ID.

Might be worth at least logging when the existing annotation value differs from the resolved creds.ClientID, so operators can tell why the "wrong" identity is being used.

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 29, 2026

openshift-ci Bot added the do-not-merge/needs-area label Apr 29, 2026

openshift-ci Bot requested review from cblecker and csrwng April 29, 2026 11:14

openshift-ci Bot added the area/documentation Indicates the PR includes changes for documentation label Apr 29, 2026

openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Apr 29, 2026

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

jparrill force-pushed the CNTRLPLANE-3167 branch from c92c851 to b337ffc Compare April 29, 2026 11:34

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread hypershift-operator/controllers/etcdbackup/credentials.go

jparrill force-pushed the CNTRLPLANE-3167 branch from b337ffc to f26d30c Compare April 29, 2026 12:00

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Comment thread docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md

Comment thread docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md

hypershift-jira-solve-ci Bot mentioned this pull request Apr 29, 2026

OCPBUGS-84528: clarify pull secret in-place update behavior and add CP watches #8327

Merged

4 tasks

jparrill force-pushed the CNTRLPLANE-3167 branch from f26d30c to c3fc037 Compare April 30, 2026 08:40

coderabbitai Bot reviewed Apr 30, 2026

View reviewed changes

jparrill force-pushed the CNTRLPLANE-3167 branch from c3fc037 to 50a1270 Compare April 30, 2026 10:20

jparrill mentioned this pull request Apr 30, 2026

CNTRLPLANE-3167: support STS/IRSA credentials and standalone Velero openshift/hypershift-oadp-plugin#247

Draft

8 tasks

cblecker reviewed May 1, 2026

View reviewed changes

jparrill and others added 4 commits May 7, 2026 10:15

jparrill force-pushed the CNTRLPLANE-3167 branch from 50a1270 to 313e3da Compare May 7, 2026 14:15

cblecker reviewed May 7, 2026

View reviewed changes


		Step-by-step description of the restore process: how the OADP plugin injects the snapshot URL, how the Control Plane Operator restores etcd, and how the cluster recovers.

		### [Managed Services Credentials](managed-services-credentials.md)

Conversation

jparrill commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Description

Related PRs

Jira

Test plan

Uh oh!

openshift-merge-bot Bot commented Apr 29, 2026

Uh oh!

openshift-ci-robot commented Apr 29, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Description

Jira

Test plan

Uh oh!

coderabbitai Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Sequence Diagram(s)

❌ Failed checks (1 warning)

Uh oh!

openshift-ci Bot commented Apr 29, 2026

Uh oh!

codecov Bot commented Apr 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cblecker left a comment

Choose a reason for hiding this comment

Uh oh!

cblecker May 1, 2026

Choose a reason for hiding this comment

Uh oh!

cblecker May 1, 2026

Choose a reason for hiding this comment

Uh oh!

hypershift-jira-solve-ci Bot commented May 7, 2026 • edited by openshift-ci Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Summary

Uh oh!

openshift-ci Bot commented May 7, 2026

Uh oh!

cblecker left a comment

Choose a reason for hiding this comment

Uh oh!

cblecker May 7, 2026

Choose a reason for hiding this comment

Uh oh!

cblecker May 7, 2026

Choose a reason for hiding this comment

Uh oh!

cblecker May 7, 2026

Choose a reason for hiding this comment

jparrill commented Apr 29, 2026 •

edited

Loading

openshift-ci-robot commented Apr 29, 2026 •

edited by openshift-ci Bot

Loading

coderabbitai Bot commented Apr 29, 2026 •

edited

Loading

codecov Bot commented Apr 29, 2026 •

edited

Loading

hypershift-jira-solve-ci Bot commented May 7, 2026 •

edited by openshift-ci Bot

Loading