Skip to content

CNTRLPLANE-3167: Auto-detect credential type for HCPEtcdBackup#8368

Open
jparrill wants to merge 4 commits intoopenshift:mainfrom
jparrill:CNTRLPLANE-3167
Open

CNTRLPLANE-3167: Auto-detect credential type for HCPEtcdBackup#8368
jparrill wants to merge 4 commits intoopenshift:mainfrom
jparrill:CNTRLPLANE-3167

Conversation

@jparrill
Copy link
Copy Markdown
Contributor

@jparrill jparrill commented Apr 29, 2026

Summary

  • Auto-detect credential mode (STS/IRSA for AWS, Workload Identity for Azure) from the credential Secret content in the HCPEtcdBackup controller
  • Extract credential resolution logic into credentials.go with dedicated unit tests
  • Add managed services credential documentation covering OADP plugin Secret flow, IAM/federated credential prerequisites, and troubleshooting

Description

ROSA HCP and ARO HCP use short-lived federated credentials instead of long-lived static keys. This PR enables the HCPEtcdBackup controller to auto-detect the credential type from the Secret format:

  • AWS STS/IRSA: When the credentials key contains a bare IAM role ARN (arn:...), the controller configures a projected SA token volume with sts.amazonaws.com audience and sets AWS_ROLE_ARN/AWS_WEB_IDENTITY_TOKEN_FILE env vars
  • Azure Workload Identity: When the Secret has a cloud key with a non-empty AZURE_CLIENT_ID=, the controller adds the azure.workload.identity/use=true pod label and annotates the SA with azure.workload.identity/client-id
  • Static credentials (AWS INI file, Azure client-secret, Azure managed-identity) continue to work as before

No API changes — detection is purely based on Secret content inspection.

Related PRs

Jira

Test plan

  • Unit tests for resolveAWSCredentials, resolveAzureCredentials, resolveCredentials, needsCredentialsFile
  • Unit tests for createBackupJob with STS, Workload Identity, and managed-identity modes
  • Unit tests for ensureServiceAccount annotation management
  • E2E: AWS STS backup on ROSA HCP cluster
  • E2E: Azure Workload Identity backup on ARO HCP cluster

🤖 Generated with Claude Code

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 29, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 29, 2026

@jparrill: This pull request references CNTRLPLANE-3167 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

  • Auto-detect credential mode (STS/IRSA for AWS, Workload Identity for Azure) from the credential Secret content in the HCPEtcdBackup controller
  • Extract credential resolution logic into credentials.go with dedicated unit tests
  • Add managed services credential documentation covering OADP plugin Secret flow, IAM/federated credential prerequisites, and troubleshooting

Description

ROSA HCP and ARO HCP use short-lived federated credentials instead of long-lived static keys. This PR enables the HCPEtcdBackup controller to auto-detect the credential type from the Secret format:

  • AWS STS/IRSA: When the credentials key contains a bare IAM role ARN (arn:...), the controller configures a projected SA token volume with sts.amazonaws.com audience and sets AWS_ROLE_ARN/AWS_WEB_IDENTITY_TOKEN_FILE env vars
  • Azure Workload Identity: When the Secret has a cloud key, the controller adds the azure.workload.identity/use=true pod label and annotates the SA with azure.workload.identity/client-id
  • Static credentials (AWS INI file, Azure client-secret, Azure managed-identity) continue to work as before

No API changes — detection is purely based on Secret content inspection.

Jira

Test plan

  • Unit tests for resolveAWSCredentials, resolveAzureCredentials, resolveCredentials, needsCredentialsFile
  • Unit tests for createBackupJob with STS, Workload Identity, and managed-identity modes
  • Unit tests for ensureServiceAccount annotation management
  • E2E: AWS STS backup on ROSA HCP cluster
  • E2E: Azure Workload Identity backup on ARO HCP cluster

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The pull request implements credential auto-detection and handling for HCPEtcdBackup, adding a credential-resolution module that classifies AWS (static vs STS/IRSA) and Azure (workload-identity, client-secret, managed-identity) modes from Secret contents. The reconciler now derives a resolvedCredentials value and passes it to ServiceAccount and Job creation, which conditionally set annotations, labels, volumes, mounts, env vars, and CLI flags. Unit tests validate parsing and Job/ServiceAccount behavior. Docs and navigation add a Managed Services Credentials tech-preview guide.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Reconciler
    participant APIServer as Kubernetes API
    participant Cloud as Cloud Platform

    User->>APIServer: Create HCPEtcdBackup CR + credentials Secret
    Reconciler->>APIServer: Watch HCPEtcdBackup
    Note over Reconciler: Trigger reconciliation

    Reconciler->>APIServer: Read credentials Secret
    APIServer-->>Reconciler: Secret data (cloud, credentials)

    Note over Reconciler: Resolve credential mode
    alt AWS STS detected
        Reconciler->>APIServer: Create ServiceAccount
        Reconciler->>APIServer: Create Job with projected token volume
        Reconciler->>APIServer: Set AWS_ROLE_ARN & AWS_WEB_IDENTITY_TOKEN_FILE env
        Note over Reconciler: No --credentials-file
    else AWS Static detected
        Reconciler->>APIServer: Create ServiceAccount
        Reconciler->>APIServer: Create Job with credentials secret volume
        Note over Reconciler: Include --credentials-file
    else Azure Workload Identity detected
        Reconciler->>APIServer: Create ServiceAccount with azure.workload.identity/client-id
        Reconciler->>APIServer: Create Job with pod label azure.workload.identity/use=true
        Note over Reconciler: No --credentials-file
    else Azure Client Secret detected
        Reconciler->>APIServer: Create ServiceAccount
        Reconciler->>APIServer: Create Job with credentials secret volume
        Note over Reconciler: Include --credentials-file and --azure-auth-type client-secret
    else Azure Managed Identity detected
        Reconciler->>APIServer: Create ServiceAccount
        Reconciler->>APIServer: Create Job with credentials secret volume
        Note over Reconciler: Include --credentials-file and --azure-auth-type managed-identity
    end

    Reconciler->>APIServer: Job pod scheduled
    APIServer->>Cloud: Pod authenticates using detected credential mode
    Cloud-->>APIServer: Credentials validated
    Note over Cloud: Pod performs etcd snapshot upload
Loading
🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and specifically summarizes the main change: auto-detecting credential types for HCPEtcdBackup, which is the primary objective reflected throughout all changeset files (credentials resolution logic, controller refactoring, documentation, and tests).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All 21 test functions across test files use standard Go testing with t.Run() subtests containing 54 static, deterministic test names following consistent patterns with no dynamic content.
Test Structure And Quality ✅ Passed Tests demonstrate high quality with single responsibility, focused subtests, functional helper setup/cleanup, exclusive use of fake clients without timeouts, and consistency with repository conventions.
Microshift Test Compatibility ✅ Passed This PR adds only standard Go unit tests, not Ginkgo e2e tests. E2e tests are explicitly listed as TODO and not implemented.
Single Node Openshift (Sno) Test Compatibility ✅ Passed This pull request does not add any Ginkgo e2e tests. The new test files are unit tests using standard Go testing conventions, not Ginkgo patterns. No new e2e test files were added.
Topology-Aware Scheduling Compatibility ✅ Passed The PR introduces no scheduling constraints that assume a standard HA topology; backup Job uses standard Kubernetes scheduling without affinity rules, node selectors, or topology spread constraints.
Ote Binary Stdout Contract ✅ Passed PR introduces credential resolution logic and controller modifications without violating OTE Binary Stdout Contract; all logging uses controller-runtime structured logging framework.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed The PR does not add any Ginkgo e2e tests. Test files use standard Go testing package, not Ginkgo patterns.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from cblecker and csrwng April 29, 2026 11:14
@openshift-ci openshift-ci Bot added the area/documentation Indicates the PR includes changes for documentation label Apr 29, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jparrill

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-area labels Apr 29, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

❌ Patch coverage is 95.52239% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.60%. Comparing base (76d06fa) to head (313e3da).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...hift-operator/controllers/etcdbackup/reconciler.go 95.03% 4 Missing and 3 partials ⚠️
...ift-operator/controllers/etcdbackup/credentials.go 96.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8368      +/-   ##
==========================================
+ Coverage   37.50%   37.60%   +0.09%     
==========================================
  Files         751      752       +1     
  Lines       91992    92137     +145     
==========================================
+ Hits        34505    34645     +140     
- Misses      54844    54848       +4     
- Partials     2643     2644       +1     
Files with missing lines Coverage Δ
...ift-operator/controllers/etcdbackup/credentials.go 96.66% <96.66%> (ø)
...hift-operator/controllers/etcdbackup/reconciler.go 76.69% <95.03%> (+2.35%) ⬆️
Flag Coverage Δ
cmd-support 32.68% <ø> (ø)
cpo-hostedcontrolplane 36.77% <ø> (ø)
cpo-other 37.73% <ø> (ø)
hypershift-operator 48.23% <95.52%> (+0.29%) ⬆️
other 27.77% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`:
- Around line 275-289: The unlabeled fenced code blocks under the "AWS STS:
Access Denied" and "Azure WI: No Matching Federated Identity Record" sections
are causing markdownlint MD040; update each triple-backtick fence that contains
the error messages (the blocks shown below those headings) to include a language
tag of text (i.e., ```text) so the snippets remain unchanged visually but
satisfy the linter; locate the blocks by the section headings "AWS STS: Access
Denied" and "Azure WI: No Matching Federated Identity Record" and modify the
opening fences accordingly.
- Around line 15-25: Update the documentation to match actual copy semantics and
resolver precedence: reconcile the Secret name examples (replace or mention both
`cloud-credentials` and `etcd-backup-creds` where CR examples use the latter)
and clarify that the HyperShift-copied Secret may contain either the remapped
`credentials` key, the original `cloud` key, or both depending on copy logic;
explicitly document the resolver behavior that for Azure the resolver checks the
`cloud` key first and will classify any Secret with a `cloud` key as
workload-identity (so client-secret mode must not include `cloud` and must
supply the `credentials` JSON with the client secret), and update the narrative
in the affected sections (including the blocks referencing `cloud-credentials`,
`etcd-backup-creds`, keys `cloud` and `credentials`, and the Azure mode
description) to reflect this precedence and required Secret shapes.
- Around line 88-117: In the "IAM Role Permissions" section update the JSON
policy snippet to remove the invalid IAM actions (`s3:CreateMultipartUpload`,
`s3:UploadPart`, `s3:CompleteMultipartUpload`) and only include the valid
actions `s3:PutObject` and `s3:AbortMultipartUpload` in the "Action" array for
the shown resource ARNs; also adjust the explanatory note below the snippet to
state that multipart upload is authorized by `s3:PutObject` and that
`s3:AbortMultipartUpload` is included for cleanup, removing the claim that the
removed actions are required for the AWS SDK v2 Transfer Manager.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go`:
- Around line 560-575: The current ensureServiceAccount uses a single global
jobServiceAccountName and races when multiple HCP namespaces run backups; change
ensureServiceAccount to create a per-job/per-credential ServiceAccount (e.g.,
include the HCP namespace and a short credential or job identifier derived from
resolvedCredentials and the HCP being reconciled) instead of reusing
jobServiceAccountName, update any RoleBinding/ensureRoleBinding logic to
reference that per-job ServiceAccount subject (adjust symbols
ensureServiceAccount and any ensureRoleBinding or RoleBinding creation), and
ensure cleanup of the per-job ServiceAccount/RoleBinding after job completion or
failure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3f26774b-d142-433f-a5fe-0126039ec61b

📥 Commits

Reviewing files that changed from the base of the PR and between 60802b1 and c92c851.

⛔ Files ignored due to path filters (1)
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
📒 Files selected for processing (7)
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md
  • docs/mkdocs.yml
  • hypershift-operator/controllers/etcdbackup/credentials.go
  • hypershift-operator/controllers/etcdbackup/credentials_test.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • hypershift-operator/controllers/etcdbackup/reconciler_test.go

Comment thread hypershift-operator/controllers/etcdbackup/reconciler.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
hypershift-operator/controllers/etcdbackup/reconciler_test.go (1)

1076-1092: Minor: Backup name generation may produce unexpected characters for larger test sets.

The expression string(rune('a'+i)) works correctly for i < 26, but if the test set were expanded beyond 26 items, it would produce non-letter characters. This is fine for the current test with only 3 items, but consider using fmt.Sprintf("backup-%d", i) for consistency with other similar tests in this file (e.g., line 1025).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/etcdbackup/reconciler_test.go` around lines
1076 - 1092, The backup name generation using string(rune('a'+i)) inside the
loop that creates HCPEtcdBackup objects can produce unexpected characters for
larger i; update the name to use a numeric format (e.g.,
fmt.Sprintf("backup-%d", i)) when constructing each HCPEtcdBackup in the for i
:= range 3 loop so names are predictable and consistent with other tests, and
add an import for fmt if not already present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hypershift-operator/controllers/etcdbackup/credentials.go`:
- Around line 64-79: When resolving the "cloud" secret in
resolveAzureCredentials, do not return credentialModeAzureWorkloadIdentity with
an empty ClientID; instead check if clientID == "" and if so log a warning
(e.g., klog.Warningf or processLogger) referencing secret.Name and the missing
AZURE_CLIENT_ID, then return a non-workload mode such as credentialModeUnknown
(or the existing fallback mode) so the ServiceAccount is not annotated with
azure.workload.identity/client-id:""; only set Mode to
credentialModeAzureWorkloadIdentity and populate ClientID when a non-empty
AZURE_CLIENT_ID is found.

---

Nitpick comments:
In `@hypershift-operator/controllers/etcdbackup/reconciler_test.go`:
- Around line 1076-1092: The backup name generation using string(rune('a'+i))
inside the loop that creates HCPEtcdBackup objects can produce unexpected
characters for larger i; update the name to use a numeric format (e.g.,
fmt.Sprintf("backup-%d", i)) when constructing each HCPEtcdBackup in the for i
:= range 3 loop so names are predictable and consistent with other tests, and
add an import for fmt if not already present.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3868b814-bbdb-466d-bad3-fbb6c8ba9d04

📥 Commits

Reviewing files that changed from the base of the PR and between c92c851 and b337ffc.

⛔ Files ignored due to path filters (1)
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
📒 Files selected for processing (7)
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md
  • docs/mkdocs.yml
  • hypershift-operator/controllers/etcdbackup/credentials.go
  • hypershift-operator/controllers/etcdbackup/credentials_test.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • hypershift-operator/controllers/etcdbackup/reconciler_test.go
✅ Files skipped from review due to trivial changes (3)
  • docs/mkdocs.yml
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
  • hypershift-operator/controllers/etcdbackup/credentials_test.go

Comment thread hypershift-operator/controllers/etcdbackup/credentials.go
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (2)
docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md (2)

277-289: ⚠️ Potential issue | 🟡 Minor

Add language tags to the troubleshooting code fences.

This issue was flagged in a previous review but remains unaddressed. The fenced code blocks at Lines 277-279 and 287-289 lack language tags, triggering markdownlint MD040. Adding text as the language identifier will resolve the warning without changing rendering.

📝 Proposed fix
 ### AWS STS: Access Denied
 
-```
+```text
 An error occurred (AccessDenied) when calling the <S3_OPERATION> operation
 ```

 ...

 ### Azure WI: No Matching Federated Identity Record
 
-```
+```text
 AADSTS700213: No matching federated identity record found for presented assertion subject
 ```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`
around lines 277 - 289, Two fenced code blocks containing the literal strings
"An error occurred (AccessDenied) when calling the <S3_OPERATION> operation" and
"AADSTS700213: No matching federated identity record found for presented
assertion subject" are missing language tags causing markdownlint MD040; fix by
adding the language identifier `text` immediately after the opening triple
backticks for both code fences (i.e., change the opening ``` to ```text for the
block containing the S3 AccessDenied message and for the block containing the
AADSTS700213 message) and keep the existing closing ``` unchanged.

88-117: ⚠️ Potential issue | 🟠 Major

Remove invalid IAM action names from the policy snippet; only s3:PutObject and s3:AbortMultipartUpload are needed.

This issue was flagged in a previous review but remains unaddressed. According to AWS documentation, s3:CreateMultipartUpload, s3:UploadPart, and s3:CompleteMultipartUpload are API operation names, not IAM policy action names. Multipart upload operations are authorized entirely by s3:PutObject on the target object. The policy should include only:

  • s3:PutObject (covers CreateMultipartUpload, UploadPart, and CompleteMultipartUpload operations)
  • s3:AbortMultipartUpload (valid IAM action for cleanup)

Update the policy snippet to remove Lines 100-102 and revise the explanatory note (Lines 114-116) to state that multipart upload is authorized by s3:PutObject and that s3:AbortMultipartUpload is included for cleanup.

🛠️ Proposed fix for the IAM policy
       "Action": [
         "s3:PutObject",
-        "s3:CreateMultipartUpload",
-        "s3:UploadPart",
-        "s3:CompleteMultipartUpload",
         "s3:AbortMultipartUpload"
       ],

And update the note:

 !!! note
 
-    The multipart permissions (`CreateMultipartUpload`, `UploadPart`, `CompleteMultipartUpload`, `AbortMultipartUpload`) are required because the upload uses the AWS SDK v2 Transfer Manager, which automatically splits files larger than 5 MB into multipart uploads. Etcd snapshots typically range from 30-100 MB, so multipart upload is used in practice. `AbortMultipartUpload` is needed to clean up incomplete uploads on failure.
+    The `s3:PutObject` permission authorizes all multipart upload operations (CreateMultipartUpload, UploadPart, CompleteMultipartUpload) used by the AWS SDK v2 Transfer Manager, which automatically splits files larger than 5 MB into multipart uploads. Etcd snapshots typically range from 30-100 MB, so multipart upload is used in practice. `s3:AbortMultipartUpload` is included to clean up incomplete uploads on failure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`
around lines 88 - 117, Update the IAM policy JSON and explanatory note: remove
the invalid IAM actions "s3:CreateMultipartUpload", "s3:UploadPart", and
"s3:CompleteMultipartUpload" from the Action array and keep only "s3:PutObject"
and "s3:AbortMultipartUpload" (so the statement lists s3:PutObject and
s3:AbortMultipartUpload for the bucket and objects), and revise the accompanying
note to state that multipart uploads are authorized by s3:PutObject (which
covers CreateMultipartUpload/UploadPart/CompleteMultipartUpload operations) and
that s3:AbortMultipartUpload is included solely for cleanup of incomplete
uploads.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`:
- Around line 15-27: There is a name mismatch between the destination Secret
shown in the explanation table ("cloud-credentials") and the HCPEtcdBackup CR
examples that reference "etcd-backup-creds"; pick the correct canonical Secret
name used by the OADP HyperShift plugin and make the docs consistent by either
(a) updating the table to show "etcd-backup-creds" if the plugin actually
creates that name, or (b) updating both HCPEtcdBackup CR examples to reference
"cloud-credentials" if that is the source of truth; additionally, add one short
clarifying sentence stating whether the plugin auto-creates and copies the
Secret (so no manual creation is required) or whether users must
create/reference a different Secret name, and apply the same change to both
HCPEtcdBackup examples currently showing the alternate name.
- Around line 275-284: Update the permissions listed in the "AWS STS: Access
Denied" troubleshooting bullet so it matches the corrected IAM policy: replace
the full multipart upload action list with only the two actions allowed by the
policy—s3:PutObject and s3:AbortMultipartUpload—so the bullet under "AWS STS:
Access Denied" contains only `s3:PutObject` and `s3:AbortMultipartUpload` and
remove `s3:CreateMultipartUpload`, `s3:UploadPart`, and
`s3:CompleteMultipartUpload`.

---

Duplicate comments:
In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`:
- Around line 277-289: Two fenced code blocks containing the literal strings "An
error occurred (AccessDenied) when calling the <S3_OPERATION> operation" and
"AADSTS700213: No matching federated identity record found for presented
assertion subject" are missing language tags causing markdownlint MD040; fix by
adding the language identifier `text` immediately after the opening triple
backticks for both code fences (i.e., change the opening ``` to ```text for the
block containing the S3 AccessDenied message and for the block containing the
AADSTS700213 message) and keep the existing closing ``` unchanged.
- Around line 88-117: Update the IAM policy JSON and explanatory note: remove
the invalid IAM actions "s3:CreateMultipartUpload", "s3:UploadPart", and
"s3:CompleteMultipartUpload" from the Action array and keep only "s3:PutObject"
and "s3:AbortMultipartUpload" (so the statement lists s3:PutObject and
s3:AbortMultipartUpload for the bucket and objects), and revise the accompanying
note to state that multipart uploads are authorized by s3:PutObject (which
covers CreateMultipartUpload/UploadPart/CompleteMultipartUpload operations) and
that s3:AbortMultipartUpload is included solely for cleanup of incomplete
uploads.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4351cd03-c94a-4676-b1ee-0ef34059bcec

📥 Commits

Reviewing files that changed from the base of the PR and between b337ffc and f26d30c.

⛔ Files ignored due to path filters (1)
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
📒 Files selected for processing (3)
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md
  • docs/mkdocs.yml
✅ Files skipped from review due to trivial changes (2)
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
  • docs/mkdocs.yml

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
hypershift-operator/controllers/etcdbackup/reconciler.go (1)

61-62: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Shared ServiceAccount still creates a cross-namespace Azure WI race.

jobServiceAccountName is global in the operator namespace, but annotation state is mutated per reconcile. Because active-job gating is per HCP namespace, concurrent backups in different HCP namespaces can overwrite/remove azure.workload.identity/client-id on the same SA before pod admission.

Also applies to: 560-575, 625-629, 806-807

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go` around lines 61 -
62, The global jobServiceAccountName leads to cross-namespace races because the
controller mutates the azure.workload.identity/client-id annotation per HCP
during Reconcile; change to use a namespace-scoped ServiceAccount name (e.g.,
derive SA name from jobServiceAccountName plus the HCP namespace or HCP UID) so
each HCP gets its own SA, update all places that reference jobServiceAccountName
(including the blocks around the earlier mentions) to use the derived per-HCP
name, and ensure all annotation reads/writes target that per-HCP ServiceAccount
rather than a single shared SA to avoid concurrent annotation overwrites.
docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md (1)

25-26: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Azure detection description is internally inconsistent with resolver precedence.

Line 25 says Azure falls back from cloud to credentials, but Line 230 says finding cloud triggers WI. In hypershift-operator/controllers/etcdbackup/credentials.go (Lines 65-79), a present cloud key with missing AZURE_CLIENT_ID returns managed-identity immediately (no credentials fallback). Please align this section to the actual precedence to avoid misconfiguration.

Also applies to: 230-230

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`
around lines 25 - 26, Documentation text in managed-services-credentials.md
incorrectly states Azure falls back from `cloud` to `credentials` but the actual
resolver in hypershift-operator/controllers/etcdbackup/credentials.go (function
handling detection around lines 65-79) treats a present `cloud` key with missing
AZURE_CLIENT_ID as immediate managed-identity (no `credentials` fallback).
Update both occurrences (around lines 25 and 230) to reflect the real
precedence: explain that presence of the `cloud` key triggers the
managed-identity flow unless explicit AZURE client credentials
(AZURE_CLIENT_ID/CLIENT_SECRET/TENANT_ID) are present, and clarify there is no
automatic fallback to `credentials` when `cloud` exists but AZURE_* values are
absent; reference the resolver behavior in
hypershift-operator/controllers/etcdbackup/credentials.go to ensure wording
matches code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In
`@docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md`:
- Around line 25-26: Documentation text in managed-services-credentials.md
incorrectly states Azure falls back from `cloud` to `credentials` but the actual
resolver in hypershift-operator/controllers/etcdbackup/credentials.go (function
handling detection around lines 65-79) treats a present `cloud` key with missing
AZURE_CLIENT_ID as immediate managed-identity (no `credentials` fallback).
Update both occurrences (around lines 25 and 230) to reflect the real
precedence: explain that presence of the `cloud` key triggers the
managed-identity flow unless explicit AZURE client credentials
(AZURE_CLIENT_ID/CLIENT_SECRET/TENANT_ID) are present, and clarify there is no
automatic fallback to `credentials` when `cloud` exists but AZURE_* values are
absent; reference the resolver behavior in
hypershift-operator/controllers/etcdbackup/credentials.go to ensure wording
matches code.

In `@hypershift-operator/controllers/etcdbackup/reconciler.go`:
- Around line 61-62: The global jobServiceAccountName leads to cross-namespace
races because the controller mutates the azure.workload.identity/client-id
annotation per HCP during Reconcile; change to use a namespace-scoped
ServiceAccount name (e.g., derive SA name from jobServiceAccountName plus the
HCP namespace or HCP UID) so each HCP gets its own SA, update all places that
reference jobServiceAccountName (including the blocks around the earlier
mentions) to use the derived per-HCP name, and ensure all annotation
reads/writes target that per-HCP ServiceAccount rather than a single shared SA
to avoid concurrent annotation overwrites.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: e0f9440f-58d1-4e2a-96e1-eb34dd0d9147

📥 Commits

Reviewing files that changed from the base of the PR and between f26d30c and c3fc037.

⛔ Files ignored due to path filters (1)
  • docs/content/reference/aggregated-docs.md is excluded by !docs/content/reference/aggregated-docs.md
📒 Files selected for processing (7)
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md
  • docs/mkdocs.yml
  • hypershift-operator/controllers/etcdbackup/credentials.go
  • hypershift-operator/controllers/etcdbackup/credentials_test.go
  • hypershift-operator/controllers/etcdbackup/reconciler.go
  • hypershift-operator/controllers/etcdbackup/reconciler_test.go
✅ Files skipped from review due to trivial changes (2)
  • docs/content/how-to/disaster-recovery/etcd-snapshot-backup/index.md
  • docs/mkdocs.yml

Copy link
Copy Markdown
Member

@cblecker cblecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice separation of the credential resolution into its own file — the detection logic and test coverage look solid. Two small items inline.


Step-by-step description of the restore process: how the OADP plugin injects the snapshot URL, how the Control Plane Operator restores etcd, and how the cluster recovers.

### [Managed Services Credentials](managed-services-credentials.md)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new page also needs an entry in docs/mkdocs.yml under the "Etcd Snapshot Backup (Tech Preview)" nav section, otherwise it won't show up in the sidebar. Something like:

- how-to/disaster-recovery/etcd-snapshot-backup/managed-services-credentials.md

after the restore-flow.md entry.

case hyperv1.AzureBlobBackupStorage:
return resolveAzureCredentials(secret)
default:
return resolvedCredentials{Mode: credentialModeAWSStatic, SecretName: secret.Name}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this default case silently returns credentialModeAWSStatic for any unknown storage type. The callers (getCredentialSecretName, buildUploadArgs) already error on unsupported types so it's not a bug today, but if a new storage type gets added later this could mask issues. Worth considering returning a distinct unknown mode or logging a warning here.

jparrill and others added 4 commits May 7, 2026 10:15
The HCPEtcdBackup controller now inspects the referenced credential
Secret to determine the authentication mode instead of always passing
--credentials-file. This enables IRSA/STS on AWS (ROSA HCP) and
Workload Identity on Azure (ARO HCP) without API changes.

Detection logic:
- AWS STS: Secret key "credentials" contains an ARN prefix
- AWS static: Secret key "credentials" contains INI-format creds
- Azure WI: Secret key "cloud" with AZURE_CLIENT_ID env format
- Azure client-secret: Secret key "credentials" with clientSecret JSON
- Azure managed-identity: fallback for other credential formats

Job PodSpec is built per mode:
- AWS STS: projected SA token volume + AWS_ROLE_ARN env var
- Azure WI: pod label azure.workload.identity/use=true, SA annotated
- Static/client-secret: credentials Secret mounted as volume
- Azure modes: --azure-auth-type arg set accordingly

Ref: CNTRLPLANE-3167

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
…ckup

Document credential auto-detection for ROSA HCP (AWS STS/IRSA) and
ARO HCP (Azure Workload Identity) in the etcd snapshot backup flow.
Covers OADP plugin Secret flow with key remapping, IAM/federated
credential prerequisites, and troubleshooting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
…dBackup

For Azure Workload Identity mode, if the etcd-backup-job ServiceAccount
already has an azure.workload.identity/client-id annotation (e.g., set
by infrastructure or Helm), it is preserved. Otherwise, the annotation
is set from the credential secret, matching the original behavior for
self-managed HyperShift deployments.

This allows managed platforms (ARO-HCP) to use a dedicated managed
identity for the backup job while self-managed deployments can continue
reusing the Velero identity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ckup

Replace the findActiveJob guard in cleanupResources with
hasNonTerminalBackup, which checks for any non-terminal HCPEtcdBackup
CR in the same namespace. This closes the race window where a new
backup CR has been created but its Job has not been spawned yet,
preventing premature deletion of shared RBAC and NetworkPolicy
resources.

Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Juan Manuel Parrilla Madrid <jparrill@redhat.com>
@jparrill jparrill force-pushed the CNTRLPLANE-3167 branch from 50a1270 to 313e3da Compare May 7, 2026 14:15
@hypershift-jira-solve-ci
Copy link
Copy Markdown

hypershift-jira-solve-ci Bot commented May 7, 2026

Now I have the complete picture. Here is the report:

Test Failure Analysis Complete

Job Information

  • Prow Job: codecov/project
  • Build ID: Check run 74836260351
  • PR: #8368CNTRLPLANE-3167: Auto-detect credential type for HCPEtcdBackup
  • Commit: 313e3da
  • Base Commit: 76d06fa (main)

Test Failure Analysis

Error

codecov/project: 33.03% (-4.47%) compared to 76d06fa

Summary

The codecov/project check run 74836260351 failed transiently at 14:22:01 UTC because Codecov computed coverage from an incomplete upload — only a subset of coverage flags had reported at that point, causing a phantom drop from 37.50% to 33.03% (−4.47%, 187 files and 35,887 lines appearing to vanish). Approximately 2 minutes later, at 14:23:55 UTC, a second codecov/project check run (74836651183) was created after all coverage flags finished uploading, correctly reporting 37.60% (+0.09%) and concluding as success. The codecov/project check currently shows passing on the PR. This failure is fully resolved and requires no action.

Root Cause

Codecov uses a flag-based coverage system where different test suites (e.g., cmd-support, cpo-hostedcontrolplane, cpo-other, hypershift-operator, other) upload their coverage profiles independently. When the first upload completed, Codecov prematurely evaluated the codecov/project status using only the partial data:

  • Partial report (14:22:01): 564 files, 56,105 lines, 18,537 hits → 33.03%
  • Complete report (14:23:55): 752 files, 92,137 lines, 34,645 hits → 37.60%

The difference (187 missing files, 35,887 missing lines) corresponds exactly to the coverage flags that had not yet uploaded. This is a known Codecov behavior: when wait_for_ci: false is set (as it is in this repo's codecov.yml), Codecov does not wait for all CI jobs to finish before posting results. It creates an initial check run with partial data, then updates (or replaces) it when more data arrives.

The failing check run 74836260351 was superseded by the passing check run 74836651183 — both are on the same commit 313e3da. The PR's current codecov/project status is passing. The PR's own code changes have excellent coverage: 95.52% patch coverage with only 9 lines uncovered across reconciler.go (4 missing + 3 partials) and credentials.go (2 missing), and overall project coverage actually increased by +0.09%.

Recommendations
  1. No action required — the codecov/project check has already self-resolved to success (check run 74836651183, 37.60% / +0.09%). The failure was transient.
  2. Consider setting wait_for_ci: true — the current codecov.yml has wait_for_ci: false, which causes Codecov to evaluate coverage before all flag uploads complete. Changing this to true would prevent transient failures from partial uploads:
    codecov:
      branch: main
      notify:
        wait_for_ci: true
  3. Consider adding a project coverage threshold — the repo has no explicit coverage.status.project configuration, so Codecov uses the auto target (any decrease = failure). Adding a threshold would tolerate minor fluctuations:
    coverage:
      status:
        project:
          default:
            threshold: 1%
Evidence
Evidence Detail
Failing check run 74836260351failure at 14:22:01 UTC — 33.03% (−4.47%)
Passing check run 74836651183success at 14:23:55 UTC — 37.60% (+0.09%)
Time between runs ~2 minutes (partial → complete upload)
Partial report stats 564 files, 56,105 lines, 18,537 hits
Complete report stats 752 files, 92,137 lines, 34,645 hits
Missing in partial 187 files, 35,887 lines (exactly the unflagged uploads)
PR patch coverage 95.52% — 9 lines uncovered in 2 files
Current PR status codecov/project = pass, codecov/patch = pass
codecov.yml config wait_for_ci: false — evaluates before all uploads arrive
Coverage target None configured — uses Codecov auto default (any drop = fail)

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

@jparrill: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Copy Markdown
Member

@cblecker cblecker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more things beyond the inline comments:

The mode-specific checks (creds.Mode == credentialModeAWSSTS, creds.Mode == credentialModeAzureWorkloadIdentity) are scattered across reconciler.go in four places. Moving these to behavioral methods on resolvedCredentials (like needsProjectedToken(), needsWorkloadIdentityLabel(), azureAuthType()) would keep all mode knowledge co-located in credentials.go and make the reconciler a pure consumer of queries. Not blocking, but would clean up nicely.

Also, there's no test that verifies the precedence order when both cloud and credentials keys are present in the same secret. The resolver checks cloud first, which is the right behavior, but a test documenting that priority would prevent future regressions.

for line := range strings.SplitSeq(string(cloudData), "\n") {
line = strings.TrimSpace(line)
if v, ok := strings.CutPrefix(line, "AZURE_CLIENT_ID="); ok {
clientID = v
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolveAWSCredentials correctly trims whitespace from the credentials value with strings.TrimSpace, but this path doesn't trim the extracted clientID. If the secret data has Windows-style line endings (\r\n) or trailing whitespace, you'd end up with clientID = "client-789\r" which would be set as the SA annotation and silently fail against Azure AD.

Worth adding clientID = strings.TrimSpace(v) here for consistency with the AWS path.

var creds struct {
ClientSecret string `json:"clientSecret"`
}
if err := json.Unmarshal(credData, &creds); err == nil && creds.ClientSecret != "" {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the credentials key contains malformed JSON (truncated secret, wrong encoding, a typo), this silently falls through to managed-identity mode with no logging or feedback. A user who provides a client-secret JSON file with a typo would get managed-identity auth instead, and the root cause would be hard to track down.

Even without changing the function signature, a log line here when err != nil would help operators diagnose misconfigured secrets.

if sa.Annotations == nil {
sa.Annotations = map[string]string{}
}
if sa.Annotations["azure.workload.identity/client-id"] == "" {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This preserves an existing annotation, which makes sense for infrastructure-managed SAs. But it also means if the credential secret is rotated to a different Azure client ID, the SA annotation won't be updated — the backup job would continue using the old client ID.

Might be worth at least logging when the existing annotation value differs from the resolved creds.ClientID, so operators can tell why the "wrong" identity is being used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/documentation Indicates the PR includes changes for documentation area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants