213 changes: 213 additions & 0 deletions hyperfleet/docs/release-contract.md
@@ -0,0 +1,213 @@
---
Status: Draft
Owner: HyperFleet Team
Last Updated: 2026-04-11
---

# HyperFleet Release Contract

## What & Why

### What

This document defines the formal release contract between the HyperFleet team and its consumer teams (GCP Offering Team and ROSA Regional Platform Team). It covers:

- **Release handoff contract**: what artifacts are produced, how consumers are notified, and what SLAs apply
- **Integration testing strategy**: how each team gates HyperFleet changes and how tests are coordinated across teams
- **Test ownership map**: which team owns which layer of testing to eliminate redundancy and close coverage gaps

### Why

- Without a defined contract, release handoffs are ad-hoc and require manual coordination between teams, slowing delivery and increasing error risk
- Unclear test ownership creates either coverage gaps (bugs reaching production) or overlapping test suites (longer pipelines without benefit)
- Consumer teams deploying HyperFleet via Argo CD and Terraform need predictable, machine-consumable artifacts (OCI Helm charts) to automate rollout
- A shared testing strategy is required before building confidence in continuous delivery pipelines end-to-end

### Out of scope

- No automated deployment of HyperFleet releases to consumer integration environments

---

## Consumer Teams

| Team | Platform | Deployment Method |
|------|----------|-------------------|
| GCP Offering Team | GCP | Argo CD + Terraform + Tekton Pipelines |
| ROSA Regional Platform | AWS ROSA | Argo CD + Terraform + AWS CodePipelines |

### ROSA Platform Architecture

The ROSA regional platform consumes HyperFleet as part of a GitOps deployment pipeline. Each deployment initiates three pipelines:

```mermaid
flowchart TD
PR[HyperFleet PR / Release] --> OCI[OCI Helm Chart Registry]
OCI --> BOM[Bill of Materials\nenvironment default config file]
BOM --> P1[Pipeline 1\nEntry Point]
P1 --> P2[Pipeline 2\nRegional Cluster Provisioning\nTerraform]
P1 --> P3[Pipeline 3\nManagement Cluster Provisioning\nArgo CD]
P2 --> ENV[Full Environment]
P3 --> ENV
```

Review comment (Contributor): When I see you add a mermaid diagram it warms my heart 😆

Environment configuration is centralized in a `default` file that acts as the bill of materials for Argo CD reconciliation. Component versions, Git revisions, and domain names are defined there and can be overridden per environment.
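
The override behaviour described above can be sketched as a recursive merge. This is a minimal illustration, not the ROSA implementation: the key names (`components`, `git_revision`, `domain`) and the example values are hypothetical, and the real `default` file format is not shown in this document.

```python
from copy import deepcopy


def merge_config(default: dict, override: dict) -> dict:
    """Return a copy of `default` with `override` applied recursively."""
    merged = deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested sections are merged key by key.
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists in the override replace the default.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical bill of materials; key names are illustrative only.
DEFAULT = {
    "components": {
        "hyperfleet-api": {"version": "v0.1.0"},
        "hyperfleet-sentinel": {"version": "v0.1.0"},
    },
    "git_revision": "main",
    "domain": "example.com",
}

# Hypothetical per-environment override: only the values that differ.
INTEGRATION_OVERRIDE = {
    "components": {"hyperfleet-api": {"version": "v0.1.1"}},
    "domain": "int.example.com",
}

resolved = merge_config(DEFAULT, INTEGRATION_OVERRIDE)
```

Only the keys named in the override change; everything else falls through from the defaults, which is what lets a single file act as the bill of materials.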

### GCP Platform Architecture

Review comment (Contributor): Should we include this? 🤔 It might be hard to keep up to date; it would be good to link to their docs instead, if they have them.

The GCP platform uses a **three-tier multi-region model**: a single global control plane manages one or more region clusters, each of which may host one or more management clusters for Hypershift-based customer control planes.

```mermaid
flowchart TD
HF[HyperFleet PR / Release] --> TAG[Git Tag on Component Repo\ne.g. v0.1.0]
TAG --> ARGOAPP[ArgoCD Application\ntargetRevision: vX.Y.Z]
ARGOAPP --> RENDER[render.py\nargocd/config → argocd/rendered]
RENDER --> GIT[Git Commit\ngcp-hcp-infra repo]
GIT --> GLOBAL[Global Cluster\nArgoCD Control Plane]
GLOBAL --> REGION[Region Cluster\nhyperfleet-api\nhyperfleet-sentinel\nhyperfleet-cloud-resources]
GLOBAL --> MGT[Management Cluster\nHypershift Operator\nMaestro Agent]
REGION --> HC[Customer Hosted Clusters\nvia gcphcp CLI]
```

HyperFleet is deployed as three Argo CD applications on each region cluster:

| Application | Source | Version Pinning |
|-------------|--------|-----------------|
| `hyperfleet-api` | `github.com/openshift-hyperfleet/hyperfleet-api` | Git tag (`targetRevision: vX.Y.Z`) |
| `hyperfleet-sentinel` | `github.com/openshift-hyperfleet/hyperfleet-sentinel` | Git tag (`targetRevision: vX.Y.Z`) |
| `hyperfleet-cloud-resources` | `gcp-hcp-infra` repo (local Helm chart) | Repository revision — provisions GCP Pub/Sub topics and IAM bindings |

Versions are hardcoded in Argo CD Application manifests under `argocd/config/region/{app}/template.yaml`. Updating a HyperFleet version requires editing that file, re-rendering with `uv run argocd/scripts/render.py`, and committing the result; Argo CD then auto-syncs.
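
The manual edit step could be scripted along these lines. This is a sketch under assumptions: `pin_version` is a hypothetical helper, not part of `gcp-hcp-infra`, and the sample manifest is an illustrative fragment rather than the real `template.yaml`. It only rewrites the text; re-rendering and committing remain separate steps.

```python
import re

# Matches a line such as "    targetRevision: v0.1.0" and captures the prefix.
TARGET_REVISION = re.compile(
    r"^([ \t]*targetRevision:[ \t]*)v\d+\.\d+\.\d+[ \t]*$", re.MULTILINE
)
TAG = re.compile(r"v\d+\.\d+\.\d+")


def pin_version(manifest: str, new_tag: str) -> str:
    """Return `manifest` with its single pinned targetRevision set to `new_tag`."""
    if not TAG.fullmatch(new_tag):
        raise ValueError(f"not a vX.Y.Z tag: {new_tag!r}")
    updated, count = TARGET_REVISION.subn(lambda m: m.group(1) + new_tag, manifest)
    if count != 1:
        # Refuse to guess when the file has no pin or more than one.
        raise ValueError(f"expected exactly one pinned targetRevision, found {count}")
    return updated


# Illustrative fragment only; the real template may contain more fields.
SAMPLE = """\
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: hyperfleet-api
spec:
  source:
    repoURL: https://github.com/openshift-hyperfleet/hyperfleet-api
    targetRevision: v0.1.0
"""

bumped = pin_version(SAMPLE, "v0.2.0")
```

Failing loudly when the pin count is not exactly one keeps the helper from silently editing the wrong line in a manifest with several sources.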

GCP-specific infrastructure (Pub/Sub topics and subscriptions, IAM bindings for Sentinel) is deployed by `hyperfleet-cloud-resources` ahead of the HyperFleet applications (sync wave −5 vs wave 0) using Config Connector managed by Argo CD.

Integration tests run as **Tekton Pipelines on-cluster** (no Prow or GitHub Actions). The `gcp-region-e2e-pipeline` pipeline provisions a full GCP environment via Terraform, verifies Argo CD sync on both region and management clusters, and optionally runs hosted cluster lifecycle tests (`hostedcluster-e2e` task). A nightly CronJob triggers this pipeline at 02:00 UTC against the `main` branch. Cleanup always runs in a `finally` block, deleting the ephemeral GCP project and clearing Terraform state.

---

## Release Handoff Contract

### Release Artifacts

For each HyperFleet release, the following artifacts are produced and made available to consumer teams:

| Artifact | Location | Format | Notes |
|----------|----------|--------|-------|
| Container images | `quay.io/openshift-hyperfleet/hyperfleet-{component}:{version}` | OCI image | Built automatically by Prow on GA tag |
| Helm charts | OCI registry (see [Helm Chart Distribution](#helm-chart-distribution)) | OCI artifact | Required for ROSA/Argo CD consumption |
| Release notes | `hyperfleet-release` repo, `releases/release-X.Y/` | Markdown | Compatibility matrix, breaking changes, upgrade guide |
| Compatibility matrix | `hyperfleet-release` repo | Markdown table | Maps validated component version combinations |
| Git tags | Per-component repos + `hyperfleet-release` | `vX.Y.Z` / `release-X.Y` | See [Release Process](hyperfleet-release-process.md) |

Review comment (Contributor, on the container images row): Leaving a reminder for myself: after Konflux onboarding, we need to create a ticket to update all references to the newest image registry.
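
The naming patterns in the artifacts table can be expressed as small helpers. This is a sketch only: `image_ref` and `release_line` are hypothetical names, and mapping a component tag `vX.Y.Z` to `release-X.Y` is an assumption read off the placeholders in the table, not a rule confirmed by the release process doc.

```python
import re

IMAGE_TEMPLATE = "quay.io/openshift-hyperfleet/hyperfleet-{component}:{version}"
COMPONENT_TAG = re.compile(r"v(\d+)\.(\d+)\.(\d+)")


def image_ref(component: str, version: str) -> str:
    """Build the container image reference for one component and version."""
    return IMAGE_TEMPLATE.format(component=component, version=version)


def release_line(component_tag: str) -> str:
    """Map a component tag such as v0.1.1 to its release line (assumed release-X.Y)."""
    match = COMPONENT_TAG.fullmatch(component_tag)
    if match is None:
        raise ValueError(f"not a vX.Y.Z tag: {component_tag!r}")
    major, minor, _patch = match.groups()
    return f"release-{major}.{minor}"


api_image = image_ref("api", "v0.1.1")
notes_dir = f"releases/{release_line('v0.1.1')}/"
```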


When a GA release is published, it records which ROSA/GCP versions passed the integration tests; this record serves as the compatibility matrix. This makes it possible to introduce a breaking change in a release that may be deployable by only one of the pillars.

Review comment (Contributor): How should we obtain ROSA/GCP version information? Do we need to manually check with those teams for every HyperFleet release?

Review comment (Collaborator, Author): The ROSA/GCP teams will provide a way to run their pipelines with a specific stable version of their solution for testing ours. We will take the version running there for the compatibility matrix.

Review comment (Contributor): Question: will they have a "version", or will it just be a point in time? This is something we struggled with in OCM: we have no "version" for our environments, just SHAs at specific points in time.

### Helm Chart Distribution

**Current state**: ROSA consumes HyperFleet charts via Argo CD ApplicationSets that point directly to GitHub repos with a pinned `targetRevision` Git tag (e.g., `targetRevision: v0.1.1` on `https://github.com/openshift-hyperfleet/hyperfleet-adapter`). A freshly configured Argo CD instance does not support Git-sourced Helm charts without a plugin, which ROSA has not installed. This creates a tight coupling between HyperFleet Git tags and ROSA's deployment cadence.

**Agreed path**:

1. **Short-term (Q2 2026)**: ROSA team sets up a temporary OCI registry to publish HyperFleet Helm charts. This unblocks integration testing immediately.
2. **Q2 target**: HyperFleet team publishes charts to an OCI-compliant registry via Konflux as part of the release pipeline, eliminating the temporary workaround and the Git coupling.

Review comment (Contributor): I think you mean Konflux not Conflux.


### Notification SLA

When a HyperFleet GA release is published:

| Event | Channel | Timeline | Recipients |
|-------|---------|----------|------------|
| Release candidate available | `#hyperfleet-releases` Slack | RC cut day | GCP team, ROSA team |
| GA release published | `#hyperfleet-releases` Slack | GA day | GCP team, ROSA team |
| Breaking change in next release | `#hyperfleet-releases` Slack | ≥ 1 sprint before GA | GCP team, ROSA team |
| Hotfix / patch release | `#hyperfleet-releases` Slack | Within 2 hours of GA tag | GCP team, ROSA team |

Review comment (Contributor, on the breaking change row): Do you mean notifying about breaking changes via Slack at a summary level? Will the notification link to each component repo for breaking change details?

Review comment (Collaborator, Author): Details will be in the components, yes; no need to duplicate. But this way we have a trace of when we communicated these breaking changes to our consumers.


At this point in time (April 26), breaking changes are not blockers to HyperFleet releases, as ROSA/GCP teams do not have to keep long-running clusters and migrate data.

Review comment (Contributor): Have we already aligned on this date (April 26)?

Review comment (Contributor):
⚠️ Potential issue | 🟠 Major

Remove hardcoded date from normative policy text.

Line 127 embeds a point-in-time statement (April 26) in a rule section. This will go stale and can create policy ambiguity. Prefer a versioned/status-based qualifier instead of a calendar date in the sentence.

Suggested edit
-At this point in time (April 26) breaking changes are not blockers to HyperFleet releases as ROSA/GCP teams do not have to keep long running clusters and migrate data.
+For the current MVP phase, breaking changes are not blockers to HyperFleet releases, since ROSA/GCP teams do not maintain long-running clusters with data migration requirements.



### Rollback / Recovery

HyperFleet uses a **roll-forward** strategy for MVP: issues are fixed via patch releases rather than rollback. See [Release Process — Release Recovery Strategy](hyperfleet-release-process.md#55-release-recovery-strategy).

Review comment (Contributor): Do we plan to continue using a roll-forward strategy for a period after MVP, as @ciaranRoche mentioned? I'll also update the release process doc accordingly.

Review comment (Contributor): Yes, I think we should maintain a roll-forward strategy. We can confirm at the next office hours to ensure this aligns with their expectations.

HyperFleet commits to (exact times TBD):

- Producing a patch release within **48 hours** for Blocker/Critical regressions
- Producing a patch release within **1 week** for Major regressions
- Maintaining N-1 backward compatibility so consumer teams can remain pinned to the previous validated release while a fix is in flight
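
The patch commitments above translate into a simple deadline calculation. This is a minimal sketch: `patch_deadline` is a hypothetical helper, the durations mirror the draft values in the list (which the document marks as TBD), and starting the clock at the time the regression is reported is an assumption, since the contract does not yet say when it starts.

```python
from datetime import datetime, timedelta, timezone

# Draft commitments from the list above; exact times are still TBD.
PATCH_SLA = {
    "Blocker": timedelta(hours=48),
    "Critical": timedelta(hours=48),
    "Major": timedelta(weeks=1),
}


def patch_deadline(severity: str, reported_at: datetime) -> datetime:
    """Return the latest time a patch release is due for a regression.

    Assumes the clock starts when the regression is reported.
    """
    try:
        return reported_at + PATCH_SLA[severity]
    except KeyError:
        raise ValueError(f"no patch commitment for severity {severity!r}") from None


reported = datetime(2026, 4, 13, 9, 0, tzinfo=timezone.utc)
critical_due = patch_deadline("Critical", reported)
major_due = patch_deadline("Major", reported)
```

Severities outside the table raise an error rather than returning a default, because the contract makes no commitment for them.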

---

## Integration Testing Strategy

### Decision: Nightly Runs with OCI Chart Injection

**Agreed approach** (as of March 31, 2026 meeting):

- Start with **nightly runs** against HyperFleet `main` branch, not presubmit jobs
- Test against the **latest known-good stable version** of the ROSA regional platform (production Maestro version), replacing only the HyperFleet component under test
- The ROSA team will **temporarily enable OCI chart pushing** so the HyperFleet team can inject PR-built charts into the ROSA deployment pipeline
- Evaluate **non-blocking presubmit** integration with the HyperFleet release repository as a follow-up

Review comment (Contributor): Would it be better to add similar Prow jobs, like the E2E testing Prow job, as a standard requirement and include them in the release X.Y checklist?

**Rationale**: Running full ROSA environment provisioning (~40 minutes + E2E duration) as a presubmit would significantly impact development velocity without proportional benefit at the current team scale. Nightly runs provide meaningful feedback without blocking day-to-day development.

**Note on ROSA's existing pre-merge capability**: The ROSA repo already has a working cross-component E2E pre-merge mechanism (triggered via Prow comment on PRs). The decision to start with nightly runs is about HyperFleet's readiness to onboard to that mechanism — not a limitation of the ROSA infrastructure. Per-PR testing remains the target once the OCI chart injection step is stable.


### Team Test Ownership

| Layer | Owner | Scope | Runs on |
|-------|-------|-------|---------|
| Unit tests | HyperFleet | Each component in isolation | Every PR (presubmit) |
| Integration tests | HyperFleet | Cross-component API contracts | Every PR (presubmit) |
| HyperFleet E2E | HyperFleet | HyperFleet stack end-to-end | Nightly (main branch) |
| ROSA integration | ROSA Team | Full ROSA region + HyperFleet override | Nightly (HyperFleet main) |
| GCP integration | GCP Team | GCP deployment + HyperFleet | Nightly (HyperFleet main) via Tekton `gcp-region-e2e-pipeline` |
| Release gate | HyperFleet | All of the above must pass | Before GA tag |

Review comment (Contributor, on the ROSA integration row): Who is responsible for updating the configuration for breaking changes in nightly integration testing? And how will that be communicated?
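
The release gate row in the ownership table ("all of the above must pass") can be sketched as a check over per-layer results. This is illustrative only: `release_gate` is a hypothetical function, the results dictionary is an assumed input shape, and treating a missing result as a failure is an assumption rather than agreed policy.

```python
# Layers from the ownership table that the gate depends on.
REQUIRED_LAYERS = (
    "Unit tests",
    "Integration tests",
    "HyperFleet E2E",
    "ROSA integration",
    "GCP integration",
)


def release_gate(results: dict[str, bool]) -> tuple[bool, list[str]]:
    """Return (gate_open, blocking_layers) for a set of layer results.

    A layer with no recorded result counts as blocking (assumption).
    """
    blocking = [layer for layer in REQUIRED_LAYERS if not results.get(layer, False)]
    return (not blocking, blocking)


# Example: every layer passed except GCP integration, which has no result yet.
nightly = {
    "Unit tests": True,
    "Integration tests": True,
    "HyperFleet E2E": True,
    "ROSA integration": True,
}
gate_open, blocking = release_gate(nightly)
```

Returning the blocking layers alongside the verdict makes it clear which owning team needs to act before the GA tag.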

Review comment (Contributor, on lines +160 to +168):

⚠️ Potential issue | 🟠 Major

Resolve the release-gate contradiction in the ownership matrix.

Line 140 marks GCP integration as TBD, while Line 141 requires all layers to pass before GA. This is internally inconsistent and makes GA criteria non-operable.

Suggested wording adjustment
-| GCP integration | GCP Team | GCP deployment + HyperFleet | TBD (ref: GCP-334) |
-| Release gate | HyperFleet | All of the above must pass | Before GA tag |
+| GCP integration | GCP Team | GCP deployment + HyperFleet | TBD (ref: GCP-334) |
+| Release gate | HyperFleet | All mandatory layers must pass; GCP integration is advisory until GCP-334 is complete | Before GA tag |

### Testing Gaps Identified

| Gap | Owning Team | Mitigation |
|-----|-------------|------------|
| HyperFleet not yet onboarded to ROSA's pre-merge E2E mechanism | HyperFleet | Onboard to `openshift/release` Prow config + create `quay.io/rrp-dev-ci/` image repos (see onboarding steps above) |
| Helm chart override (OCI) not yet wired into ROSA CI | ROSA + HyperFleet | Temporary OCI setup by ROSA team (Q2 2026, immediate action); replaced by Konflux Q2 target |
| GCP integration tests not yet publishing chart overrides into HyperFleet CI | GCP + HyperFleet | Blocked on GCP-334 (CLM/Argo CD integration in progress); nightly Tekton pipeline already exists but consumes pinned tags, not PR-built charts |
| Multi-component PR testing (API + Adapter in same PR) | HyperFleet | Nightly tests use `main` for all other components; single-component override per nightly run is the starting point |
| Presubmit integration gate for HyperFleet release repo | HyperFleet | Future action: non-blocking presubmit on `hyperfleet-release` repo |

Review comment (Contributor, on the Helm chart override row): Conflux -> Konflux


---

## Alternatives Considered

### 1. Non-blocking Presubmit on HyperFleet Release Repository

Run the full ROSA integration pipeline as an optional, non-blocking presubmit job triggered on the `hyperfleet-release` repo.

**Rejected for now**: A ~40-minute+ non-blocking job provides weak signal — developers may ignore it, especially if failures are infrequent. Starting with nightly runs builds confidence in the pipeline before promoting it to presubmit. This remains a **future action**.

### 2. Consumer-Driven Contract Testing (Pact-style)

Define formal API contracts using a consumer-driven contract testing tool (e.g., Pact). ROSA and GCP publish their expectations; HyperFleet CI verifies them on every PR.

**Rejected for MVP**: The integration surface between HyperFleet and consumer teams is primarily at the Helm chart / deployment configuration level, not a REST API contract boundary. Consumer-driven contract testing tools are better suited to service-to-service REST contracts. Helm value schema validation is a lighter-weight alternative to investigate post-MVP.

### 3. Automated Rollout to Integration Environments on GA

Trigger automatic deployment of each HyperFleet GA release to ROSA and GCP integration environments via webhooks.

**Rejected for MVP**: ROSA's pipeline takes ~40 minutes per run and requires environment-specific configuration overrides. Automating this safely requires tooling (OCI charts via Konflux, pipeline webhooks) not yet in place. Deferred to post-Q2 2026.

---

## Related Documents

- [HyperFleet Release Process](hyperfleet-release-process.md) — release cadence, branching, artifacts
- [Versioning Trade-offs](versioning-trade-offs.md) — SDK versioning, rollback considerations
- [E2E Testing Framework Spike Report](e2e-testing/e2e-testing-framework-spike-report.md)
- [E2E Run Strategy Spike Report](e2e-testing/e2e-run-strategy-spike-report.md)
- [ROSA — Adding a Component for Pre-merge E2E Testing](https://github.com/openshift-online/rosa-regional-platform/blob/main/docs/adding-component-pre-merge.md) — onboarding guide for the `/test rosa-regionality-compatibility-e2e` Prow trigger
- [ROSA — Testing Strategy Design](https://github.com/openshift-online/rosa-regional-platform/blob/main/docs/design/testing-strategy.md) — three CI workflows (pre-merge, nightly integration, nightly ephemeral)
- GCP-334 — CLM Components Deployment (linked Jira epic)
- HYPERFLEET-633 — Define release contract and integration testing strategy