Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions skills/argos-cli/agents/openai.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
interface:
display_name: "Argos CLI"
short_description: "Use the Argos CLI safely and correctly"
default_prompt: "Use $argos-cli to run or interpret Argos CLI commands in this repository."
219 changes: 219 additions & 0 deletions skills/argos-pr-review/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,219 @@
---
name: argos-pr-review
description: >
Use Argos visual regression builds as one input to a complete pull request
review. Use this skill when a PR includes an Argos build URL, an Argos status
check, or a bot comment linking to an Argos build, and you need to determine
whether the visual diffs match the developer's likely intent based on the PR
context, code changes, and screenshots before approving the PR.
---

# Argos PR Review

Treat the Argos build as one input to the PR review, not as a separate task and
not as the sole source of truth.

## When To Use This Skill

- A PR includes an Argos build link.
- A PR has an Argos check with `changes-detected`.
- An Argos bot comment says screenshots are waiting for review.

## Review Principles

- Start from the PR context, not from the screenshots alone.
- Infer likely developer intent from the PR title, description, linked issue,
branch name, commit messages, and code diff before judging the visual result.
- Use Argos to confirm whether the rendered outcome matches that intent.
- If the code suggests one intent and the screenshots show another result, treat
that mismatch as an important review signal.
- Do not let a clean screenshot override a code-level concern, or let a strange
screenshot override clear evidence that the change is intentional.

## Review Workflow

### 1. Inspect the PR context first

Before opening snapshot diffs, inspect:

- the PR title and description
- any linked issue or ticket context
- the branch name and commit messages when useful
- the code diff, especially the UI and test files related to the changed area

Form a working hypothesis for the intended user-facing change before relying on
the screenshots.

### 2. Inspect the build status

Use the Argos CLI with the build URL or build number.

```bash
argos build get <buildReference> --json
```

Use the build status to decide the next review step:

| Status | Review meaning | Next step |
| ------------------ | ------------------------------------------- | ---------------------------------------------- |
| `accepted` | Diffs already approved | No visual review needed |
| `no-changes` | No visual diff against baseline | No visual review needed |
| `pending` | Build is not ready yet | Wait and check again |
| `progress` | Comparison is still running | Wait and check again |
| `changes-detected` | Screenshots need a decision | Fetch snapshots and inspect them |
| `rejected` | A prior decision already rejected the build | Do not approve without understanding why |
| `error` | Build processing failed | Do not approve until the failure is understood |
| `aborted` | Build did not complete normally | Do not approve until the reason is understood |
| `expired` | Build was not completed in time | Do not approve until a valid build exists |

### 3. Fetch only snapshots that need review

```bash
argos build snapshots <buildReference> --needs-review --json
```

For each diff, inspect:

- the diff mask `url`
- the previous screenshot `base.url`
- the new screenshot `head.url`
- metadata in `head.metadata`

### 4. Compare screenshots against intended change

For each changed snapshot, ask:

- does the screenshot match the UI change implied by the PR and code diff?
- is the changed area the one you would expect from the files touched?
- does the visual result reveal a regression the code review alone would not catch?
- does the screenshot contradict the claimed intent in the PR description or
linked ticket?

### 5. Decide whether the change is intentional, broken, or flaky

- Intentional change: the new screenshot matches the code change and the UI is stable.
- Regression: layout break, wrong state, missing content, wrong theme, clipped content, incorrect data, or an unexpected removed snapshot.
- Flaky capture: loading indicator, partially rendered content, animation frame, dynamic content drift, or the same transient state captured in multiple browsers.

### 6. Always check for flake signals

Pay special attention to:

- `head.metadata.test.retry > 0`
- `head.metadata.test.retries` only as context for how many retries the test is allowed to use
- a visible loader or spinner in the screenshot
- partially loaded data
- identical scores across browsers
- identical `head.url` values across browsers

If the page is clearly captured mid-load, report it as flaky even if the branch name sounds related to loading or data changes.

### 7. Report the result in the PR review

- If everything looks intentional, say the visual result appears consistent with the PR intent.
- If you find a regression, call it out with the Argos build URL, the affected snapshot names, and the mismatch between the intended change and the rendered result.
- If you find flakiness, explain the signal and recommend a stabilization fix before approval.
- When relevant, mention both the code evidence and the screenshot evidence in the same finding.

Use this format for flaky captures:

> `snapshot-name` looks flaky.
> Test: `test.title` in `test.location.file:test.location.line`
> Signal: loader visible / retry > 0 / identical head image across browsers / ...
> Fix: wait for the page to finish loading. If the SDK uses `waitForAriaBusy`, mark loading UI with `aria-busy="true"` so Argos waits for it to clear
> Review: `<snapshot url>`

## What To Inspect In Snapshot Data

For each snapshot diff, inspect:

- `name`: human-readable snapshot identifier
- `status`: usually `changed`, `added`, or `removed`
- `score`: rough diff magnitude
- `url`: diff mask / review overlay
- `base.url`: previous screenshot
- `head.url`: new screenshot
- `head.metadata.test`: test file, line, title, `retry`, `retries`
- `head.metadata.browser`: browser information

## Review Heuristics

### Expected changes

These are usually safe when they match the PR:

- intentional content updates
- deliberate theme or layout redesigns
- newly added screens
- removed screenshots that correspond to intentionally deleted pages or tests

### Regressions

Flag the review when you see:

- missing content
- wrong empty state
- unexpected loader
- broken layout or overlap
- text clipping
- wrong route or wrong page captured
- mismatched theme or styling
- removed snapshots without an intentional test removal

### Flaky captures

Treat the snapshot as likely flaky when you see:

- a spinner, skeleton, or loading state
- async content not yet rendered
- animation mid-transition
- dynamic values drifting between runs
- `retry > 0`
- identical `score` across multiple browsers
- identical `head.url` across multiple browsers

Identical `head.url` values across browsers are a particularly strong signal that both browsers captured the same transient broken state.
`retry` is the current retry count for this run. `retries` is only the configured
retry budget, so do not treat `retries > 0` alone as evidence of flakiness.

## Typical Fixes

- Loading UI visible:
wait for settled content before capture; if the SDK uses `waitForAriaBusy`, add `aria-busy="true"` to the loading container so Argos waits for it to clear
- Test captures too early:
wait for content that depends on the async data before calling the screenshot helper
- Dynamic values:
freeze the data or mask the unstable element
- JS animation:
stabilize or remove the animated element from the capture

## Suggested PR Review Wording

Use short, actionable language.

Flaky example:

```text
The Argos build shows a flaky capture rather than an intentional visual change.
The screenshot is taken before async data has loaded, so the page is captured in
an intermediate state (spinner visible / empty values / incomplete content).
Please wait for settled content before taking the screenshot, or mark the
loading UI with aria-busy="true" if your SDK is configured to wait for it.
```

Intent mismatch example:

```text
The code changes suggest the intent is to update X, but the Argos snapshots show
Y instead. This looks like an unintended visual regression rather than the
expected UI outcome of the PR.
```

## Tooling

- Prefer the Argos CLI for build metadata and snapshot enumeration.
- Use `--json` when parsing CLI output.
- Read the PR title, description, linked context, and code diff before relying on screenshots.
- Use the PR diff and test code to confirm whether the visual change matches the code.
- Load [references/baseline.md](references/baseline.md) when the review depends on baseline selection or orphan build semantics.
- Load [references/flaky-fixes.md](references/flaky-fixes.md) when you need concrete remediation examples for flaky captures.
4 changes: 4 additions & 0 deletions skills/argos-pr-review/agents/openai.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
interface:
display_name: "Argos PR Review"
short_description: "Review Argos builds as part of PR review"
default_prompt: "Use $argos-pr-review to review this pull request when an Argos build is available."
73 changes: 73 additions & 0 deletions skills/argos-pr-review/references/baseline.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
# Baseline And Orphan Build Notes

Load this file when an Argos PR review depends on baseline selection, orphan
build semantics, or how approval affects future comparisons.

---

## Baseline Build

The baseline is the screenshot Argos compares the new build against.

**Approval vs. baseline:** approving a build makes it a candidate for future
baseline selection. It does not directly overwrite the current baseline. The
baseline is selected dynamically each time a new build runs.

Do not say:

```text
Approving this build will update the baseline.
```

Say instead:

```text
Approving this build would make it eligible to be selected as a baseline for future builds on this branch.
```

## Baseline Selection Rules

Argos picks the most recent build satisfying all of:

1. Same build name as the triggered build
2. All framework tests passed
3. Not a subset build
4. Status is auto-approved, manually approved, or orphan
5. Its commit is an ancestor of the merge base between the new build's commit
and the baseline branch

## Baseline Branch

- Pull requests: the PR base branch
- Push events: the project's configured default baseline branch, usually `main`

## Baseline Overrides

- `ARGOS_REFERENCE_BRANCH`: force a specific branch as baseline
- `ARGOS_REFERENCE_COMMIT`: pin to a specific commit's build

## CI Caveat

If CI does not run on the default branch, the baseline becomes stale and diffs
accumulate. Keep Argos running on the default branch too.

## Orphan Builds

An orphan build has no prior build to compare against, so all snapshots appear
as `added` with `base: null`.

This is expected for:

- the first build in a project
- a branch that has no approved ancestor build yet

The absence of a baseline is not itself a regression signal.

## Review Guidance

When reviewing an orphan build:

- inspect screenshots for obvious layout or rendering problems
- do not treat every `added` snapshot as suspicious just because `base` is null
- approve the build if the screenshots look correct and it is intended to seed a
future baseline
Loading
Loading