Skip to content

replay.t: stop flapping on per-job event interleave#443

Merged
exodist merged 2 commits into2.0from
replay-flake-fix
Apr 26, 2026
Merged

replay.t: stop flapping on per-job event interleave#443
exodist merged 2 commits into2.0from
replay-flake-fix

Conversation

@atoomic
Copy link
Copy Markdown
Collaborator

@atoomic atoomic commented Apr 26, 2026

Summary

t/Yath/integration/replay.t is a flapper. CI hit it on run #24952534922 (linux Perl 5.26): attempt 1 failed, attempt 2 passed.

The test runs yath test, captures the live stdout, then runs yath replay on the saved log archive and asserts both outputs are byte-identical after clean_output normalisation. The live and replay renderers emit the same set of per-job lines but in different orders depending on how the failed job's diagnostics interleave with another job's ( PASSED ) status line:

  • Live (timing-dependent): the failed job's [ FAIL ] / ( DIAG ) / < REASON > lines drain through the renderer pipeline before pass.tx's status line arrives — ( FAILED ) fail.tx → diags → ( PASSED ) pass.tx.
  • Replay: status lines emit first, then all the per-test events trail at the end — ( FAILED ) fail.tx → ( PASSED ) pass.tx → diags.

The previous clean_output only sorted consecutive ( PASSED ) / ( FAILED ) status lines. That fired in the replay case (where the two status lines ARE consecutive) but not in the live case (the [ FAIL ] / ( DIAG ) / < REASON > lines break the run), so the two normalisations ended up in different orders and the equality assertion flapped.

Fix

Trying to attach diag lines back to "their" status block is fragile because the diags carry only job N (already normalised to a sentinel by clean_output) — there is no filename to disambiguate which status they belong to.

The robust normalisation is to canonicalise the order of every job N-prefixed line:

  • Walk the cleaned output line by line.
  • Buffer consecutive job N lines into a run.
  • On hitting a non-job line, sort the run lexicographically and flush, then pass the non-job line through.

Non-job lines (the "The following jobs failed:" table, the "Yath Result Summary" block, etc.) keep their position so the file's overall narrative survives.

Test plan

  • 20/20 back-to-back local runs of yath -D test t/Yath/integration/replay.t pass.
  • A simulated diff of the exact CI failure case (live with interleaved diags vs replay with status-first) normalises to identical strings under the new algorithm.
  • Watch the next CI run on this branch to confirm the flake is gone.

🤖 Generated with Claude Code

Copy link
Copy Markdown
Contributor

@troglodyne troglodyne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably could have been done in less lines with a more clever map, but not worth the effort to golf it down, especially since it is a test.

May not be the right fix if you think the ordering is in fact important here @exodist

Comment thread t/Yath/integration/replay.t
Comment thread t/Yath/integration/replay.t Outdated
troglodyne pushed a commit to troglodyne/Test2-Harness that referenced this pull request Apr 26, 2026
PR Test-More#437 brought back a live replay.t on 2.0. Our branch still had
plan skip_all, causing the GitHub merge commit to pick up upstream's
unpatched version, which fails on Perl 5.24/5.28 due to event ordering
differences between live and replay renderers.

Update replay.t to match the upstream 2.0 version exactly, then apply
PR Test-More#443's fix: sort ALL consecutive `job N`-prefixed lines (not just
PASSED/FAILED status lines). Diag/Reason/Fail lines also carry `job N`
and can interleave differently between live and replay, so the narrower
sort was insufficient.

Also adds the missing `Wrote archive:` strip and `job N` normalisation
that upstream 2.0 introduced in PR Test-More#437.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Instead of running yath test live, capturing non-deterministic output,
and sorting job lines to paper over ordering differences, commit a
pre-generated run.yath archive and an expected_output.txt golden file.

The test now replays only the committed archive. Replay is deterministic
(events come out in archive order), so no sorting normalization is
needed. The clean_output helper gains absolute-path stripping so the
golden file is portable across machines.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a cruft?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope: looks like used later

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: this is fragile if we decide to update the test format
we will not notice the regression here

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I generated saved archive generated from 1.0 branch

If we change format of output and it doesn't break old replays then we probably haven't deduplicated enough code in the default renderer. Both go through the OutputManager so I don't think this should be the case.

If we suddenly replace [ PASSED ] with << Passed >> or whatever that should break due to us saving the "golden" expected text.

Are you saying that if we suddenly introduce new event types that these won't be covered? Yea sure I guess so, though if we drop event types via filters or something then the test will fail as the filters are chained together to the renderer via the OutputManager, so I doubt dropping event types we already expect here would result in anything other than a failure.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A way we could put a stop to this is a test that doesn't care about order but instead generates every possible event type then replays them. Then it's just iterating over lines populating a %seen hash then asserting on keys after the replay.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can still improve that replay test later as needed, for now it s kind of blocking other PRs
this is fine to use it that way I reverted cruft previously pushed there to restore your state

The previous path normalization only matched the diag line
shape:

    s{ at \S+/(t/\S+) line }{ at $1 line }g

That works for the one place where a recorded archive's
absolute path leaks through into the rendered output today
(the `(  DIAG  )  job N    at <abs>/t/... line N` line, where
Test2's `like()` failure prints the file from caller's
__FILE__ at runtime). It does not cover other contexts where
a future renderer change might surface an absolute path -- the
failure summary table cells, status lines, etc. -- forcing us
to extend the regex case-by-case every time the renderer
grows a new path-bearing line type.

Generalize: strip the absolute-path prefix from any
repo-rooted `t/...` reference anywhere in the output. The
pattern walks one or more `/segment` components followed by
`/(t/...)`, replaces with just the captured `t/...`, and stops
at whitespace or `|` so table-cell paths normalise too without
crossing cell boundaries.

Verified against the original failure shape (diag line),
table-cell paths, and status-line paths -- all six test
inputs normalise correctly. 10/10 back-to-back local runs
pass. CI for the current branch SHA (3ba86fc) is green;
this change is forward-looking robustness, not a fix for an
active failure.

Co-Authored-By: Claude <noreply@anthropic.com>
@exodist exodist merged commit eb7010c into 2.0 Apr 26, 2026
20 of 34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants