fix(tests): wait for the final event on the tape, not the status row (#256)#257
Conversation
…256) `test_event_tape_replay_after_terminal` (and its same-family siblings `test_resume_from_seq_returns_tail_only` and the synth tape assert) trusted the task status row as a proxy for "the `final` event is on the event tape". But `TaskManager._run` flips the status row terminal *before* appending the `final` event (so a `final`-watching follower that calls `/result` always finds the row terminal). A `wait_task_terminal` wait followed by a `wait=0` tape read therefore races the trailing `progress` event onto the last slot, intermittently failing `events[-1]["type"] == "final"` with `'progress'`. Add a shared `wait_event_tape_final` conftest helper that polls the events endpoint until the tape's last event is `final` — the signal the asserts actually depend on — and route the three affected HTTP-level tests through it. This is the HTTP-layer analogue of the store-level waiters already in `test_task_manager.py` / `test_telemetry_tracing.py`. No product code changes. The helper fails fast on a non-200 from `/events` (so a real 500 isn't masked as a "never reached final" timeout) and pages through `has_more`/`next_from_seq` so the trailing `final` is found past the 1000-event page cap. New `test_event_tape_waiter.py` pins these behaviours on known-bad input. The sibling `test_task_span_marks_cancel_not_error` flake is NOT addressed here: it already waits for the `final` event, so its ~30s timeout is a manager-side cancel race (a different mechanism), left to a separate root-caused change rather than bundling a speculative product fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
Next review available in: 43 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (4)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
What & why
Fixes #256 — the recurring CI flake
tests/server/test_ingest_task.py::test_event_tape_replay_after_terminal(assert events[-1]["type"] == "final"intermittently seeing'progress').Root cause (test-side race).
TaskManager._runflips the task status row to terminal before it appends thefinalevent to the event tape — deliberately, so a follower that seesfinaland immediately calls/resultalways finds the row terminal (the reverse order racestask_not_terminal). The affected tests waited on the status row (wait_task_terminal) and then read the tape withwait=0, so they raced the trailingprogressevent onto the last slot. Await>0long-poll can't substitute: the events endpoint only enters its wait loop when thefrom_seqslice is empty, but these tests readfrom_seq=0where the tape already holds events, so it returns immediately.Fix (test-infra only, no product code). Add a shared
wait_event_tape_finalconftest helper that polls the events endpoint until the tape's last event isfinal— the signal the asserts actually depend on — and route the three affected HTTP-level tests through it (test_event_tape_replay_after_terminal,test_resume_from_seq_returns_tail_only,test_synth_task_emits_per_source_progress_and_final_report). This is the HTTP-layer analogue of the store-level waiters already intest_task_manager.py/test_telemetry_tracing.py.The helper also fails fast on a non-200 from
/events(so a real 500 isn't masked as a misleading "never reached final" timeout) and pages throughhas_more/next_from_seqso the trailingfinalis found past the 1000-event page cap. Newtests/server/test_event_tape_waiter.pypins these on known-bad input.Scope note — the sibling cancel flake is intentionally NOT in this PR
test_task_span_marks_cancel_not_error(~30s terminal-state timeout) already waits for thefinalevent, so its flake is a manager-side cancel race (a different mechanism:manager.cancelfirestoken.cancel()+asyncio_task.cancel()and the cancelledfinalmay not land), not the status-row-proxy race fixed here. It needs a separately root-caused product change — bundling a speculative fix would violate surgical/simplicity discipline. Tracked as follow-up.Verification
tools/check.py: ruff + mypy clean, 2412 passed, 138 skipped.Delivery receipt
Step 3 — in-loop verify:
tools/check.pyPASS (ruff + mypy + 2412 passed). Flake loop 50/50 + 10/10.Step 4 — codex review: round 1, no findings — "did not identify any regressions in the modified tests or helper for their current usage." Quiet at round 1.
Step 5 — fresh-review (clean subagent): PASS, no blocking findings. Independently confirmed the manager ordering, that
wait>0can't substitute, the match to the proven store-level waiters, the sound failure modes (returns on a FAILED task too → caller'ssucceededassert fails clearly; hung task times out with a descriptive error), and that all 3 call-sites preserve their assertions. Rubric: Scope & simplicity = yes; Verification discipline = yes; rest N/A (tests-only).Step 5 — /code-review (xhigh, 26 agents, 14 raw → 3 distinct), triaged:
status_code == 200guard silently skips non-200 → masks a real/events500 as a 10s timeoutlimit=1000page cap → on a >1000-event tapefinalpages off the endhas_more/next_from_seqStep 6 — doc-sync: N/A — test-infra only; no doc/CLI/route/env-var/frontmatter references to update; no CHANGELOG entry (not a published-package behavior change, per the changelog's product-facing convention).
🤖 Generated with Claude Code