test(trace-flush-worker): prevent a leaked watchdog os._exit from killing the test run (TD-612)#179
Merged
Merged
Conversation
added 2 commits
June 5, 2026 16:37
…an't kill pytest (TD-612)
The trace-flush-worker starts a daemon thread (flush-watchdog) that calls the
real os._exit(1) on a stall so Docker restarts a wedged worker. Several worker
tests start that daemon via main(). A daemon that outlived its test — e.g. one
bound to a re-imported (stale) module object whose globals a module-local
teardown can no longer signal — later reached its stall deadline and fired
os._exit(1) in-process, hard-killing an already-passing pytest run with no
failing test, assertion, or traceback.
Add a session-wide autouse guard in tests/conftest.py that:
- records-instead-of-exits on the process-wide os._exit, so an orphaned daemon
bound to any module generation (all share the one os module) can no longer
terminate the runner; and
- still joins every flush-watchdog thread after each test and asserts none
leaked, so neutralization never silently masks a leak — a leak fails loudly
instead of killing the process.
Remove the now-subsumed module-scoped teardown from test_trace_flush_worker.py
(the conftest guard covers all modules, including the cross-module victim) and
its orphaned pytest import. Add tests/test_td612_watchdog_isolation.py
reproducing the stale-bound-daemon case and asserting the parent survives.
The watchdog's production stall-restart behavior is unchanged (test-only fix).
…on-scoped (TD-612) Address cycle-1 review findings on the TD-612 test-isolation guard. - Move os._exit neutralization to a session-scoped pytest.MonkeyPatch fixture that owns os._exit continuously. The prior function-scoped patch was undone at each test teardown, leaving the real os._exit live in the gap between tests where a stale-bound flush-watchdog daemon could still hard-kill the runner. A session recorder routes every neutralized exit into a per-test sink, so there is no inter-test window. - Keep the per-test fixture as a fresh sink swap plus the loud join + leak assertion, so neutralizing the exit never masks a thread leak. The direct watchdog tests' own function-scoped mod.os._exit patch still wins during their body and is restored back to the session recorder at teardown. - Replace the fixed-sleep assertion in the second adversarial test with the _wait_for_exit_capture poll helper. - Rename test_following_test_survives_prior_leak to test_armed_watchdog_fires_into_recorder_not_process (it arms a fresh watchdog in its own body rather than carrying a cross-test leak). - Document the teardown's stale-daemon limitation and tighten the contextlib.suppress to only wrap the watchdog-wakeup set() call. No production change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
main's Test Suite intermittently fails: a single Unit Tests (Python 3.x) job exits 1 after every test has passed —[100%]reached, then a bareProcess completed with exit code 1with no failing test, assertion, or traceback. It is non-deterministic (one Python version at a time) and clears on re-run, which masks genuine failures and erodes trust in the gate.Root cause
trace_flush_workerstarts a daemon thread namedflush-watchdogthat calls the realos._exit(1)on a stall, so Docker restarts a wedged worker in production. Several worker tests start that daemon viamain(). The worker module is deleted and re-imported per test, so a daemon that outlives its test can be bound to a superseded module object whose globals a module-scoped teardown can no longer signal. That orphaned daemon later reaches its stall deadline and firesos._exit(1)in-process — hard-killing an already-green pytest run.The previous module-scoped guard could not cover this: the victim is a different test module, and a stale-bound daemon never observes a shutdown signal sent to the current module generation.
Fix (test-only)
A session-wide guard in
tests/conftest.py:os._exitcontinuously for the whole test session (records the call instead of exiting). Because every re-imported module generation shares the oneosmodule, an orphaned daemon bound to any generation is neutralized — and there is no inter-test gap where the realos._exitis live.flush-watchdogthread and asserts none leaked — so neutralizing the exit never silently hides a thread leak; a leak still fails a test loudly, it just can no longer kill the process.The old module-scoped guard is removed (now subsumed). New adversarial tests reproduce the stale-bound-module path directly and assert the runner survives.
No production behavior change —
src/is untouched; the watchdog still calls the realos._exit(1)on a genuine stall in the container. The recorder exists only within the pytest process.Verification
os._exitpatch wins during the test body and is restored to the recorder, never the realos._exit).pytest tests/ --ignore=tests/e2e --ignore=tests/integration -m "not quarantine" -p no:randomly); adversarial isolation tests pass;black/ruff/isortclean.Closes TD-612.