Skip to content

test(trace-flush-worker): prevent a leaked watchdog os._exit from killing the test run (TD-612)#179

Merged
Hidden-History merged 2 commits into
mainfrom
fix/td612-watchdog-test-isolation
Jun 6, 2026
Merged

test(trace-flush-worker): prevent a leaked watchdog os._exit from killing the test run (TD-612)#179
Hidden-History merged 2 commits into
mainfrom
fix/td612-watchdog-test-isolation

Conversation

@Hidden-History
Copy link
Copy Markdown
Owner

Problem

main's Test Suite intermittently fails: a single Unit Tests (Python 3.x) job exits 1 after every test has passed[100%] reached, then a bare Process completed with exit code 1 with no failing test, assertion, or traceback. It is non-deterministic (one Python version at a time) and clears on re-run, which masks genuine failures and erodes trust in the gate.

Root cause

trace_flush_worker starts a daemon thread named flush-watchdog that calls the real os._exit(1) on a stall, so Docker restarts a wedged worker in production. Several worker tests start that daemon via main(). The worker module is deleted and re-imported per test, so a daemon that outlives its test can be bound to a superseded module object whose globals a module-scoped teardown can no longer signal. That orphaned daemon later reaches its stall deadline and fires os._exit(1) in-process — hard-killing an already-green pytest run.

The previous module-scoped guard could not cover this: the victim is a different test module, and a stale-bound daemon never observes a shutdown signal sent to the current module generation.

Fix (test-only)

A session-wide guard in tests/conftest.py:

  • A session-scoped autouse fixture owns os._exit continuously for the whole test session (records the call instead of exiting). Because every re-imported module generation shares the one os module, an orphaned daemon bound to any generation is neutralized — and there is no inter-test gap where the real os._exit is live.
  • A per-test fixture points that recorder at a fresh per-test sink and, after each test, joins every flush-watchdog thread and asserts none leaked — so neutralizing the exit never silently hides a thread leak; a leak still fails a test loudly, it just can no longer kill the process.

The old module-scoped guard is removed (now subsumed). New adversarial tests reproduce the stale-bound-module path directly and assert the runner survives.

No production behavior changesrc/ is untouched; the watchdog still calls the real os._exit(1) on a genuine stall in the container. The recorder exists only within the pytest process.

Verification

  • The watchdog exit-assertion tests still pass (their local os._exit patch wins during the test body and is restored to the recorder, never the real os._exit).
  • 10/10 consecutive clean runs of the CI unit command (pytest tests/ --ignore=tests/e2e --ignore=tests/integration -m "not quarantine" -p no:randomly); adversarial isolation tests pass; black/ruff/isort clean.

Closes TD-612.

WB Solutions added 2 commits June 5, 2026 16:37
…an't kill pytest (TD-612)

The trace-flush-worker starts a daemon thread (flush-watchdog) that calls the
real os._exit(1) on a stall so Docker restarts a wedged worker. Several worker
tests start that daemon via main(). A daemon that outlived its test — e.g. one
bound to a re-imported (stale) module object whose globals a module-local
teardown can no longer signal — later reached its stall deadline and fired
os._exit(1) in-process, hard-killing an already-passing pytest run with no
failing test, assertion, or traceback.

Add a session-wide autouse guard in tests/conftest.py that:
  - records-instead-of-exits on the process-wide os._exit, so an orphaned daemon
    bound to any module generation (all share the one os module) can no longer
    terminate the runner; and
  - still joins every flush-watchdog thread after each test and asserts none
    leaked, so neutralization never silently masks a leak — a leak fails loudly
    instead of killing the process.

Remove the now-subsumed module-scoped teardown from test_trace_flush_worker.py
(the conftest guard covers all modules, including the cross-module victim) and
its orphaned pytest import. Add tests/test_td612_watchdog_isolation.py
reproducing the stale-bound-daemon case and asserting the parent survives.

The watchdog's production stall-restart behavior is unchanged (test-only fix).
…on-scoped (TD-612)

Address cycle-1 review findings on the TD-612 test-isolation guard.

- Move os._exit neutralization to a session-scoped pytest.MonkeyPatch fixture
  that owns os._exit continuously. The prior function-scoped patch was undone
  at each test teardown, leaving the real os._exit live in the gap between
  tests where a stale-bound flush-watchdog daemon could still hard-kill the
  runner. A session recorder routes every neutralized exit into a per-test
  sink, so there is no inter-test window.
- Keep the per-test fixture as a fresh sink swap plus the loud join + leak
  assertion, so neutralizing the exit never masks a thread leak. The direct
  watchdog tests' own function-scoped mod.os._exit patch still wins during
  their body and is restored back to the session recorder at teardown.
- Replace the fixed-sleep assertion in the second adversarial test with the
  _wait_for_exit_capture poll helper.
- Rename test_following_test_survives_prior_leak to
  test_armed_watchdog_fires_into_recorder_not_process (it arms a fresh
  watchdog in its own body rather than carrying a cross-test leak).
- Document the teardown's stale-daemon limitation and tighten the
  contextlib.suppress to only wrap the watchdog-wakeup set() call.

No production change.
@Hidden-History Hidden-History merged commit e6dee02 into main Jun 6, 2026
17 checks passed
@Hidden-History Hidden-History deleted the fix/td612-watchdog-test-isolation branch June 6, 2026 02:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant