[Test] Drop taichi xdist fork, use stock pytest-xdist#556
[Test] Drop taichi xdist fork, use stock pytest-xdist#556hughperkins wants to merge 14 commits intomainfrom
Conversation
The taichi fork of pytest-xdist used os._exit(0) in its retire() method, which killed workers before the execnet channel could flush the test failure report. This made all error messages invisible when running with multiple threads. Stock xdist >= 3.4 already handles worker crashes correctly: it preserves the failure report, restarts the worker, and displays full error details in the terminal summary. The conftest hook now just calls os._exit(1) after a failure, and stock xdist does the rest. Changes: - Replace taichi xdist fork with stock pytest-xdist >= 3.7 + pytest-timeout - Simplify conftest pytest_runtest_logreport hook (no fork-specific APIs) - Remove requirements_test_xdist.txt and all CI references to it - Remove pytest_rerunfailures.works_with_current_xdist version hack - Replace pytest_hardtle with pytest-timeout in run_tests.py
There was a problem hiding this comment.
Code review is billed via overage credits. To resume reviews, an organization admin can raise the monthly limit at claude.ai/admin-settings/claude-code.
Once credits are available, reopen this pull request to trigger a review.
|
@claude review |
…porting-2 Made-with: Cursor # Conflicts: # tests/python/conftest.py
Stock pytest-timeout uses Python-level signals that can't interrupt native CUDA/HIP kernel hangs. pytest_hardtle compiles a C watchdog via CFFI that calls _exit(1) from a signal handler, which always works. Restore `-p pytest_hardtle` (with `-p no:timeout` to suppress stock pytest-timeout) and drop pytest-timeout from pyproject.toml.
Documents why we use a CFFI-based hard-kill timeout instead of stock pytest-timeout, and why the two cannot be loaded simultaneously.
- Remove always-true `hasattr(pytest, "version_tuple")` guard in conftest.py (pytest >= 7.0 is guaranteed by dependency constraints). - Add comments in run_tests.py explaining why we suppress stock pytest-timeout and why pytest_hardtle is preferred (CFFI-based native signal handler can kill GIL-held native GPU hangs).
|
@claude review |
|
Opus writes: Closing in favor of the simpler fix in #557Claude's review correctly identifies two regressions with the
Regression 2 is fixable with The simpler approach is to stay on the Taichi fork (which has |
Stock xdist treats os._exit(1) as a worker crash, which causes: 1. A synthetic "worker crashed" report duplicating the real failure 2. Each exit counting toward --max-worker-restart, eventually shutting down the session and silently dropping remaining tests Fix both by: - Adding a pytest_handlecrashitem hook (firstresult=True) that marks the synthetic crash report as passed, since the real report was already sent before the worker exited - Setting --max-worker-restart=999999 so intentional worker kills don't trigger xdist's crash-cap shutdown
Uses pytester to verify in a subprocess that: - pytest_handlecrashitem suppresses synthetic "worker crashed" duplicates (failures counted exactly once, not doubled) - --max-worker-restart=999999 prevents premature session shutdown when many tests fail and workers are intentionally killed Also fix pytest_handlecrashitem to set report.when="teardown" so the suppressed crash report doesn't inflate the passed count.
Made-with: Cursor
…om genuine crashes The previous implementation blindly suppressed all crash reports, causing: 1. Double-counted failures (1 failed + 1 passed for the same test) 2. Genuine crashes (segfaults, OOM) silently marked as passed Use a marker-file protocol: workers write a marker before os._exit(1), and the controller only suppresses crash reports when a marker is found. A wrapper hookimpl on pytest_runtest_logreport swallows the suppressed synthetic report so it never reaches the terminal reporter or stats.
…iscovery CI runs pytest rooted at tests/python/ via run_tests.py, so tests placed as siblings of that directory are never collected.
The top-level import at line 3 already covers all usages.
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🔴
tests/test_xdist_worker_retirement.py:32-54— The new regression tests intests/test_xdist_worker_retirement.pywill self-destruct when run under outer pytest-xdist (i.e.run_tests.py's-n threads --dist=worksteal, the standard CI configuration).pytester.runpytest_subprocesspropagatesPYTEST_XDIST_WORKERfrom the parent (pytester.popendoesenv = os.environ.copy()), and the inline conftest's hook (lines 40-46) gatesos._exit(1)purely on that env var with no check that the current process is actually a worker — so the inner pytest controller sees the inherited env var and kills itself on the first failed report from an inner worker, before pytest emits its terminal summary, producingValueError: Pytest terminal summary report not foundfromparseoutcomes(). Currently latent because the file is not collected byrun_tests.py(separate prior comment), but resolving that comment by moving the file intotests/python/immediately exposes this; the same fragile gate lives in the production hook attests/python/conftest.py:124. Fix:monkeypatch.delenv("PYTEST_XDIST_WORKER", raising=False)in thexdist_projectfixture beforerunpytest_subprocess, and/or also gate the conftest hook onhasattr(config, 'workerinput')rather than the env var alone.Extended reasoning...
What the bug is
The regression tests added by this PR run pytest-in-pytest via
pytester.runpytest_subprocess(...). Inside_pytest/pytester.py,Pytester.popendoesenv = os.environ.copy()(line 1360 of pytester.py in the installed pytest 8.x), with no scrubbing of xdist-specific variables. So every env var present in the parent process — includingPYTEST_XDIST_WORKER, set by xdist on every worker — is propagated verbatim into the inner pytest subprocess that pytester spawns.The inline conftest the fixture writes (test_xdist_worker_retirement.py:40-46) installs a
trylastpytest_runtest_logreporthook that gatesos._exit(1)on a single condition:if not os.environ.get("PYTEST_XDIST_WORKER"): return
There is no check distinguishing "this process is itself an xdist worker" from "this process inherited the env var from a worker parent." A controller running under xdist does not have
PYTEST_XDIST_WORKERset by xdist; xdist only sets it on workers (via execnet remote setup). But the inner controller in pytester inherits the value from the outer worker that pytester is running in, and the inline hook then misclassifies the inner controller as a worker.The code path that triggers it
run_tests.pyinvokes pytest with-n threads --dist=worksteal(the default CI path; lines 95-100 of the post-PR run_tests.py).- xdist spawns worker processes; each has
PYTEST_XDIST_WORKER=gw0(etc.) in its env. - A worker collects and runs
test_xdist_worker_retirement.py::TestNoDuplicateFailures::test_single_failure_counted_once. - The test calls
xdist_project.runpytest_subprocess("-n", "2", "--dist=worksteal", ...). - pytester.popen does
env = os.environ.copy()—PYTEST_XDIST_WORKER=gw0is in the inner subprocess's env. - The inner pytest controller starts, loads the inline conftest. Its env still has
PYTEST_XDIST_WORKER. - xdist on the inner controller spawns 2 inner workers (gw0', gw1') for
-n 2. - An inner worker runs
test_fail()which raises AssertionError, sends afailedtestreport to the inner controller via execnet. - xdist's controller-side dispatch (
DSession.worker_testreport) callsconfig.hook.pytest_runtest_logreport(report=rep)on the inner controller. - The inline conftest's trylast hook fires on the inner controller, sees
PYTEST_XDIST_WORKERset, and callsos._exit(1). - The inner controller dies mid-session, before pytest's terminal summary finalizer runs.
- The outer pytester reads the truncated stdout;
RunResult.parseoutcomes()does not find aX failed in Yssummary line and raisesValueError("Pytest terminal summary report not found"). - The outer test errors out (not even fails — errors during result parsing).
Why existing safeguards do not prevent this
- The inline conftest has no controller-vs-worker distinction; only the env var.
- pytester's
runpytest_subprocessaccepts no parameter to scrub specific env vars; the test would have to monkeypatch them out before the call, or build a custom env. - The
pytest_handlecrashitemhook on the inner controller (lines 49-52) only mutates the synthetic crash report and does not affect thepytest_runtest_logreportpath. - The PR's own
--max-worker-restart=999999setting is on the outer invocation; it does not propagate into the inner controller's setup.
Step-by-step empirical proof
Setting
PYTEST_XDIST_WORKERin the parent shell mimics what an outer xdist worker would do:$ PYTEST_XDIST_WORKER=gw0 python -m pytest tests/test_xdist_worker_retirement.py::TestNoDuplicateFailures::test_single_failure_counted_once -v ... > raise ValueError("Pytest terminal summary report not found") E ValueError: Pytest terminal summary report not foundWithout the env var:
1 passed. With it:ERRORfrom parseoutcomes. Same failure mode underpytest -n 2 --dist=workstealdirectly — all three regression tests in the file error out, because each spawns a pytester subprocess that inherits the outer worker's env.Impact
The regression tests added by this PR cannot run under the very environment they are supposed to be regressing. As soon as the prior comment about file collection is addressed (by moving the file into
tests/python/), CI will fail: every invocation ofrun_tests.pyruns under-n threads, every test runs inside a worker, every pytester subprocess inheritsPYTEST_XDIST_WORKER, and all three regression tests will error out atparseoutcomes(). The PR's claimed regression guard becomes a CI-blocking failure.The same fragile gate exists in the production hook at
tests/python/conftest.py:124— no test in the suite spawns a pytest subprocess today, so it is latent there, but any future test that does will hit the same trap.How to fix
Two complementary fixes; (1) is the minimum, (2) is more defensive:
-
In the fixture — scrub the env var before
runpytest_subprocess:@pytest.fixture def xdist_project(pytester, monkeypatch): monkeypatch.delenv("PYTEST_XDIST_WORKER", raising=False) pytester.makeconftest(...) return pytester
Or pass an explicit
env=argument when callingrunpytest_subprocess(or wrap the popen). -
In the conftest hook — additionally check that this process is actually serving as a pytest-xdist worker (e.g. by gating on
hasattr(config, 'workerinput')from a session-scoped fixture or by storing the worker status in a module-level flag set inpytest_configure). This protects any future test that spawns pytest subprocesses, including the production hook at tests/python/conftest.py:124.
scrub PYTEST_XDIST_WORKER in pytester fixture to prevent env var leak The linux.yml test-cuda job still referenced the deleted requirements_test_xdist.txt. Also, pytester propagates the outer worker's PYTEST_XDIST_WORKER into inner subprocesses, causing the inner controller to misidentify itself as a worker and os._exit(1).
|
|
||
| # Don't call interactor.retire() — it uses os._exit(0) which kills | ||
| # the process before execnet's IO thread can flush the channel buffer. | ||
| # The test failure report (queued by xdist's own hook, which ran before | ||
| # this trylast hook) would be lost, hiding all error messages. | ||
| interactor.sendevent("workerretire", layoff=layoff) | ||
| time.sleep(0.2) | ||
| os._exit(0) | ||
| @pytest.hookimpl(wrapper=True, tryfirst=True) | ||
| def pytest_runtest_logreport(report): | ||
| """Handle xdist worker retirement and crash-report suppression. | ||
|
|
||
| On the controller: swallow synthetic crash reports that were already marked for suppression by | ||
| pytest_handlecrashitem. | ||
|
|
||
| import importlib | ||
| import sys | ||
| On workers: after a test failure, write an intentional-exit marker and kill the process so it | ||
| restarts with clean GPU state. The real test report is sent by inner hooks (including xdist's | ||
| report-forwarding hook) during ``yield`` before we exit. | ||
| """ | ||
| if getattr(report, "_qd_suppress", False): | ||
| return None | ||
|
|
||
| import pytest | ||
| result = yield | ||
|
|
||
| if os.environ.get("PYTEST_XDIST_WORKER") and report.outcome in ("rerun", "error", "failed"): | ||
| d = _exit_marker_dir() | ||
| if d: | ||
| worker_id = os.environ["PYTEST_XDIST_WORKER"] | ||
| try: | ||
| with open(os.path.join(d, worker_id), "w") as f: | ||
| f.write(report.nodeid) | ||
| except OSError: | ||
| pass | ||
| os._exit(1) | ||
|
|
||
| return result |
There was a problem hiding this comment.
🔴 The wrapper hook at tests/python/conftest.py:146-173 (@pytest.hookimpl(wrapper=True, tryfirst=True)) has if getattr(report, "_qd_suppress", False): return None at line 158, BEFORE the result = yield at line 160 — pluggy treats any function containing yield as a generator, so the early return raises StopIteration on next(), which pluggy converts to RuntimeError("wrap_controller at pytest_runtest_logreport ... did not yield"). This fires every time a worker dies via os._exit(1) after a test failure: pytest_handlecrashitem at line 176 sets report._qd_suppress=True and then xdist/dsession.py:452 unconditionally re-invokes pytest_runtest_logreport(report=rep), which enters the wrapper, hits the early return, and raises. The RuntimeError surfaces as INTERNALERROR and aborts the test loop, silently dropping the remaining tests — strictly worse than the failure-visibility issue this PR set out to fix. Fix: move the _qd_suppress check after the yield (yield first, then discard the result), or convert the hook to a non-wrapper @pytest.hookimpl(tryfirst=True) that mutates the report in place.
Extended reasoning...
What the bug is
The hook at tests/python/conftest.py:146-173:
@pytest.hookimpl(wrapper=True, tryfirst=True)
def pytest_runtest_logreport(report):
if getattr(report, "_qd_suppress", False):
return None # <-- early return BEFORE yield
result = yield
if os.environ.get("PYTEST_XDIST_WORKER") and report.outcome in ("rerun", "error", "failed"):
...
os._exit(1)
return resultA wrapper=True hookimpl whose body contains yield is a generator. Pluggy drives it via next() to run the pre-yield code, calls .send(value) to resume after the inner hooks, and expects the generator to either return after the yield (StopIteration with a value) or raise. If the generator returns before yielding, the very first next() raises StopIteration; pluggy detects this in _callers.py and re-raises as RuntimeError: wrap_controller at pytest_runtest_logreport ... did not yield. That RuntimeError surfaces as INTERNALERROR in pytest_runtestloop, which aborts the entire session.
How the bad path is triggered
The PR introduces two cooperating pieces:
- The wrapper hook above — fires on every
pytest_runtest_logreportinvocation. pytest_handlecrashitemat conftest.py:176-200 — setsreport._qd_suppress = Trueon the synthetic crash report and returns truthy.
The trigger chain (verified against xdist/dsession.py from pytest-xdist 3.x):
- Worker fails a test → real
TestReportis sent over the channel, conftest wrapper runs cleanly to itsos._exit(1). - Controller detects
errordown, callshandle_crashitem(crashitem, errors). handle_crashiteminvokes the firstresult hookpytest_handlecrashitem. The PRs impl setsreport._qd_suppress = Trueand returns True.- The very next line in
dsession.py:452unconditionally callsself.config.hook.pytest_runtest_logreport(report=rep)on the same synthetic report. - Pluggy enters the PRs wrapper hook with
tryfirst=True. Body sees_qd_suppressflag →return Noneon line 158. - The function never reaches
yieldon line 160. Pluggy raisesRuntimeError: ... did not yield. - The error propagates out of
_pytest_runtest_logreport→ INTERNALERROR →pytest_runtestloopaborts.
Why the existing code does not prevent it
dsession.py:452has no opt-out path: thepytest_runtest_logreportcall is unconditional after the firstresult hook returns.- The
firstresult=Truedecorator onpytest_handlecrashitemonly short-circuits that hook chain; it does not gate the unrelatedlogreportcall on the next line. - pluggys wrapper protocol has no "yielded zero times" mode — wrappers must yield exactly once.
Impact
Every xdist run with a worker-dying-after-failure (which is the central path of this PRs design) hits the bug on the first such failure and aborts the test loop with INTERNALERROR. With N>=2 failing tests under -n 2 --dist=worksteal --max-worker-restart=999999, only the first one or two failures are reported before the loop dies; remaining tests are silently dropped. This is strictly worse than the original PR-author-acknowledged double-counting issue: instead of a noisy 2x in the summary, real test failures vanish.
It also means the new regression tests at tests/python/test_xdist_worker_retirement.py::TestNoDuplicateFailures::test_multiple_failures_counted_correctly (expects failed=4) and TestSessionCompletesWithManyFailures::test_no_premature_shutdown (expects failed=20) will themselves fail when actually run — the conftest they install via pytester.makeconftest has the identical wrapper-without-pre-yield check and will INTERNALERROR before reaching their expected counts.
Step-by-step concrete proof
Setup: pytest 9.x + pytest-xdist 3.8+ + pluggy 1.6, exact PR conftest, three tests of which two always fail, run with -n 2 --dist=worksteal --max-worker-restart=999999.
- gw0 runs
test_fail_1, asserts False → real failed report sent to controller. - Conftest wrapper resumes after
yield, takes theoutcome in ("rerun","error","failed")branch, writes the marker, callsos._exit(1). gw0 dies. - Controller emits
FAILED test_fail_1from the real report. Scheduleserrordown. handle_crashitem(node="gw0", crashitem="test_fail_1")runs:- Synthesizes
TestReport(outcome="failed", when="???", longrepr="worker gw0 crashed while running test_fail_1"). - Calls
pytest_handlecrashitem(...)→ PR impl reads marker, setsreport._qd_suppress=True, returns True. - Next line:
self.config.hook.pytest_runtest_logreport(report=rep).
- Synthesizes
- Pluggy invokes PR wrapper with
tryfirst=True:- First statement:
if getattr(report, "_qd_suppress", False): return None. - Function is a generator (because of
yieldfurther down) →next(gen)raisesStopIteration. - Pluggy
_callers.pydetects this and calls_raise_wrapfail(gen, "did not yield")→RuntimeError.
- First statement:
- RuntimeError propagates up
_multicall→pytest_runtestloopcatches and reportsINTERNALERROR> RuntimeError: wrap_controller at pytest_runtest_logreport ... did not yield. - Test loop aborts.
test_fail_2(and any further failing test) is never reported. Summary shows1 failed in X.Xsdespite multiple real failures pending.
This was independently reproduced by three verifiers using the exact PR code with stock pytest+xdist+pluggy.
How to fix
Two clean options:
-
Yield first, gate the post-yield logic on the flag — preserves wrapper semantics and still suppresses the synthetic report:
@pytest.hookimpl(wrapper=True, tryfirst=True) def pytest_runtest_logreport(report): result = yield if getattr(report, "_qd_suppress", False): return None if os.environ.get("PYTEST_XDIST_WORKER") and report.outcome in ("rerun", "error", "failed"): ... os._exit(1) return result
Note this still lets inner hooks observe the synthetic report; for stronger suppression, mutate the report in
pytest_handlecrashitem(e.g.report.outcome="passed"; report.when="teardown") instead, which the existing terminal categorisation will collapse to the""bucket. -
Drop the wrapper entirely — use a non-wrapper
@pytest.hookimpl(tryfirst=True)that mutates the report (the suppression-via-mutation pattern an earlier review iteration on this PR already validated empirically withreport.when = "teardown").
The same identical bug exists in the embedded conftest at tests/python/test_xdist_worker_retirement.py:53-68 (written via pytester.makeconftest), so any fix needs to be applied there too — those are the regression tests for this exact mechanism, and they currently encode the broken pattern.
| def pytest_unconfigure(config): | ||
| """Clean up the marker directory at session end.""" | ||
| if os.environ.get("PYTEST_XDIST_WORKER"): | ||
| return | ||
| d = _exit_marker_dir() | ||
| if d and os.path.isdir(d): | ||
| import shutil | ||
|
|
||
| layoff = False | ||
| shutil.rmtree(d, ignore_errors=True) |
There was a problem hiding this comment.
🟡 Latent regression of the duplicate-failure protection on the second pytest.main() invocation in the same process. pytest_unconfigure at conftest.py:135-143 deletes the marker dir but never clears os.environ['_QD_XDIST_EXIT_MARKER_DIR'], so a second pytest.main() call sees the stale env var, skips creating a fresh dir at the early-return on lines 128-129, and any subsequent worker failure cannot write its marker — pytest_handlecrashitem then leaves _qd_suppress unset and the synthetic 'worker crashed' report is double-counted, the exact regression this PR was written to prevent. Triggers locally on tests/run_tests.py --with-offline-cache (run_count=2). Fix: add os.environ.pop('_QD_XDIST_EXIT_MARKER_DIR', None) to pytest_unconfigure (controller-only) so the next session creates a fresh directory.
Extended reasoning...
What the bug is
pytest_configure at conftest.py:121-132 creates a per-process marker directory and sets os.environ['_QD_XDIST_EXIT_MARKER_DIR'] so workers (via env-var inheritance) and the controller share the same path. The function early-returns at lines 128-129 if the env var is already set.
pytest_unconfigure at conftest.py:135-143 calls shutil.rmtree(d, ignore_errors=True) on the directory, but never clears the env var. There is no symmetric os.environ.pop('_QD_XDIST_EXIT_MARKER_DIR', None). So after the first session ends, the env var is still set in the running Python process, pointing at a now-deleted path.
How it manifests
tests/run_tests.py:test() invokes pytest.main(...) in a loop:
run_count = 1
if args.with_offline_cache:
run_count += args.rerun_with_offline_cache # default 1, so run_count = 2
...
for _ in range(run_count):
ret = _test_python(args) # calls pytest.main(...)
if ret == 5: ret = 0
if ret != 0: exit(ret)When --with-offline-cache is passed (or QD_TEST_OFFLINE_CACHE=1 is set), pytest.main() is called twice in the same Python process.
Step-by-step proof
- Iter 1, pytest_configure: env var unset → creates
/tmp/qd_xdist_exits_<pid>, sets env var. - Iter 1, pytest_unconfigure:
shutil.rmtree(d)removes the dir; env var still set. - Iter 2, pytest_configure: line 128 sees env var, returns early. No dir is recreated.
- Iter 2, worker fails a test:
pytest_runtest_logreportenters the wrapper hook, real report is sent via inner hooks duringyield, then the worker triesopen(os.path.join(d, worker_id), 'w')—dpoints to the deleted dir, so this raisesFileNotFoundError, which is swallowed byexcept OSError: pass(lines 169-170).os._exit(1)runs anyway. - Iter 2, controller pytest_handlecrashitem:
d = _exit_marker_dir()returns the stale path (truthy, so doesn't early-return at line 201).os.path.exists(marker)isFalse(whole dir is gone) → returnsNonewithout settingreport._qd_suppress = True. - Iter 2, dsession.py:452: stock xdist unconditionally calls
pytest_runtest_logreport(report=rep)on the synthetic crash report. The controller's wrapper hook sees_qd_suppressis False, soyieldlets the report flow through. The real failed report (already counted in step 4 from the worker) plus the syntheticworker 'gw0' crashed while running Xreport are both counted — exactly the double-counted-failure regression this PR was written to prevent.
Why existing code doesn't prevent it
The two early-return guards in pytest_configure (lines 126-129) check the env var to detect (a) running inside a worker process, and (b) re-entry — but the second guard implicitly assumes the env var is cleared between sessions. pytest_unconfigure does not honor that assumption. The pytest_handlecrashitem helper (lines 199-218) similarly assumes that if _exit_marker_dir() is truthy, the directory exists and any missing marker means a genuine crash, not 'we forgot to recreate the directory'.
Scope and severity
CI lanes do not exercise --with-offline-cache (no matches under .github/ for with-offline-cache or QD_TEST_OFFLINE_CACHE), so no CI workflow regresses today. The bug bites any developer running python tests/run_tests.py --with-offline-cache locally — a documented developer flag (see arg help in run_tests.py). On iteration ≥ 2 the deduplication mechanism the PR adds is silently disabled, and any test failure is double-counted in the summary.
Fix
One line in pytest_unconfigure (conftest.py:143), after the rmtree:
os.environ.pop('_QD_XDIST_EXIT_MARKER_DIR', None)This is the symmetric counterpart to the os.environ['_QD_XDIST_EXIT_MARKER_DIR'] = d set on line 132. The PYTEST_XDIST_WORKER early-return on line 137 already ensures only the controller clears it, so workers (which inherited the env var via fork/spawn) are unaffected.
| @pytest.fixture | ||
| def xdist_project(pytester, monkeypatch): | ||
| """Write a minimal conftest that reproduces our worker-retirement hooks.""" | ||
| monkeypatch.delenv("PYTEST_XDIST_WORKER", raising=False) |
There was a problem hiding this comment.
🟡 The xdist_project fixture at tests/python/test_xdist_worker_retirement.py:35 scrubs PYTEST_XDIST_WORKER (added in f63261e) but does not scrub _QD_XDIST_EXIT_MARKER_DIR, so pytester.runpytest_subprocess inherits the outer controller's marker dir and the inner pytest_configure (lines 45-52) early-returns without creating a fresh per-process dir. Inner xdist workers then write gw0/gw1 markers directly into the outer process's marker directory, where they collide with the outer xdist's gw0/gw1 worker_id namespace. Fix is a one-line scrub mirroring the existing PYTEST_XDIST_WORKER delenv: monkeypatch.delenv("_QD_XDIST_EXIT_MARKER_DIR", raising=False) at fixture start.
Extended reasoning...
What the bug is
The xdist_project fixture at tests/python/test_xdist_worker_retirement.py:32-35 does this:
@pytest.fixture
def xdist_project(pytester, monkeypatch):
monkeypatch.delenv("PYTEST_XDIST_WORKER", raising=False)
pytester.makeconftest(...)Commit f63261e added the PYTEST_XDIST_WORKER scrub to fix the inner controller misidentifying itself as a worker. But the symmetric scrub for _QD_XDIST_EXIT_MARKER_DIR (which this PR introduced in tests/python/conftest.py:132) is missing. pytester.runpytest_subprocess builds its env from os.environ.copy() — no env kwarg is passed at the call sites (lines 105, 122-129, 149-156) — so the inner subprocess inherits every env var the outer test process has.
How the leak chain plays out
- Outer
pytest_configureattests/python/conftest.py:121-132runs on the outer controller, creates/tmp/qd_xdist_exits_<outer_pid>, and exports_QD_XDIST_EXIT_MARKER_DIR=<outer_dir>intoos.environ. - All 8 outer xdist workers (
tests/run_tests.py:90setsthreads = min(8, cpu_count())and the PR adds--max-worker-restart=999999) inherit the env var. - The outer worker that picks up
tests/python/test_xdist_worker_retirement.pyruns thexdist_projectfixture; it deletesPYTEST_XDIST_WORKERfrom monkeypatched env but leaves_QD_XDIST_EXIT_MARKER_DIR=<outer_dir>intact. runpytest_subprocessspawns the inner pytest with that env. The inner conftest'spytest_configure(lines 45-52 in the embedded conftest) hits the early-return at lines 48-49 —_QD_XDIST_EXIT_MARKER_DIRis already set — and does not create a per-process dir.- The inner xdist controller spawns inner workers
gw0/gw1, which inherit the same_QD_XDIST_EXIT_MARKER_DIR=<outer_dir>. - When an inner test fails, the inner-worker
pytest_runtest_logreportwrites<outer_dir>/gw0(line 64 of the embedded conftest). The outer xdist also uses thegw0/gw1worker_id namespace.
Why existing code does not prevent it
The two early-return guards in the outer pytest_configure (lines 126-129) and the embedded one (lines 46-49) deliberately reuse the env var to avoid recreating the directory in worker subprocesses. That design assumes any subprocess sharing the env var is part of the same logical session — true for xdists own worker fanout, false here, where pytester is launching an entirely independent pytest run that should own its own marker directory. The fixture is the only place where that assumption gets violated, and the existing PYTEST_XDIST_WORKER scrub from f63261e already establishes the pattern: env vars affecting inner conftest behavior must be cleared before runpytest_subprocess.
Concrete consequences
(1) Race-window misclassification. While inner gw0s marker is present in <outer_dir>/gw0 (the few milliseconds between the inner worker writing it and the inner controllers pytest_handlecrashitem unlinking it), if outer worker gw0 genuinely segfaults/OOMs in the same window, the outer controllers pytest_handlecrashitem (conftest.py:175-200) finds the inner-test-written marker, sets _qd_suppress=True, and silently swallows the genuine crash report — exactly the green-on-crash regression earlier review iterations on this PR flagged. The window is small per failure, but the regression tests themselves trigger writes for ~25 inner failures (test_no_premature_shutdown), so cumulative exposure is non-trivial.
(2) Orphan markers on abnormal subprocess termination. If the inner pytester subprocess is killed abruptly (parent SIGTERM, OOM-killer, hardtle timeout) before its inner controller can clean up, markers stay in <outer_dir> until outer pytest_unconfigure rmtrees the dir at session end. Any later outer worker crash with a matching gw0/gw1 id is misclassified.
Step-by-step proof
- Outer
pytest_configure:_QD_XDIST_EXIT_MARKER_DIRunset → creates/tmp/qd_xdist_exits_<outer_pid>and sets env var. - Outer worker
gw3picks uptest_xdist_worker_retirement.py::test_multiple_failures_counted_correctly. Env:PYTEST_XDIST_WORKER=gw3,_QD_XDIST_EXIT_MARKER_DIR=<outer_dir>. xdist_projectfixture:monkeypatch.delenv("PYTEST_XDIST_WORKER")clears that._QD_XDIST_EXIT_MARKER_DIRis untouched.runpytest_subprocess(...)invokesPytester.popen, which usesos.environ.copy()(noenv=kwarg in any call site). Inner subprocess env contains_QD_XDIST_EXIT_MARKER_DIR=<outer_dir>.- Inner pytest starts. Embedded
pytest_configureline 48-49:if os.environ.get("_QD_XDIST_EXIT_MARKER_DIR"): return. Returns. No fresh dir. - Inner xdist forks inner workers gw0, gw1 with same env.
- Inner test fails on inner gw0. Inner worker hook writes
<outer_dir>/gw0, callsos._exit(1). - If, in the milliseconds before the inner controller unlinks
<outer_dir>/gw0, outer workergw0independently segfaults: outerpytest_handlecrashitemfinds<outer_dir>/gw0exists, treats the genuine outer crash as intentional, suppresses it.
Addressing the refutation
The refutation argues this duplicates bug_002 (already refuted) and that the harm is theoretical because the original report only empirically demonstrated env var inheritance, not actual misclassification. Thats a fair limitation on the empirical reproduction, but the marker namespace collision is mechanically inevitable once the env var leaks (both inner and outer xdist use gw<N> ids starting from 0), and the outer's pytest_handlecrashitem does not check whether a marker was written by the same xdist session — it only checks file existence. The refutation does not contest the leak itself or propose any alternative mechanism that would prevent the namespace collision.
The concrete severity argument is that the race window per inner failure is small and the orphan-marker scenario requires abnormal subprocess death — both true. Thats why this is filed as nit, not normal: the fix is one line, the precedent for the scrub already exists in the same fixture (f63261e), and applying it makes the regression tests cleanly isolated from the outer process state regardless of how rare the bad timing is.
How to fix
One line, mirroring the existing PYTEST_XDIST_WORKER scrub:
@pytest.fixture
def xdist_project(pytester, monkeypatch):
monkeypatch.delenv("PYTEST_XDIST_WORKER", raising=False)
monkeypatch.delenv("_QD_XDIST_EXIT_MARKER_DIR", raising=False)
pytester.makeconftest(...)The inner pytest_configure then takes the create-fresh-dir branch and the inner session has its own marker namespace, fully isolated from the outer.
Coverage Report (
|
| File | Coverage | Missing |
|---|---|---|
🔴 tests/python/conftest.py |
34% | 118,129,137,139-141,143,158,163-171,184-199 |
🔴 tests/python/test_xdist_worker_retirement.py |
43% | 35-36,90,96,105-106,110,122,130,136,149,157-158 |
Diff coverage: 37% · Overall: 22% · 73 lines, 46 missing
Summary
The taichi fork of pytest-xdist used
os._exit(0)in itsretire()method, which killed workers before the execnet channel could flush the test failure report. This made all error messages invisible when running with multiple threads.This PR replaces the fork with stock pytest-xdist >= 3.7, while keeping
pytest_hardtlefor hard-kill timeouts (it uses a CFFI-compiled C watchdog that can kill tests hung in native GPU calls, even when the GIL is held — stock pytest-timeout's signal method cannot do this).Changes
pytest-xdist >= 3.7pytest_hardtlefor hard-kill timeouts; suppress stockpytest-timeoutwith-p no:timeoutto avoid hook conflictspytest_runtest_logreporthook — justos._exit(1)after failure, relying on stock xdist to restart the workerpytest_handlecrashitemhook to suppress synthetic "worker crashed" duplicate reports (stock xdist has noworkerretireevent, so it treats intentional exits as crashes)--max-worker-restart=999999to prevent stock xdist's crash-cap from shutting down the session when many tests failrequirements_test_xdist.txtand all CI references to itpytest_rerunfailures.works_with_current_xdistversion hack