Skip to content

[Test] Drop taichi xdist fork, use stock pytest-xdist#556

Open
hughperkins wants to merge 14 commits intomainfrom
hp/fix-xdist-error-reporting-2
Open

[Test] Drop taichi xdist fork, use stock pytest-xdist#556
hughperkins wants to merge 14 commits intomainfrom
hp/fix-xdist-error-reporting-2

Conversation

@hughperkins
Copy link
Copy Markdown
Collaborator

@hughperkins hughperkins commented Apr 23, 2026

Summary

The taichi fork of pytest-xdist used os._exit(0) in its retire() method, which killed workers before the execnet channel could flush the test failure report. This made all error messages invisible when running with multiple threads.

This PR replaces the fork with stock pytest-xdist >= 3.7, while keeping pytest_hardtle for hard-kill timeouts (it uses a CFFI-compiled C watchdog that can kill tests hung in native GPU calls, even when the GIL is held — stock pytest-timeout's signal method cannot do this).

Changes

  • Replace taichi xdist fork with stock pytest-xdist >= 3.7
  • Keep pytest_hardtle for hard-kill timeouts; suppress stock pytest-timeout with -p no:timeout to avoid hook conflicts
  • Simplify conftest pytest_runtest_logreport hook — just os._exit(1) after failure, relying on stock xdist to restart the worker
  • Add pytest_handlecrashitem hook to suppress synthetic "worker crashed" duplicate reports (stock xdist has no workerretire event, so it treats intentional exits as crashes)
  • Set --max-worker-restart=999999 to prevent stock xdist's crash-cap from shutting down the session when many tests fail
  • Remove requirements_test_xdist.txt and all CI references to it
  • Remove pytest_rerunfailures.works_with_current_xdist version hack

The taichi fork of pytest-xdist used os._exit(0) in its retire() method,
which killed workers before the execnet channel could flush the test
failure report. This made all error messages invisible when running
with multiple threads.

Stock xdist >= 3.4 already handles worker crashes correctly: it
preserves the failure report, restarts the worker, and displays full
error details in the terminal summary. The conftest hook now just calls
os._exit(1) after a failure, and stock xdist does the rest.

Changes:
- Replace taichi xdist fork with stock pytest-xdist >= 3.7 + pytest-timeout
- Simplify conftest pytest_runtest_logreport hook (no fork-specific APIs)
- Remove requirements_test_xdist.txt and all CI references to it
- Remove pytest_rerunfailures.works_with_current_xdist version hack
- Replace pytest_hardtle with pytest-timeout in run_tests.py
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Code review skipped — your organization's overage spend limit has been reached.

Code review is billed via overage credits. To resume reviews, an organization admin can raise the monthly limit at claude.ai/admin-settings/claude-code.

Once credits are available, reopen this pull request to trigger a review.

@hughperkins
Copy link
Copy Markdown
Collaborator Author

@claude review

…porting-2

Made-with: Cursor

# Conflicts:
#	tests/python/conftest.py
Comment thread tests/run_tests.py
Comment thread tests/python/conftest.py Outdated
Comment thread tests/python/conftest.py
Comment thread tests/run_tests.py
Stock pytest-timeout uses Python-level signals that can't interrupt
native CUDA/HIP kernel hangs. pytest_hardtle compiles a C watchdog
via CFFI that calls _exit(1) from a signal handler, which always works.

Restore `-p pytest_hardtle` (with `-p no:timeout` to suppress stock
pytest-timeout) and drop pytest-timeout from pyproject.toml.
Comment thread tests/run_tests.py
@hughperkins hughperkins changed the title [Test] Drop taichi xdist fork, use stock pytest-xdist + pytest-timeout [Test] Drop taichi xdist fork, use stock pytest-xdist Apr 24, 2026
Documents why we use a CFFI-based hard-kill timeout instead of stock
pytest-timeout, and why the two cannot be loaded simultaneously.
- Remove always-true `hasattr(pytest, "version_tuple")` guard in
  conftest.py (pytest >= 7.0 is guaranteed by dependency constraints).
- Add comments in run_tests.py explaining why we suppress stock
  pytest-timeout and why pytest_hardtle is preferred (CFFI-based
  native signal handler can kill GIL-held native GPU hangs).
@hughperkins hughperkins changed the title [Test] Drop taichi xdist fork, use stock pytest-xdist [Test] Drop taichi xdist fork, use stock pytest-xdist (keep pytest_hardtle) Apr 24, 2026
@hughperkins hughperkins changed the title [Test] Drop taichi xdist fork, use stock pytest-xdist (keep pytest_hardtle) [Test] Drop taichi xdist fork, use stock pytest-xdist Apr 24, 2026
@hughperkins
Copy link
Copy Markdown
Collaborator Author

@claude review

Comment thread tests/python/conftest.py Outdated
@hughperkins
Copy link
Copy Markdown
Collaborator Author

hughperkins commented Apr 24, 2026

Opus writes:

Closing in favor of the simpler fix in #557

Claude's review correctly identifies two regressions with the os._exit(1) approach on stock xdist:

  1. Double-counted failures: Stock xdist has no workerretire event (that was fork-specific), so it treats every os._exit(1) as a crash. The real test report gets sent (xdist's hook runs before our trylast hook), but the controller also synthesizes a second "worker crashed while running X" report. Each failure appears twice in the summary and counts double against --maxfail.

  2. Silent session shutdown: Each exit counts toward --max-worker-restart (default numprocesses * 4). With enough failures, xdist triggers triggershutdown() and silently drops remaining tests.

Regression 2 is fixable with --max-worker-restart=<large value>, but regression 1 (double-counting) can't be fixed from a conftest hook without access to stock xdist's channel internals.

The simpler approach is to stay on the Taichi fork (which has workerretire) and just fix the immediate conflicts, which is what #557 does.

Stock xdist treats os._exit(1) as a worker crash, which causes:
1. A synthetic "worker crashed" report duplicating the real failure
2. Each exit counting toward --max-worker-restart, eventually shutting
   down the session and silently dropping remaining tests

Fix both by:
- Adding a pytest_handlecrashitem hook (firstresult=True) that marks
  the synthetic crash report as passed, since the real report was
  already sent before the worker exited
- Setting --max-worker-restart=999999 so intentional worker kills
  don't trigger xdist's crash-cap shutdown
@hughperkins hughperkins reopened this Apr 24, 2026
@hughperkins hughperkins changed the title [Test] Drop taichi xdist fork, use stock pytest-xdist [Test] Drop taichi xdist fork, use stock pytest-xdist (keep pytest_hardtle) Apr 24, 2026
@hughperkins hughperkins changed the title [Test] Drop taichi xdist fork, use stock pytest-xdist (keep pytest_hardtle) [Test] Drop taichi xdist fork, use stock pytest-xdist Apr 24, 2026
Uses pytester to verify in a subprocess that:
- pytest_handlecrashitem suppresses synthetic "worker crashed" duplicates
  (failures counted exactly once, not doubled)
- --max-worker-restart=999999 prevents premature session shutdown when
  many tests fail and workers are intentionally killed

Also fix pytest_handlecrashitem to set report.when="teardown" so the
suppressed crash report doesn't inflate the passed count.
Comment thread tests/python/conftest.py Outdated
Comment thread tests/python/test_xdist_worker_retirement.py
…om genuine crashes

The previous implementation blindly suppressed all crash reports, causing:
1. Double-counted failures (1 failed + 1 passed for the same test)
2. Genuine crashes (segfaults, OOM) silently marked as passed

Use a marker-file protocol: workers write a marker before os._exit(1),
and the controller only suppresses crash reports when a marker is found.
A wrapper hookimpl on pytest_runtest_logreport swallows the suppressed
synthetic report so it never reaches the terminal reporter or stats.
…iscovery

CI runs pytest rooted at tests/python/ via run_tests.py, so tests placed
as siblings of that directory are never collected.
The top-level import at line 3 already covers all usages.
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 tests/test_xdist_worker_retirement.py:32-54 — The new regression tests in tests/test_xdist_worker_retirement.py will self-destruct when run under outer pytest-xdist (i.e. run_tests.py's -n threads --dist=worksteal, the standard CI configuration). pytester.runpytest_subprocess propagates PYTEST_XDIST_WORKER from the parent (pytester.popen does env = os.environ.copy()), and the inline conftest's hook (lines 40-46) gates os._exit(1) purely on that env var with no check that the current process is actually a worker — so the inner pytest controller sees the inherited env var and kills itself on the first failed report from an inner worker, before pytest emits its terminal summary, producing ValueError: Pytest terminal summary report not found from parseoutcomes(). Currently latent because the file is not collected by run_tests.py (separate prior comment), but resolving that comment by moving the file into tests/python/ immediately exposes this; the same fragile gate lives in the production hook at tests/python/conftest.py:124. Fix: monkeypatch.delenv("PYTEST_XDIST_WORKER", raising=False) in the xdist_project fixture before runpytest_subprocess, and/or also gate the conftest hook on hasattr(config, 'workerinput') rather than the env var alone.

    Extended reasoning...

    What the bug is

    The regression tests added by this PR run pytest-in-pytest via pytester.runpytest_subprocess(...). Inside _pytest/pytester.py, Pytester.popen does env = os.environ.copy() (line 1360 of pytester.py in the installed pytest 8.x), with no scrubbing of xdist-specific variables. So every env var present in the parent process — including PYTEST_XDIST_WORKER, set by xdist on every worker — is propagated verbatim into the inner pytest subprocess that pytester spawns.

    The inline conftest the fixture writes (test_xdist_worker_retirement.py:40-46) installs a trylast pytest_runtest_logreport hook that gates os._exit(1) on a single condition:

    if not os.environ.get("PYTEST_XDIST_WORKER"):
        return

    There is no check distinguishing "this process is itself an xdist worker" from "this process inherited the env var from a worker parent." A controller running under xdist does not have PYTEST_XDIST_WORKER set by xdist; xdist only sets it on workers (via execnet remote setup). But the inner controller in pytester inherits the value from the outer worker that pytester is running in, and the inline hook then misclassifies the inner controller as a worker.

    The code path that triggers it

    1. run_tests.py invokes pytest with -n threads --dist=worksteal (the default CI path; lines 95-100 of the post-PR run_tests.py).
    2. xdist spawns worker processes; each has PYTEST_XDIST_WORKER=gw0 (etc.) in its env.
    3. A worker collects and runs test_xdist_worker_retirement.py::TestNoDuplicateFailures::test_single_failure_counted_once.
    4. The test calls xdist_project.runpytest_subprocess("-n", "2", "--dist=worksteal", ...).
    5. pytester.popen does env = os.environ.copy()PYTEST_XDIST_WORKER=gw0 is in the inner subprocess's env.
    6. The inner pytest controller starts, loads the inline conftest. Its env still has PYTEST_XDIST_WORKER.
    7. xdist on the inner controller spawns 2 inner workers (gw0', gw1') for -n 2.
    8. An inner worker runs test_fail() which raises AssertionError, sends a failed testreport to the inner controller via execnet.
    9. xdist's controller-side dispatch (DSession.worker_testreport) calls config.hook.pytest_runtest_logreport(report=rep) on the inner controller.
    10. The inline conftest's trylast hook fires on the inner controller, sees PYTEST_XDIST_WORKER set, and calls os._exit(1).
    11. The inner controller dies mid-session, before pytest's terminal summary finalizer runs.
    12. The outer pytester reads the truncated stdout; RunResult.parseoutcomes() does not find a X failed in Ys summary line and raises ValueError("Pytest terminal summary report not found").
    13. The outer test errors out (not even fails — errors during result parsing).

    Why existing safeguards do not prevent this

    • The inline conftest has no controller-vs-worker distinction; only the env var.
    • pytester's runpytest_subprocess accepts no parameter to scrub specific env vars; the test would have to monkeypatch them out before the call, or build a custom env.
    • The pytest_handlecrashitem hook on the inner controller (lines 49-52) only mutates the synthetic crash report and does not affect the pytest_runtest_logreport path.
    • The PR's own --max-worker-restart=999999 setting is on the outer invocation; it does not propagate into the inner controller's setup.

    Step-by-step empirical proof

    Setting PYTEST_XDIST_WORKER in the parent shell mimics what an outer xdist worker would do:

    $ PYTEST_XDIST_WORKER=gw0 python -m pytest tests/test_xdist_worker_retirement.py::TestNoDuplicateFailures::test_single_failure_counted_once -v
    ...
    >   raise ValueError("Pytest terminal summary report not found")
    E   ValueError: Pytest terminal summary report not found
    

    Without the env var: 1 passed. With it: ERROR from parseoutcomes. Same failure mode under pytest -n 2 --dist=worksteal directly — all three regression tests in the file error out, because each spawns a pytester subprocess that inherits the outer worker's env.

    Impact

    The regression tests added by this PR cannot run under the very environment they are supposed to be regressing. As soon as the prior comment about file collection is addressed (by moving the file into tests/python/), CI will fail: every invocation of run_tests.py runs under -n threads, every test runs inside a worker, every pytester subprocess inherits PYTEST_XDIST_WORKER, and all three regression tests will error out at parseoutcomes(). The PR's claimed regression guard becomes a CI-blocking failure.

    The same fragile gate exists in the production hook at tests/python/conftest.py:124 — no test in the suite spawns a pytest subprocess today, so it is latent there, but any future test that does will hit the same trap.

    How to fix

    Two complementary fixes; (1) is the minimum, (2) is more defensive:

    1. In the fixture — scrub the env var before runpytest_subprocess:

      @pytest.fixture
      def xdist_project(pytester, monkeypatch):
          monkeypatch.delenv("PYTEST_XDIST_WORKER", raising=False)
          pytester.makeconftest(...)
          return pytester

      Or pass an explicit env= argument when calling runpytest_subprocess (or wrap the popen).

    2. In the conftest hook — additionally check that this process is actually serving as a pytest-xdist worker (e.g. by gating on hasattr(config, 'workerinput') from a session-scoped fixture or by storing the worker status in a module-level flag set in pytest_configure). This protects any future test that spawns pytest subprocesses, including the production hook at tests/python/conftest.py:124.

Comment thread .github/workflows/scripts_new/linux/4_test.sh
scrub PYTEST_XDIST_WORKER in pytester fixture to prevent env var leak

The linux.yml test-cuda job still referenced the deleted
requirements_test_xdist.txt. Also, pytester propagates the outer
worker's PYTEST_XDIST_WORKER into inner subprocesses, causing the
inner controller to misidentify itself as a worker and os._exit(1).
Comment thread tests/python/conftest.py
Comment on lines 145 to +173

# Don't call interactor.retire() — it uses os._exit(0) which kills
# the process before execnet's IO thread can flush the channel buffer.
# The test failure report (queued by xdist's own hook, which ran before
# this trylast hook) would be lost, hiding all error messages.
interactor.sendevent("workerretire", layoff=layoff)
time.sleep(0.2)
os._exit(0)
@pytest.hookimpl(wrapper=True, tryfirst=True)
def pytest_runtest_logreport(report):
"""Handle xdist worker retirement and crash-report suppression.

On the controller: swallow synthetic crash reports that were already marked for suppression by
pytest_handlecrashitem.

import importlib
import sys
On workers: after a test failure, write an intentional-exit marker and kill the process so it
restarts with clean GPU state. The real test report is sent by inner hooks (including xdist's
report-forwarding hook) during ``yield`` before we exit.
"""
if getattr(report, "_qd_suppress", False):
return None

import pytest
result = yield

if os.environ.get("PYTEST_XDIST_WORKER") and report.outcome in ("rerun", "error", "failed"):
d = _exit_marker_dir()
if d:
worker_id = os.environ["PYTEST_XDIST_WORKER"]
try:
with open(os.path.join(d, worker_id), "w") as f:
f.write(report.nodeid)
except OSError:
pass
os._exit(1)

return result
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The wrapper hook at tests/python/conftest.py:146-173 (@pytest.hookimpl(wrapper=True, tryfirst=True)) has if getattr(report, "_qd_suppress", False): return None at line 158, BEFORE the result = yield at line 160 — pluggy treats any function containing yield as a generator, so the early return raises StopIteration on next(), which pluggy converts to RuntimeError("wrap_controller at pytest_runtest_logreport ... did not yield"). This fires every time a worker dies via os._exit(1) after a test failure: pytest_handlecrashitem at line 176 sets report._qd_suppress=True and then xdist/dsession.py:452 unconditionally re-invokes pytest_runtest_logreport(report=rep), which enters the wrapper, hits the early return, and raises. The RuntimeError surfaces as INTERNALERROR and aborts the test loop, silently dropping the remaining tests — strictly worse than the failure-visibility issue this PR set out to fix. Fix: move the _qd_suppress check after the yield (yield first, then discard the result), or convert the hook to a non-wrapper @pytest.hookimpl(tryfirst=True) that mutates the report in place.

Extended reasoning...

What the bug is

The hook at tests/python/conftest.py:146-173:

@pytest.hookimpl(wrapper=True, tryfirst=True)
def pytest_runtest_logreport(report):
    if getattr(report, "_qd_suppress", False):
        return None        # <-- early return BEFORE yield

    result = yield

    if os.environ.get("PYTEST_XDIST_WORKER") and report.outcome in ("rerun", "error", "failed"):
        ...
        os._exit(1)

    return result

A wrapper=True hookimpl whose body contains yield is a generator. Pluggy drives it via next() to run the pre-yield code, calls .send(value) to resume after the inner hooks, and expects the generator to either return after the yield (StopIteration with a value) or raise. If the generator returns before yielding, the very first next() raises StopIteration; pluggy detects this in _callers.py and re-raises as RuntimeError: wrap_controller at pytest_runtest_logreport ... did not yield. That RuntimeError surfaces as INTERNALERROR in pytest_runtestloop, which aborts the entire session.

How the bad path is triggered

The PR introduces two cooperating pieces:

  1. The wrapper hook above — fires on every pytest_runtest_logreport invocation.
  2. pytest_handlecrashitem at conftest.py:176-200 — sets report._qd_suppress = True on the synthetic crash report and returns truthy.

The trigger chain (verified against xdist/dsession.py from pytest-xdist 3.x):

  1. Worker fails a test → real TestReport is sent over the channel, conftest wrapper runs cleanly to its os._exit(1).
  2. Controller detects errordown, calls handle_crashitem(crashitem, errors).
  3. handle_crashitem invokes the firstresult hook pytest_handlecrashitem. The PRs impl sets report._qd_suppress = True and returns True.
  4. The very next line in dsession.py:452 unconditionally calls self.config.hook.pytest_runtest_logreport(report=rep) on the same synthetic report.
  5. Pluggy enters the PRs wrapper hook with tryfirst=True. Body sees _qd_suppress flag → return None on line 158.
  6. The function never reaches yield on line 160. Pluggy raises RuntimeError: ... did not yield.
  7. The error propagates out of _pytest_runtest_logreport → INTERNALERROR → pytest_runtestloop aborts.

Why the existing code does not prevent it

  • dsession.py:452 has no opt-out path: the pytest_runtest_logreport call is unconditional after the firstresult hook returns.
  • The firstresult=True decorator on pytest_handlecrashitem only short-circuits that hook chain; it does not gate the unrelated logreport call on the next line.
  • pluggys wrapper protocol has no "yielded zero times" mode — wrappers must yield exactly once.

Impact

Every xdist run with a worker-dying-after-failure (which is the central path of this PRs design) hits the bug on the first such failure and aborts the test loop with INTERNALERROR. With N>=2 failing tests under -n 2 --dist=worksteal --max-worker-restart=999999, only the first one or two failures are reported before the loop dies; remaining tests are silently dropped. This is strictly worse than the original PR-author-acknowledged double-counting issue: instead of a noisy 2x in the summary, real test failures vanish.

It also means the new regression tests at tests/python/test_xdist_worker_retirement.py::TestNoDuplicateFailures::test_multiple_failures_counted_correctly (expects failed=4) and TestSessionCompletesWithManyFailures::test_no_premature_shutdown (expects failed=20) will themselves fail when actually run — the conftest they install via pytester.makeconftest has the identical wrapper-without-pre-yield check and will INTERNALERROR before reaching their expected counts.

Step-by-step concrete proof

Setup: pytest 9.x + pytest-xdist 3.8+ + pluggy 1.6, exact PR conftest, three tests of which two always fail, run with -n 2 --dist=worksteal --max-worker-restart=999999.

  1. gw0 runs test_fail_1, asserts False → real failed report sent to controller.
  2. Conftest wrapper resumes after yield, takes the outcome in ("rerun","error","failed") branch, writes the marker, calls os._exit(1). gw0 dies.
  3. Controller emits FAILED test_fail_1 from the real report. Schedules errordown.
  4. handle_crashitem(node="gw0", crashitem="test_fail_1") runs:
    • Synthesizes TestReport(outcome="failed", when="???", longrepr="worker gw0 crashed while running test_fail_1").
    • Calls pytest_handlecrashitem(...) → PR impl reads marker, sets report._qd_suppress=True, returns True.
    • Next line: self.config.hook.pytest_runtest_logreport(report=rep).
  5. Pluggy invokes PR wrapper with tryfirst=True:
    • First statement: if getattr(report, "_qd_suppress", False): return None.
    • Function is a generator (because of yield further down) → next(gen) raises StopIteration.
    • Pluggy _callers.py detects this and calls _raise_wrapfail(gen, "did not yield")RuntimeError.
  6. RuntimeError propagates up _multicallpytest_runtestloop catches and reports INTERNALERROR> RuntimeError: wrap_controller at pytest_runtest_logreport ... did not yield.
  7. Test loop aborts. test_fail_2 (and any further failing test) is never reported. Summary shows 1 failed in X.Xs despite multiple real failures pending.

This was independently reproduced by three verifiers using the exact PR code with stock pytest+xdist+pluggy.

How to fix

Two clean options:

  1. Yield first, gate the post-yield logic on the flag — preserves wrapper semantics and still suppresses the synthetic report:

    @pytest.hookimpl(wrapper=True, tryfirst=True)
    def pytest_runtest_logreport(report):
        result = yield
        if getattr(report, "_qd_suppress", False):
            return None
        if os.environ.get("PYTEST_XDIST_WORKER") and report.outcome in ("rerun", "error", "failed"):
            ...
            os._exit(1)
        return result

    Note this still lets inner hooks observe the synthetic report; for stronger suppression, mutate the report in pytest_handlecrashitem (e.g. report.outcome="passed"; report.when="teardown") instead, which the existing terminal categorisation will collapse to the "" bucket.

  2. Drop the wrapper entirely — use a non-wrapper @pytest.hookimpl(tryfirst=True) that mutates the report (the suppression-via-mutation pattern an earlier review iteration on this PR already validated empirically with report.when = "teardown").

The same identical bug exists in the embedded conftest at tests/python/test_xdist_worker_retirement.py:53-68 (written via pytester.makeconftest), so any fix needs to be applied there too — those are the regression tests for this exact mechanism, and they currently encode the broken pattern.

Comment thread tests/python/conftest.py
Comment on lines +135 to +143
def pytest_unconfigure(config):
"""Clean up the marker directory at session end."""
if os.environ.get("PYTEST_XDIST_WORKER"):
return
d = _exit_marker_dir()
if d and os.path.isdir(d):
import shutil

layoff = False
shutil.rmtree(d, ignore_errors=True)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Latent regression of the duplicate-failure protection on the second pytest.main() invocation in the same process. pytest_unconfigure at conftest.py:135-143 deletes the marker dir but never clears os.environ['_QD_XDIST_EXIT_MARKER_DIR'], so a second pytest.main() call sees the stale env var, skips creating a fresh dir at the early-return on lines 128-129, and any subsequent worker failure cannot write its marker — pytest_handlecrashitem then leaves _qd_suppress unset and the synthetic 'worker crashed' report is double-counted, the exact regression this PR was written to prevent. Triggers locally on tests/run_tests.py --with-offline-cache (run_count=2). Fix: add os.environ.pop('_QD_XDIST_EXIT_MARKER_DIR', None) to pytest_unconfigure (controller-only) so the next session creates a fresh directory.

Extended reasoning...

What the bug is

pytest_configure at conftest.py:121-132 creates a per-process marker directory and sets os.environ['_QD_XDIST_EXIT_MARKER_DIR'] so workers (via env-var inheritance) and the controller share the same path. The function early-returns at lines 128-129 if the env var is already set.

pytest_unconfigure at conftest.py:135-143 calls shutil.rmtree(d, ignore_errors=True) on the directory, but never clears the env var. There is no symmetric os.environ.pop('_QD_XDIST_EXIT_MARKER_DIR', None). So after the first session ends, the env var is still set in the running Python process, pointing at a now-deleted path.

How it manifests

tests/run_tests.py:test() invokes pytest.main(...) in a loop:

run_count = 1
if args.with_offline_cache:
    run_count += args.rerun_with_offline_cache  # default 1, so run_count = 2
...
for _ in range(run_count):
    ret = _test_python(args)   # calls pytest.main(...)
    if ret == 5: ret = 0
    if ret != 0: exit(ret)

When --with-offline-cache is passed (or QD_TEST_OFFLINE_CACHE=1 is set), pytest.main() is called twice in the same Python process.

Step-by-step proof

  1. Iter 1, pytest_configure: env var unset → creates /tmp/qd_xdist_exits_<pid>, sets env var.
  2. Iter 1, pytest_unconfigure: shutil.rmtree(d) removes the dir; env var still set.
  3. Iter 2, pytest_configure: line 128 sees env var, returns early. No dir is recreated.
  4. Iter 2, worker fails a test: pytest_runtest_logreport enters the wrapper hook, real report is sent via inner hooks during yield, then the worker tries open(os.path.join(d, worker_id), 'w')d points to the deleted dir, so this raises FileNotFoundError, which is swallowed by except OSError: pass (lines 169-170). os._exit(1) runs anyway.
  5. Iter 2, controller pytest_handlecrashitem: d = _exit_marker_dir() returns the stale path (truthy, so doesn't early-return at line 201). os.path.exists(marker) is False (whole dir is gone) → returns None without setting report._qd_suppress = True.
  6. Iter 2, dsession.py:452: stock xdist unconditionally calls pytest_runtest_logreport(report=rep) on the synthetic crash report. The controller's wrapper hook sees _qd_suppress is False, so yield lets the report flow through. The real failed report (already counted in step 4 from the worker) plus the synthetic worker 'gw0' crashed while running X report are both counted — exactly the double-counted-failure regression this PR was written to prevent.

Why existing code doesn't prevent it

The two early-return guards in pytest_configure (lines 126-129) check the env var to detect (a) running inside a worker process, and (b) re-entry — but the second guard implicitly assumes the env var is cleared between sessions. pytest_unconfigure does not honor that assumption. The pytest_handlecrashitem helper (lines 199-218) similarly assumes that if _exit_marker_dir() is truthy, the directory exists and any missing marker means a genuine crash, not 'we forgot to recreate the directory'.

Scope and severity

CI lanes do not exercise --with-offline-cache (no matches under .github/ for with-offline-cache or QD_TEST_OFFLINE_CACHE), so no CI workflow regresses today. The bug bites any developer running python tests/run_tests.py --with-offline-cache locally — a documented developer flag (see arg help in run_tests.py). On iteration ≥ 2 the deduplication mechanism the PR adds is silently disabled, and any test failure is double-counted in the summary.

Fix

One line in pytest_unconfigure (conftest.py:143), after the rmtree:

os.environ.pop('_QD_XDIST_EXIT_MARKER_DIR', None)

This is the symmetric counterpart to the os.environ['_QD_XDIST_EXIT_MARKER_DIR'] = d set on line 132. The PYTEST_XDIST_WORKER early-return on line 137 already ensures only the controller clears it, so workers (which inherited the env var via fork/spawn) are unaffected.

Comment on lines +32 to +35
@pytest.fixture
def xdist_project(pytester, monkeypatch):
"""Write a minimal conftest that reproduces our worker-retirement hooks."""
monkeypatch.delenv("PYTEST_XDIST_WORKER", raising=False)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The xdist_project fixture at tests/python/test_xdist_worker_retirement.py:35 scrubs PYTEST_XDIST_WORKER (added in f63261e) but does not scrub _QD_XDIST_EXIT_MARKER_DIR, so pytester.runpytest_subprocess inherits the outer controller's marker dir and the inner pytest_configure (lines 45-52) early-returns without creating a fresh per-process dir. Inner xdist workers then write gw0/gw1 markers directly into the outer process's marker directory, where they collide with the outer xdist's gw0/gw1 worker_id namespace. Fix is a one-line scrub mirroring the existing PYTEST_XDIST_WORKER delenv: monkeypatch.delenv("_QD_XDIST_EXIT_MARKER_DIR", raising=False) at fixture start.

Extended reasoning...

What the bug is

The xdist_project fixture at tests/python/test_xdist_worker_retirement.py:32-35 does this:

@pytest.fixture
def xdist_project(pytester, monkeypatch):
    monkeypatch.delenv("PYTEST_XDIST_WORKER", raising=False)
    pytester.makeconftest(...)

Commit f63261e added the PYTEST_XDIST_WORKER scrub to fix the inner controller misidentifying itself as a worker. But the symmetric scrub for _QD_XDIST_EXIT_MARKER_DIR (which this PR introduced in tests/python/conftest.py:132) is missing. pytester.runpytest_subprocess builds its env from os.environ.copy() — no env kwarg is passed at the call sites (lines 105, 122-129, 149-156) — so the inner subprocess inherits every env var the outer test process has.

How the leak chain plays out

  1. Outer pytest_configure at tests/python/conftest.py:121-132 runs on the outer controller, creates /tmp/qd_xdist_exits_<outer_pid>, and exports _QD_XDIST_EXIT_MARKER_DIR=<outer_dir> into os.environ.
  2. All 8 outer xdist workers (tests/run_tests.py:90 sets threads = min(8, cpu_count()) and the PR adds --max-worker-restart=999999) inherit the env var.
  3. The outer worker that picks up tests/python/test_xdist_worker_retirement.py runs the xdist_project fixture; it deletes PYTEST_XDIST_WORKER from monkeypatched env but leaves _QD_XDIST_EXIT_MARKER_DIR=<outer_dir> intact.
  4. runpytest_subprocess spawns the inner pytest with that env. The inner conftest's pytest_configure (lines 45-52 in the embedded conftest) hits the early-return at lines 48-49 — _QD_XDIST_EXIT_MARKER_DIR is already set — and does not create a per-process dir.
  5. The inner xdist controller spawns inner workers gw0/gw1, which inherit the same _QD_XDIST_EXIT_MARKER_DIR=<outer_dir>.
  6. When an inner test fails, the inner-worker pytest_runtest_logreport writes <outer_dir>/gw0 (line 64 of the embedded conftest). The outer xdist also uses the gw0/gw1 worker_id namespace.

Why existing code does not prevent it

The two early-return guards in the outer pytest_configure (lines 126-129) and the embedded one (lines 46-49) deliberately reuse the env var to avoid recreating the directory in worker subprocesses. That design assumes any subprocess sharing the env var is part of the same logical session — true for xdists own worker fanout, false here, where pytester is launching an entirely independent pytest run that should own its own marker directory. The fixture is the only place where that assumption gets violated, and the existing PYTEST_XDIST_WORKER scrub from f63261e already establishes the pattern: env vars affecting inner conftest behavior must be cleared before runpytest_subprocess.

Concrete consequences

(1) Race-window misclassification. While inner gw0s marker is present in <outer_dir>/gw0 (the few milliseconds between the inner worker writing it and the inner controllers pytest_handlecrashitem unlinking it), if outer worker gw0 genuinely segfaults/OOMs in the same window, the outer controllers pytest_handlecrashitem (conftest.py:175-200) finds the inner-test-written marker, sets _qd_suppress=True, and silently swallows the genuine crash report — exactly the green-on-crash regression earlier review iterations on this PR flagged. The window is small per failure, but the regression tests themselves trigger writes for ~25 inner failures (test_no_premature_shutdown), so cumulative exposure is non-trivial.

(2) Orphan markers on abnormal subprocess termination. If the inner pytester subprocess is killed abruptly (parent SIGTERM, OOM-killer, hardtle timeout) before its inner controller can clean up, markers stay in <outer_dir> until outer pytest_unconfigure rmtrees the dir at session end. Any later outer worker crash with a matching gw0/gw1 id is misclassified.

Step-by-step proof

  1. Outer pytest_configure: _QD_XDIST_EXIT_MARKER_DIR unset → creates /tmp/qd_xdist_exits_<outer_pid> and sets env var.
  2. Outer worker gw3 picks up test_xdist_worker_retirement.py::test_multiple_failures_counted_correctly. Env: PYTEST_XDIST_WORKER=gw3, _QD_XDIST_EXIT_MARKER_DIR=<outer_dir>.
  3. xdist_project fixture: monkeypatch.delenv("PYTEST_XDIST_WORKER") clears that. _QD_XDIST_EXIT_MARKER_DIR is untouched.
  4. runpytest_subprocess(...) invokes Pytester.popen, which uses os.environ.copy() (no env= kwarg in any call site). Inner subprocess env contains _QD_XDIST_EXIT_MARKER_DIR=<outer_dir>.
  5. Inner pytest starts. Embedded pytest_configure line 48-49: if os.environ.get("_QD_XDIST_EXIT_MARKER_DIR"): return. Returns. No fresh dir.
  6. Inner xdist forks inner workers gw0, gw1 with same env.
  7. Inner test fails on inner gw0. Inner worker hook writes <outer_dir>/gw0, calls os._exit(1).
  8. If, in the milliseconds before the inner controller unlinks <outer_dir>/gw0, outer worker gw0 independently segfaults: outer pytest_handlecrashitem finds <outer_dir>/gw0 exists, treats the genuine outer crash as intentional, suppresses it.

Addressing the refutation

The refutation argues this duplicates bug_002 (already refuted) and that the harm is theoretical because the original report only empirically demonstrated env var inheritance, not actual misclassification. Thats a fair limitation on the empirical reproduction, but the marker namespace collision is mechanically inevitable once the env var leaks (both inner and outer xdist use gw<N> ids starting from 0), and the outer's pytest_handlecrashitem does not check whether a marker was written by the same xdist session — it only checks file existence. The refutation does not contest the leak itself or propose any alternative mechanism that would prevent the namespace collision.

The concrete severity argument is that the race window per inner failure is small and the orphan-marker scenario requires abnormal subprocess death — both true. Thats why this is filed as nit, not normal: the fix is one line, the precedent for the scrub already exists in the same fixture (f63261e), and applying it makes the regression tests cleanly isolated from the outer process state regardless of how rare the bad timing is.

How to fix

One line, mirroring the existing PYTEST_XDIST_WORKER scrub:

@pytest.fixture
def xdist_project(pytester, monkeypatch):
    monkeypatch.delenv("PYTEST_XDIST_WORKER", raising=False)
    monkeypatch.delenv("_QD_XDIST_EXIT_MARKER_DIR", raising=False)
    pytester.makeconftest(...)

The inner pytest_configure then takes the create-fresh-dir branch and the inner session has its own marker namespace, fully isolated from the outer.

@github-actions
Copy link
Copy Markdown

Coverage Report (f63261e64)

File Coverage Missing
🔴 tests/python/conftest.py 34% 118,129,137,139-141,143,158,163-171,184-199
🔴 tests/python/test_xdist_worker_retirement.py 43% 35-36,90,96,105-106,110,122,130,136,149,157-158

Diff coverage: 37% · Overall: 22% · 73 lines, 46 missing

Full annotated report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant