[CI] Fix Vulkan debugPrintf flake with session-scoped warmup by hughperkins · Pull Request #571 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-04-26T18:47:06Z

The validation layer's debugPrintf callback has a race condition on the first kernel dispatch in a fresh process: vkQueueWaitIdle() can return before the callback fires, causing empty captured stdout. This only manifests under pytest-xdist with multiple workers contending for the GPU.

Run one throwaway overflow kernel, for each worker that runs vulkan tests with debug=True, to prime the callback pipeline before real tests run. Validated with 10 consecutive rounds passing on T4 CI runner, where previously ~60% of rounds failed.

Issue: #

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 349d291e6e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-04-26T18:51:24Z

+            _warmup()
+            qd.sync()
+            sys.stdout.flush()
+            _vulkan_debug_warmed_up = True


Move warmup dispatch out of per-test capture window

This warmup runs during wanted_arch setup and intentionally triggers an overflow print, but the race described in the comment means that print callback can arrive after qd.sync(). When that delayed output lands during the first test’s call phase, stdout assertions become flaky (for example, tests that assert no overflow text or strict print ordering). Because _vulkan_debug_warmed_up is set immediately, the fixture will not retry or drain that delayed output, so the flake can be reintroduced in a new form.

Useful? React with 👍 / 👎.

claude

LGTM — small, well-scoped CI flake fix in test conftest only.

Extended reasoning...

Overview

This PR adds a session-scoped warmup to tests/python/conftest.py that runs one throwaway kernel when Vulkan debug mode is requested, to prime the validation layer's debugPrintf callback pipeline and avoid a race in vkQueueWaitIdle() on the first dispatch. The change is gated by req_arch == qd.vulkan and req_options.get('debug', False), and uses a module-level flag to ensure it runs at most once per worker process.

Security risks

None. The change is confined to test infrastructure and only executes during pytest runs against Vulkan with debug enabled. No production code paths, auth, or data handling are affected.

Level of scrutiny

Low. This is a test-only fix in a single file, the implementation is straightforward (a guard flag plus a trivial warmup kernel), and the author validated the fix empirically with 15 consecutive rounds (975 tests) on the T4 CI runner. The bug hunting system reported no issues.

Other factors

The warmup kernel is intentionally simple (i8(64) + i8(64) overflow) and the comment clearly explains the purpose. The use of a module-level global flag instead of a session-scoped fixture is a minor stylistic choice but works correctly given xdist's per-worker process model.

github-actions · 2026-04-26T19:56:51Z

Coverage Report (`13c8f78af`)

Metric	Value
Diff coverage (changed lines only)	22%
Overall project coverage	73%

Total: 9 lines, 7 missing, 22% covered

🔴 tests/python/conftest.py (22%)

🟢   56  _vulkan_debug_warmed_up = False
     57  
     58  
     61      global _vulkan_debug_warmed_up
🟢   80          if req_arch == qd.vulkan and req_options.get("debug", False) and not _vulkan_debug_warmed_up:
     81              # Prime the Vulkan debugPrintf callback pipeline. The validation
     82              # layer has a race where vkQueueWaitIdle() returns before the
     83              # callback fires on the first dispatch. One throwaway kernel
     84              # ensures subsequent dispatches deliver callbacks reliably.
🔴   85              @qd.kernel
🔴   86              def _warmup() -> qd.i8:
🔴   87                  return qd.i8(64) + qd.i8(64)
     88  
🔴   89              _warmup()
🔴   90              qd.sync()
🔴   91              sys.stdout.flush()
🔴   92              _vulkan_debug_warmed_up = True

hughperkins · 2026-04-26T20:08:18Z

@claude review

The validation layer's debugPrintf callback has a race condition on the first kernel dispatch in a fresh process: vkQueueWaitIdle() can return before the callback fires, causing empty captured stdout. This only manifests under pytest-xdist with multiple workers contending for the GPU. Add a session-scoped fixture that runs one throwaway overflow kernel at worker startup to prime the callback pipeline before real tests run. Validated with 15 consecutive rounds (975 tests) passing on T4 CI runner, where previously ~60% of rounds failed.

hughperkins · 2026-04-26T20:39:37Z

@claude review

claude

Addressed both of my prior concerns — arch check now uses test_utils.expected_archs() and try/except/finally wraps the full warmup with qd.reset() in finally. Test-only change, looks good.

Extended reasoning...

Overview

PR adds a single session-scoped autouse fixture _vulkan_debug_warmup to tests/python/conftest.py that primes the Vulkan debugPrintf callback pipeline once per worker process. This is purely test infrastructure to address a CI flake under pytest-xdist. The change has been iterated based on review feedback from both the codex bot and my own prior comments.

Security risks

None. The change only modifies pytest test setup; no production code paths, auth, crypto, network, or data handling are touched.

Level of scrutiny

Low. This is CI/test-only code with no production impact. The author has empirically validated the fix (10 consecutive rounds passing on T4 CI runner where ~60% previously failed). Failure modes are fail-soft — exceptions during warmup are swallowed and qd.reset() always runs in finally.

Other factors

Both of my prior review comments were addressed in this revision:

The substring-based arch check was replaced with qd.vulkan not in test_utils.expected_archs(), which correctly handles the ^ exclusion prefix from run_tests.py --exclusive.
The try/except now wraps the full warmup (kernel decoration, dispatch, sync) and qd.reset() is in a finally block, preventing a session-scoped fixture exception from erroring every test on the worker.

The codex bot's P1 concern about the warmup running within a per-test capture window was also addressed by making the fixture scope="session" rather than running inside wanted_arch. Coverage is low (22%) but expected — these lines only execute on Vulkan-capable runners with debug=True.

Made-with: Cursor

claude

Both issues from my prior review (broken QD_WANTED_ARCHS parsing and unguarded post-init dispatch) have been addressed cleanly.

Extended reasoning...

Overview

PR #571 adds a session-scoped autouse pytest fixture in tests/python/conftest.py that primes the Vulkan debugPrintf callback pipeline once per worker process. Scope is limited to test infrastructure — no production code paths touched.

Security risks

None. The change runs only in pytest sessions, gated to Vulkan-enabled test runs, and performs a constant-arithmetic kernel dispatch before reset.

Level of scrutiny

Low — this is test-only infrastructure addressing a documented CI flake. The fixture self-isolates via qd.reset() in a guarded finally block, and expected_archs() is the same canonical resolver used elsewhere in the test suite.

Other factors

Both concerns from my prior review have been addressed in commit 13c8f78: (1) the arch detection now correctly delegates to test_utils.expected_archs(), properly handling the ^ exclusion prefix; (2) the try/except now covers the full init/dispatch/sync block with qd.reset() in a guarded finally, preserving fail-soft intent. The remaining P1 from the codex bot (delayed callback bleeding into the first test) is a theoretical concern but the author has empirically validated 10 consecutive rounds passing on the T4 CI runner where ~60% previously failed, which is strong evidence the warmup fully drains before tests start.

github-actions · 2026-04-26T22:24:33Z

Coverage Report (`d27666434`)

Metric	Value
Diff coverage (changed lines only)	26%
Overall project coverage	73%

Total: 19 lines, 14 missing, 26% covered

🔴 tests/python/conftest.py (26%)

🟢   56  @pytest.fixture(scope="session", autouse=True)
🟢   57  def _vulkan_debug_warmup():
     58      """Prime the Vulkan debugPrintf callback pipeline once per worker process.
     59  
     60      The validation layer's debugPrintf callback delivery has a race condition on the first kernel dispatch in a new
     61      process: vkQueueWaitIdle() can return before the callback fires. A full init/dispatch/sync/reset cycle here warms
     62      the driver-level debug infrastructure so subsequent inits work reliably.
     63      """
🟢   64      from tests import test_utils
     65  
🟢   66      if qd.vulkan not in test_utils.expected_archs():
🟢   67          return
     68  
🔴   69      try:
🔴   70          qd.init(arch=qd.vulkan, debug=True, enable_fallback=False, print_full_traceback=True)
     71  
🔴   72          @qd.kernel
🔴   73          def _warmup() -> qd.i8:
🔴   74              return qd.i8(64) + qd.i8(64)
     75  
🔴   76          _warmup()
🔴   77          qd.sync()
🔴   78          sys.stdout.flush()
🔴   79      except Exception:
🔴   80          pass
     81      finally:
🔴   82          try:
🔴   83              qd.reset()
🔴   84          except Exception:
🔴   85              pass
     86  
     87

The debugPrintf warmup fixture hangs on Mac CI runners because MoltenVK does not support the validation-layer debugPrintf pipeline. The warmup is only needed on Linux Vulkan drivers. Made-with: Cursor

claude

Prior feedback addressed (proper expected_archs() usage, full fail-soft try/finally with reset, macOS skip).

Extended reasoning...

Overview

Single-file change to tests/python/conftest.py adding a session-scoped autouse pytest fixture that runs one throwaway Vulkan kernel at worker startup to prime the validation layer's debugPrintf callback pipeline. The PR targets a CI flake observed under pytest-xdist and is purely test-infrastructure code — no production paths touched.

Security risks

None. The fixture only initializes the Vulkan backend with debug=True, dispatches a constant-arithmetic kernel, syncs, and resets. No external input, no I/O beyond stdout flush, no privilege boundaries.

Level of scrutiny

Low. This is a CI-only test fixture. The fixture is opt-in by environment (only runs when Vulkan is in expected_archs() and not on macOS) and fully fail-soft (entire body wrapped in try/except with qd.reset() in finally). Worst case if anything goes wrong is the warmup is skipped and the original flake recurs — same state as before this PR.

Other factors

My two prior review comments on this PR are both resolved:

The naive substring check on QD_WANTED_ARCHS was replaced with the canonical qd.vulkan not in test_utils.expected_archs(), which correctly handles the ^ exclusion-prefix grammar.
The try/except scope was extended to cover the kernel decoration, dispatch, and sync, with qd.reset() placed in a finally block — preserving the author's fail-soft intent across the full warmup path.

The codex-bot's concern about delayed callbacks landing in per-test capture windows is mitigated by the session-scoped placement (runs once before any test) plus the explicit qd.sync() + sys.stdout.flush(). The author validated 10 consecutive rounds passing on the T4 runner. The only remaining risk surface is whether the fixture sufficiently primes the pipeline on all Vulkan stacks, which is empirically verifiable and self-healing if it doesn't (qd.reset() is called in finally).

github-actions · 2026-04-27T04:55:48Z

Coverage Report (`d676d5cef`)

Metric	Value
Diff coverage (changed lines only)	29%
Overall project coverage	73%

Total: 21 lines, 15 missing, 29% covered

🔴 tests/python/conftest.py (29%)

🟢   56  @pytest.fixture(scope="session", autouse=True)
🟢   57  def _vulkan_debug_warmup():
     58      """Prime the Vulkan debugPrintf callback pipeline once per worker process.
     59  
     60      The validation layer's debugPrintf callback delivery has a race condition on the first kernel dispatch in a new
     61      process: vkQueueWaitIdle() can return before the callback fires. A full init/dispatch/sync/reset cycle here warms
     62      the driver-level debug infrastructure so subsequent inits work reliably.
     63      """
🟢   64      if sys.platform == "darwin":
🔴   65          return
     66  
🟢   67      from tests import test_utils
     68  
🟢   69      if qd.vulkan not in test_utils.expected_archs():
🟢   70          return
     71  
🔴   72      try:
🔴   73          qd.init(arch=qd.vulkan, debug=True, enable_fallback=False, print_full_traceback=True)
     74  
🔴   75          @qd.kernel
🔴   76          def _warmup() -> qd.i8:
🔴   77              return qd.i8(64) + qd.i8(64)
     78  
🔴   79          _warmup()
🔴   80          qd.sync()
🔴   81          sys.stdout.flush()
🔴   82      except Exception:
🔴   83          pass
     84      finally:
🔴   85          try:
🔴   86              qd.reset()
🔴   87          except Exception:
🔴   88              pass
     89  
     90

claude

Both my prior concerns (arch-grammar parsing and exception scope) are addressed; small test-only conftest change, LGTM.

Extended reasoning...

Overview

The PR adds a single session-scoped autouse pytest fixture in tests/python/conftest.py that primes the Vulkan validation layer's debugPrintf callback once per worker before real tests run, working around a first-dispatch race in vkQueueWaitIdle(). Latest revisions also skip the warmup on macOS (where Vulkan tests aren't run) and use test_utils.expected_archs() to decide when to run.

Security risks

None. The change is confined to test infrastructure (CI), runs only on the test workers, and does not touch auth, crypto, IO boundaries, or any production code paths.

Level of scrutiny

Low. This is a CI-only fix in test bootstrap code, the modification is localized to one file, and the author has provided empirical validation (10 consecutive rounds passing on T4 vs ~60% prior failure rate). Risk of regression in the actual product is essentially zero.

Other factors

My two earlier review comments (^-prefix exclusion grammar in QD_WANTED_ARCHS and the unguarded post-init dispatch) have both been resolved in the current revision: the fixture now delegates to expected_archs() for the inclusion check and wraps the entire init+dispatch+sync block in try/except with qd.reset() in a finally clause. The remaining codex-bot concern about delayed callbacks bleeding into the first real test is mitigated by the session-scoped placement (it runs before any test, not in the per-test capture window) and the explicit sys.stdout.flush(). The author has been responsive to feedback across multiple revisions.

claude

Both concerns from my prior review have been addressed (proper QD_WANTED_ARCHS parsing via expected_archs(), and try/except now wrapping the full init+dispatch with qd.reset() in a finally). Test-only change, fail-soft, scope is small.

Extended reasoning...

Overview

The PR adds a session-scoped autouse pytest fixture in tests/python/conftest.py that primes the Vulkan debugPrintf callback pipeline once per worker process to fix a CI flake under pytest-xdist. Single file, ~30 lines, test infrastructure only.

Security risks

None. This is test infrastructure that runs only during pytest setup. No production code paths, no user input handling, no auth/crypto/permissions surface.

Level of scrutiny

Low scrutiny is appropriate. This is pytest conftest plumbing for CI flake mitigation, gated by an arch check and platform check, and it fail-softs on every error path. The blast radius is limited to test session setup.

Other factors

My prior review (claude[bot] at 2026-04-26T20:16:12Z) raised two issues which the author addressed in the latest revisions:

The naive substring check on QD_WANTED_ARCHS has been replaced with qd.vulkan not in test_utils.expected_archs(), which is the canonical parser and correctly handles the ^ exclusion prefix.
The try/except previously only wrapped qd.init(); it now wraps the full init+kernel+_warmup+sync block, with qd.reset() in a finally clause guarded by its own try/except. This matches the fail-soft intent.

A subsequent commit (32cb6b1) also added an early return when sys.platform == 'darwin' to fix a separate CI hang on macOS.

The codex bot raised a P1 concern about delayed callback output bleeding into the first test's capture window, but the warmup includes qd.sync() followed by qd.reset() (which tears down the Vulkan context entirely) before any test runs, so any pending callback should be drained or discarded. The premise of the PR is precisely that this warmup pass forces the callback pipeline to settle, after which subsequent dispatches are reliable.

github-actions · 2026-04-27T08:16:34Z

Coverage Report (`58bfd049a`)

Metric	Value
Diff coverage (changed lines only)	0%
Overall project coverage	73%

Total: 0 lines, 0 missing, 0% covered

chatgpt-codex-connector Bot reviewed Apr 26, 2026

View reviewed changes

claude Bot reviewed Apr 26, 2026

View reviewed changes

hughperkins force-pushed the turtle/vulkan-debug-warmup branch from 349d291 to 2a3f431 Compare April 26, 2026 20:03

claude Bot reviewed Apr 26, 2026

View reviewed changes

Comment thread tests/python/conftest.py Outdated

Comment thread tests/python/conftest.py Outdated

hughperkins force-pushed the turtle/vulkan-debug-warmup branch from 2a3f431 to 3215b46 Compare April 26, 2026 20:39

claude Bot reviewed Apr 26, 2026

View reviewed changes

Reflow docstring to 120-char wrap width

f77d601

Made-with: Cursor

claude Bot reviewed Apr 26, 2026

View reviewed changes

Skip Vulkan debug warmup on macOS to fix CI hang

32cb6b1

The debugPrintf warmup fixture hangs on Mac CI runners because MoltenVK does not support the validation-layer debugPrintf pipeline. The warmup is only needed on Linux Vulkan drivers. Made-with: Cursor

claude Bot reviewed Apr 27, 2026

View reviewed changes

Merge branch 'main' into turtle/vulkan-debug-warmup

7ade7e2

claude Bot reviewed Apr 27, 2026

View reviewed changes

Merge branch 'main' into turtle/vulkan-debug-warmup

dbc83ed

hughperkins merged commit d8ca127 into main Apr 27, 2026
11 checks passed

hughperkins deleted the turtle/vulkan-debug-warmup branch April 27, 2026 07:06

claude Bot reviewed Apr 27, 2026

View reviewed changes

Conversation

hughperkins commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Brief Summary

Walkthrough

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 26, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Overview

Security risks

Level of scrutiny

Other factors

Uh oh!

github-actions Bot commented Apr 26, 2026

Coverage Report (13c8f78af)

Uh oh!

hughperkins commented Apr 26, 2026

Uh oh!

Uh oh!

Uh oh!

hughperkins commented Apr 26, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Overview

Security risks

Level of scrutiny

Other factors

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Overview

Security risks

Level of scrutiny

Other factors

Uh oh!

github-actions Bot commented Apr 26, 2026

Coverage Report (d27666434)

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Overview

Security risks

Level of scrutiny

Other factors

Uh oh!

github-actions Bot commented Apr 27, 2026

Coverage Report (d676d5cef)

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Overview

Security risks

Level of scrutiny

Other factors

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Overview

Security risks

Level of scrutiny

Other factors

Uh oh!

github-actions Bot commented Apr 27, 2026

Coverage Report (58bfd049a)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

hughperkins commented Apr 26, 2026 •

edited

Loading

Coverage Report (`13c8f78af`)

Coverage Report (`d27666434`)

Coverage Report (`d676d5cef`)

Coverage Report (`58bfd049a`)