test(cufile): prime libcufile to avoid SIGFPE from 1.17.1 state-poisoning bug #1925
Merged
cpcloud merged 2 commits into NVIDIA:main on Apr 17, 2026
Under pytest-randomly, the cuFile test module fatally crashes with SIGFPE in
CUFileDrv::ReadVersionInfo (unguarded div %rcx with rcx=0) inside
libcufile.so cuFileDriverOpen+0xe. The crash is deterministic given specific
test orderings and was reproducible with seed 2758108007.
Root cause is a libcufile 1.17.1 bug. Calling cuFileSetParameterSizeT (or
other pre-open configuration APIs) BEFORE the first cuFileDriverOpen leaves
an internal version list uninitialized; the next driver_open then divides by
its zero length. Minimal repro:
    pytest tests/test_cufile.py::test_set_get_parameter_size_t \
        tests/test_cufile.py::test_buf_register_invalid_flags
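To make the ordering dependence concrete, here is a toy Python model of the behavior described above. It is only a sketch: `ToyCuFile`, its `_initialized` flag, and the exact failure mechanism are hypothetical stand-ins, since libcufile's internals are not visible here, and Python raises `ZeroDivisionError` where the real library dies with SIGFPE.

```python
class ToyCuFile:
    """Toy stand-in for libcufile's process-global state (hypothetical)."""

    def __init__(self):
        self._versions = []        # models the "internal version list"
        self._initialized = False  # assumed flag; the real mechanism is unknown
        self.is_open = False

    def set_parameter(self, name, value):
        # Parameter-set calls are only accepted while the driver is closed.
        if self.is_open:
            raise RuntimeError("DRIVER_ALREADY_OPEN (5026)")
        # Modeled bug: the pre-open path marks the state as initialized
        # without ever filling the version list.
        self._initialized = True

    def driver_open(self):
        if not self._initialized:
            self._versions = ["1.17.1"]  # a clean first open fills the list
            self._initialized = True
        # Unguarded division by the list length, analogous to `div %rcx`.
        _ = 1 // len(self._versions)
        self.is_open = True

    def driver_close(self):
        self.is_open = False


def param_then_open(prewarm):
    """Set a parameter, then open the driver; optionally prewarm first."""
    lib = ToyCuFile()
    if prewarm:
        # One open/close cycle before anything else touches the library.
        lib.driver_open()
        lib.driver_close()
    lib.set_parameter("example_param", 1)
    try:
        lib.driver_open()
    except ZeroDivisionError:
        return "crash"
    return "ok"
```

In this model, `param_then_open(prewarm=False)` hits the bad ordering, while `param_then_open(prewarm=True)` mirrors what a single initial open/close cycle achieves.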
Fix: add a module-scope autouse _cufile_driver_prewarm fixture that performs
one driver_open/driver_close before any test in the module runs. That single
cycle initializes libcufile's version list; both test regimes
(driver-open tests via the function-scope `driver` fixture, and
driver-closed parameter-set tests) then work under any ordering.
Also swap test_set_parameter_posix_pool_slab_array's inline driver_open/close
for the `driver` fixture. pytest fixture ordering guarantees driver_config
(which calls set_parameter_posix_pool_slab_array while closed) runs before
`driver` opens, matching the previous manual ordering.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The _cufile_driver_prewarm fixture has no teardown: the open/close cycle is setup-only. Keeping a trailing `yield` made ruff's PT022 (pytest-useless-yield-fixture) flag it. Drop the yield so the fixture runs as pure setup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
rwgk (Contributor) approved these changes on Apr 16, 2026 and left a comment:
Wow.
LGTM as-is, but could we put in some sort of reminder, to remove or version-gate the workaround later? (Do we have to be careful about adding ctx back to test_set_parameter_posix_pool_slab_array in that case?)
Contributor
It looks like the "claude" co-author triggers our Needs-Restricted-Paths-Review label.
rwgk approved these changes on Apr 16, 2026
Contributor, Author
What?
Contributor
Sorry, wrong PR. I'll be more careful not to mix up Chrome tabs.
Contributor
I assume this bug is known upstream?
Contributor
@sourabgupta3 for visibility
mdboom pushed a commit to mdboom/cuda-python that referenced this pull request on Apr 20, 2026
…ning bug (NVIDIA#1925)
* test(cufile): prime libcufile before parameter-set tests to avoid SIGFPE
* test(cufile): drop useless yield from prewarm fixture
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem
Under pytest-randomly, `tests/test_cufile.py` can fatally crash the pytest process with `SIGFPE`: not a caught exception, but an actual floating-point exception signal tearing down the interpreter. The crash is deterministic given specific orderings (reproduced with seed `2758108007`).
Under gdb the faulting instruction is an unguarded `div %rcx` with `%rcx = 0` inside `CUFileDrv::ReadVersionInfo(bool)`, reached from `cuFileDriverOpen+0xe` in `libcufile.so` (libcufile 1.17.1).
Root cause (libcufile 1.17.1 bug)
Calling `cuFileSetParameterSizeT`, or any of the other parameter-set APIs that are documented as usable before `cuFileDriverOpen`, before the first `cuFileDriverOpen` leaves an internal version list uninitialized. The next `cuFileDriverOpen` then divides by the list's zero length.
Minimal repro with the shipped library:
    pytest tests/test_cufile.py::test_set_get_parameter_size_t \
        tests/test_cufile.py::test_buf_register_invalid_flags
Reversing the order passes: the `buf_register` test implicitly opens/closes the driver first, initializing the list. Random ordering will hit the bad order whenever any driver-closed parameter-set test happens to sort before any driver-open test.
This needs an upstream libcufile fix (bounds-check before the division, or initialize the list in the parameter-set path). The two cufile-test regimes cannot be merged around it either: parameter-set tests require the driver CLOSED (libcufile returns `DRIVER_ALREADY_OPEN (5026)` otherwise), while I/O and registration tests require it OPEN.
Fix
Add a module-scope autouse fixture `_cufile_driver_prewarm` that runs one `driver_open`/`driver_close` cycle before any test in `test_cufile.py` executes. That single cycle initializes libcufile's internal version list; afterwards both regimes work under any ordering:
* Driver-open tests (`buf_register_*`, `cufile_read_write`, `batch_io`, stats, etc.) continue to use the function-scope `driver` fixture to open/close per test.
* Driver-closed parameter-set tests (`test_set_get_parameter_*`, `test_set_parameter_posix_*`) run against a properly initialized closed driver.
The per-test `driver_open`/`driver_close` pattern is preserved: not ideal on throughput grounds, but forced by the libcufile API, as noted in the fixture docstring.
Secondary cleanup: `test_set_parameter_posix_pool_slab_array` was the only test with an inline `driver_open` / `try/finally: driver_close` block. Pytest fixture ordering guarantees `driver_config` (which calls `set_parameter_posix_pool_slab_array` while the driver is closed) runs before the `driver` fixture opens it, so the inline pair was replaced by adding `driver` to the test signature. The redundant `@pytest.mark.usefixtures("ctx")` was dropped since `driver` already depends on `ctx`.
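For readers who want to see the shape of the workaround, a prewarm fixture along the lines described in this section might look roughly like the sketch below. It is not the code merged in this PR: the `prewarm_driver` helper is hypothetical, the `from cuda.bindings import cufile` import path is an assumption, and any CUDA context setup the real fixture may need is omitted.

```python
import pytest

try:
    # Assumed import path for the cuFile bindings under test.
    from cuda.bindings import cufile
except ImportError:
    # Keeps the sketch importable on machines without the bindings.
    cufile = None


def prewarm_driver(api):
    """Run one open/close cycle so the library initializes its internal state.

    `api` is any object exposing driver_open() and driver_close();
    this helper is hypothetical and exists only to keep the sketch testable.
    """
    api.driver_open()
    api.driver_close()


@pytest.fixture(scope="module", autouse=True)
def _cufile_driver_prewarm():
    """Prime libcufile once per module before any test runs.

    Setup-only: there is no teardown, so the fixture has no trailing
    `yield` (which ruff's PT022 would flag as useless).
    """
    if cufile is None:
        pytest.skip("cuFile bindings are not available")
    prewarm_driver(cufile)
```

Because the fixture is module-scoped and autouse, pytest runs it before the first test in the module regardless of the order pytest-randomly picks, and the function-scope `driver` fixture can keep opening and closing the driver per test afterwards.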