test(cufile): prime libcufile to avoid SIGFPE from 1.17.1 state-poisoning bug #1925

Merged
cpcloud merged 2 commits into NVIDIA:main from cpcloud:fix/cufile-state-poisoning
Apr 17, 2026

@cpcloud cpcloud commented Apr 16, 2026

Problem

Under pytest-randomly, tests/test_cufile.py can fatally crash the pytest process with SIGFPE: not a caught Python exception, but a signal (raised here by an integer divide-by-zero) that tears down the interpreter. The crash is deterministic given specific orderings (reproduced with seed 2758108007).

Under gdb the faulting instruction is an unguarded div %rcx with %rcx = 0 inside CUFileDrv::ReadVersionInfo(bool), reached from cuFileDriverOpen+0xe in libcufile.so (libcufile 1.17.1).

Root cause (libcufile 1.17.1 bug)

Calling cuFileSetParameterSizeT — or any of the other parameter-set APIs that are documented as usable before cuFileDriverOpen — BEFORE the first cuFileDriverOpen leaves an internal version list uninitialized. The next cuFileDriverOpen then divides by the list's zero length.

Minimal repro with the shipped library:

pytest tests/test_cufile.py::test_set_get_parameter_size_t \
       tests/test_cufile.py::test_buf_register_invalid_flags

Reversing the order passes — the buf_register test implicitly opens/closes the driver first, initializing the list. Random ordering will hit the bad order whenever any driver-closed parameter-set test happens to sort before any driver-open test.
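
The order dependence can be illustrated with a toy model. This is a sketch under stated assumptions, not libcufile's implementation: `ToyCuFile`, its `_versions` list, and the two stand-in tests are hypothetical names that only mimic the behavior described above (a pre-open setter leaving a list empty, and the next open dividing by its length). Python raises a catchable ZeroDivisionError where the real library dies with SIGFPE.

```python
class ToyCuFile:
    """Toy model of the failure mode; NOT libcufile's real internals."""

    def __init__(self):
        self.is_open = False
        self._versions = None  # hypothetical lazily built list

    def set_parameter(self, name, value):
        if self.is_open:
            raise RuntimeError("DRIVER_ALREADY_OPEN (5026)")
        if self._versions is None:
            # Modeled bug: the pre-open path marks the list as set up
            # but never fills it.
            self._versions = []

    def driver_open(self):
        if self._versions is None:
            self._versions = ["v1"]  # normal first-open initialization
        # Unguarded division by the list length, like `div %rcx` with rcx = 0.
        _ = 1 // len(self._versions)
        self.is_open = True

    def driver_close(self):
        self.is_open = False


def param_test(lib):
    # Stand-in for a driver-closed parameter-set test.
    lib.set_parameter("some_size_t_param", 1)


def open_test(lib):
    # Stand-in for a test that opens and closes the driver.
    lib.driver_open()
    lib.driver_close()


def run(order):
    """Run the stand-in tests in the given order against fresh state."""
    lib = ToyCuFile()
    try:
        for test in order:
            test(lib)
    except ZeroDivisionError:
        return "crash"
    return "ok"


bad = run([param_test, open_test])   # setter first, then open
good = run([open_test, param_test])  # open first, then setter
print(bad, good)
```

In this model the only thing that matters is which call touches the list first, which matches the observation that the same two tests pass or crash depending purely on their order.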

This needs an upstream libcufile fix (bounds-check before the division, or initialize the list in the parameter-set path). The two cufile-test regimes cannot be merged around it either — parameter-set tests require the driver CLOSED (libcufile returns DRIVER_ALREADY_OPEN (5026) otherwise), while I/O and registration tests require it OPEN.

Fix

Add a module-scope autouse fixture _cufile_driver_prewarm that runs one driver_open / driver_close cycle before any test in test_cufile.py executes. That single cycle initializes libcufile's internal version list; afterwards both regimes work under any ordering:

  • Driver-open tests (buf_register_*, cufile_read_write, batch_io, stats, etc.) continue to use the function-scope driver fixture to open/close per test.
  • Driver-closed tests (test_set_get_parameter_*, test_set_parameter_posix_*) run against a properly initialized closed driver.

The per-test driver_open/driver_close pattern is preserved — not ideal on throughput grounds, but forced by the libcufile API, as noted in the fixture docstring.

Secondary cleanup: test_set_parameter_posix_pool_slab_array was the only test with an inline driver_open / try/finally: driver_close block. Pytest fixture ordering guarantees driver_config (which calls set_parameter_posix_pool_slab_array while the driver is closed) runs before the driver fixture opens it, so the inline pair was replaced by adding driver to the test signature. The redundant @pytest.mark.usefixtures("ctx") was dropped since driver already depends on ctx.
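
The resulting call sequence can be sketched without pytest or a GPU. The block below is an illustration under assumptions, not the test module's code: `RecordingCuFile` is a hypothetical stand-in that only logs calls, its `set_parameter_posix_pool_slab_array` signature is simplified, the argument values are arbitrary, and plain context managers stand in for the `driver_config` and `driver` fixtures. In the real suite the prewarm step would be a module-scope autouse fixture and the other two would be function-scope fixtures.

```python
from contextlib import contextmanager


class RecordingCuFile:
    """Hypothetical stand-in that logs calls and enforces open/closed rules."""

    def __init__(self):
        self.is_open = False
        self.log = []

    def driver_open(self):
        assert not self.is_open
        self.is_open = True
        self.log.append("driver_open")

    def driver_close(self):
        assert self.is_open
        self.is_open = False
        self.log.append("driver_close")

    def set_parameter_posix_pool_slab_array(self, sizes, counts):
        # Simplified signature for illustration only.
        if self.is_open:
            raise RuntimeError("DRIVER_ALREADY_OPEN (5026)")
        self.log.append("set_parameter_posix_pool_slab_array")


def prewarm(lib):
    # Setup-only, no teardown: one open/close cycle before any test runs.
    lib.driver_open()
    lib.driver_close()


@contextmanager
def driver_config(lib):
    # Configures the library while the driver is still closed.
    lib.set_parameter_posix_pool_slab_array([4096], [8])
    yield


@contextmanager
def driver(lib):
    # Opens the driver for one test and always closes it afterwards.
    lib.driver_open()
    try:
        yield
    finally:
        lib.driver_close()


lib = RecordingCuFile()
prewarm(lib)  # once per module
with driver_config(lib), driver(lib):  # per test: configure, then open
    lib.log.append("test body")
print(lib.log)
```

The log shows the prewarm cycle first, then the parameter-set call on a closed driver, then the per-test open and close around the test body, which is the ordering the fixture-based version relies on.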

Under pytest-randomly, the cuFile test module fatally crashes with SIGFPE in
CUFileDrv::ReadVersionInfo (unguarded div %rcx with rcx=0) inside
libcufile.so cuFileDriverOpen+0xe. The crash is deterministic given specific
test orderings and was reproducible with seed 2758108007.

Root cause is a libcufile 1.17.1 bug. Calling cuFileSetParameterSizeT (or
other pre-open configuration APIs) BEFORE the first cuFileDriverOpen leaves
an internal version list uninitialized; the next driver_open then divides by
its zero length. Minimal repro:

    pytest tests/test_cufile.py::test_set_get_parameter_size_t \
           tests/test_cufile.py::test_buf_register_invalid_flags

Fix: add a module-scope autouse _cufile_driver_prewarm fixture that performs
one driver_open/driver_close before any test in the module runs. That single
cycle initializes libcufile's version list; both test regimes
(driver-open tests via the function-scope `driver` fixture, and
driver-closed parameter-set tests) then work under any ordering.

Also swap test_set_parameter_posix_pool_slab_array's inline driver_open/close
for the `driver` fixture. pytest fixture ordering guarantees driver_config
(which calls set_parameter_posix_pool_slab_array while closed) runs before
`driver` opens, matching the previous manual ordering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cpcloud cpcloud added this to the cuda.bindings next milestone Apr 16, 2026
@cpcloud cpcloud added the bug (Something isn't working), test (Improvements or additions to tests), and cuda.bindings (Everything related to the cuda.bindings module) labels Apr 16, 2026
@cpcloud cpcloud self-assigned this Apr 16, 2026
@github-actions github-actions Bot added the Needs-Restricted-Paths-Review label (PR touches cuda_bindings or cuda_python; only NVIDIA employees may modify these paths; see LICENSEs) Apr 16, 2026

The _cufile_driver_prewarm fixture has no teardown — the open/close cycle
is setup-only. Keeping a trailing `yield` made ruff's PT022
(pytest-useless-yield-fixture) flag it. Drop the yield so the fixture runs
as pure setup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@rwgk rwgk left a comment

Wow.

LGTM as-is, but could we put in some sort of reminder, to remove or version-gate the workaround later? (Do we have to be careful about adding ctx back to test_set_parameter_posix_pool_slab_array in that case?)

rwgk commented Apr 16, 2026

It looks like the "claude" co-author triggers our Needs-Restricted-Paths-Review label.

@rwgk rwgk left a comment

Sorry I got the wrong PR (this was meant for 1908)

I think this is fine to merge as-is and work on the "detect-changes can under-classify cross-package renames" finding in a follow-on PR. I'm fine either way.

cpcloud commented Apr 16, 2026

What? detect-changes has nothing to do with the changes here.

rwgk commented Apr 16, 2026

> What? detect-changes has nothing to do with the changes here.

Sorry, wrong PR. I'll be more careful to not mix up chrome tabs.

mdboom commented Apr 16, 2026

I assume this bug is known upstream?

rwgk commented Apr 16, 2026

@sourabgupta3 for visibility

@leofang leofang removed the Needs-Restricted-Paths-Review label (PR touches cuda_bindings or cuda_python; only NVIDIA employees may modify these paths; see LICENSEs) Apr 16, 2026
@cpcloud cpcloud merged commit c6cf372 into NVIDIA:main Apr 17, 2026
173 of 176 checks passed
@cpcloud cpcloud deleted the fix/cufile-state-poisoning branch April 17, 2026 14:06
github-actions Bot pushed a commit that referenced this pull request Apr 18, 2026
Removed preview folders for the following PRs:
- PR #1593
- PR #1908
- PR #1923
- PR #1925
- PR #1933
- PR #1934
- PR #1938
- PR #1939
mdboom pushed a commit to mdboom/cuda-python that referenced this pull request Apr 20, 2026
…ning bug (NVIDIA#1925)

* test(cufile): prime libcufile before parameter-set tests to avoid SIGFPE

* test(cufile): drop useless yield from prewarm fixture
