Fix cuda_bindings conjugate_gradient_multi_block_cg.py example by rwgk · Pull Request #1922 · NVIDIA/cuda-python

rwgk · 2026-04-16T00:28:22Z

This PR started out as a humble attempt to eliminate the only remaining pytest.skip() in our examples (similar to #1861), but it turned into a bug-fix pass once the example actually began running.

- Enable cuda_bindings/examples/4_CUDA_Libraries/conjugate_gradient_multi_block_cg.py by removing its unconditional NVRTC waiver (which dates back to the initial cuda-python commit 4a36f83) and replacing standalone pytest.skip() calls with requirement_not_met().
- Simplify the platform gating in that example.
- Fix the Python-side/runtime issues that had been hidden behind the waiver:
  - invalid C-style %d formatting inside an f-string
  - gen_tridiag() variable shadowing that broke CSR construction
  - cooperative kernel launch not passing the computed dynamic shared memory size
  - managed-memory pointer variables being overwritten by loop indices before kernel launch and cleanup
  - residual access after freeing dot_result, which caused a teardown segfault
- Drive-by: fix the separate QNX gate in cuda_bindings/examples/3_CUDA_Features/global_to_shmem_async_copy.py by checking platform.system() == "QNX" instead of platform.machine() == "qnx". QNX is an operating system rather than a machine architecture, so the platform.machine() check does not detect QNX hosts.
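For readers unfamiliar with the first two bug classes in the list above, here is a minimal sketch in plain Python. The function names and the exact expressions are hypothetical and are not taken from the example file; the sketch only illustrates how a printf-style format spec inside an f-string fails at runtime and how a loop variable that reuses an array's name leaves the array unpopulated. It assumes a tridiagonal matrix with n >= 2 rows.

```python
def format_iteration_buggy(k):
    # "%d" is printf syntax, not a valid format spec for an int,
    # so evaluating this f-string raises ValueError at runtime.
    return f"iteration = {k:%d}"


def format_iteration_fixed(k):
    # The format-spec mini-language uses a bare "d" for decimal integers.
    return f"iteration = {k:d}"


def row_offsets_shadowed(n):
    # Hypothetical illustration of shadowing: the loop variable reuses the
    # name of the list, so the list is never filled in.
    offsets = [0] * (n + 1)
    for offsets in range(1, n + 1):  # BUG: rebinds `offsets` to an int
        pass
    return offsets  # an int, not the row-offset list


def row_offsets(n):
    # CSR row offsets for an n x n tridiagonal matrix (assumes n >= 2):
    # the first and last rows hold 2 nonzeros, every other row holds 3.
    offsets = [0] * (n + 1)
    for row in range(n):
        nnz_in_row = 2 if row in (0, n - 1) else 3
        offsets[row + 1] = offsets[row] + nnz_in_row
    return offsets
```

The managed-memory pointer bug in the list belongs to the same family: once a loop index rebinds a name, the original object is no longer reachable through that name.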

Remove the unconditional NVRTC waiver from the renamed example so CI can exercise its real execution path again. While re-enabling it, replace the standalone pytest.skip() checks with requirement_not_met() and simplify the platform gating.

The waived code path had been hiding several Python-side runtime bugs that required these fixes:
- replaced the invalid C-style %d f-string formatting
- fixed gen_tridiag() variable shadowing so the CSR row-offset array is actually populated
- passed the computed dynamic shared-memory size into cuLaunchCooperativeKernel() and made that size integer-valued
- stopped overwriting managed-memory pointer variables with loop indices before kernel launch and cleanup
- cached the residual before freeing dot_result, which removed the teardown segfault

Made-with: Cursor
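The note about making the shared-memory size integer-valued can be illustrated with a small sketch. The formula below (one double per warp plus one extra slot) and the function names are assumptions for illustration; the thread does not show the expression the example actually uses. The point is only that Python's `/` always yields a float, whereas a byte count passed to a kernel launch should be an int, which `//` provides.

```python
import ctypes


def dynamic_smem_bytes_float(threads_per_block, warp_size=32):
    # True division returns a float even when the result is a whole number.
    return ctypes.sizeof(ctypes.c_double) * (threads_per_block / warp_size + 1)


def dynamic_smem_bytes(threads_per_block, warp_size=32):
    # Floor division keeps the byte count an int.
    return ctypes.sizeof(ctypes.c_double) * (threads_per_block // warp_size + 1)
```

With 512 threads per block, the first function returns the float 136.0 and the second returns the int 136.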

QNX is an operating system rather than a machine architecture, so checking platform.machine() can miss the requirement_not_met() path on QNX hosts. Use platform.system() so the example is waived consistently on that platform.

Made-with: Cursor
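A minimal sketch of the corrected gate follows. The helper name `is_qnx` and its optional `system` parameter are hypothetical and exist only to make the check testable on a non-QNX host; the comparison against "QNX" is the one quoted in the PR description.

```python
import platform


def is_qnx(system=None):
    # platform.system() reports the operating system name, while
    # platform.machine() reports the hardware architecture (for example
    # "x86_64" or "aarch64"), so only the former can identify QNX.
    name = platform.system() if system is None else system
    return name == "QNX"
```

A caller would branch on `is_qnx()` to take the waiver path before doing any CUDA work.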

copy-pr-bot · 2026-04-16T00:28:26Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

rwgk · 2026-04-16T00:28:57Z

/ok to test

rwgk · 2026-04-16T17:16:20Z

Cursor-generated analysis of CI logs for commit be43e41:

https://github.com/NVIDIA/cuda-python/actions/runs/24485442949?pr=1922 (2 reruns because of infrastructure flakiness)

I unpacked and inspected the combined CI logs locally under /wrk/logs_24485442949_combined.

For completeness, these are the log archives the combined directory came from:

+ gh api "/repos/NVIDIA/cuda-python/actions/runs/24485442949/attempts/1/logs" > "logs_24485442949_attempt1.zip"
Saved: logs_24485442949_attempt1.zip
+ gh api "/repos/NVIDIA/cuda-python/actions/runs/24485442949/attempts/2/logs" > "logs_24485442949_attempt2.zip"
Saved: logs_24485442949_attempt2.zip
+ gh api "/repos/NVIDIA/cuda-python/actions/runs/24485442949/attempts/3/logs" > "logs_24485442949_attempt3.zip"
Saved: logs_24485442949_attempt3.zip

I then scanned all Test_*.txt files under /wrk/logs_24485442949_combined for
conjugate_gradient_multi_block_cg.
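A scan of this kind could be scripted roughly as below. This is a sketch under stated assumptions, not the procedure actually used: the function name is hypothetical, and it assumes each log contains at least one line that mentions the example name together with a pytest outcome word, counting only the first such line per file.

```python
import re
from collections import Counter
from pathlib import Path

OUTCOME = re.compile(r"\b(PASSED|SKIPPED|FAILED|ERROR)\b")


def tally_example_outcomes(log_dir, example="conjugate_gradient_multi_block_cg"):
    # Count one outcome per Test_*.txt file that mentions the example.
    tally = Counter()
    for log in sorted(Path(log_dir).glob("Test_*.txt")):
        for line in log.read_text(errors="replace").splitlines():
            if example not in line:
                continue
            match = OUTCOME.search(line)
            if match:
                tally[match.group(1)] += 1
                break  # first matching line decides the outcome for this file
    return tally
```

Calling `tally_example_outcomes("/wrk/logs_24485442949_combined")` would return a Counter keyed by outcome word, provided the logs follow the assumed line format.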

Summary

25 PASSED
16 SKIPPED
0 FAILED / ERROR

Skip reason

All 16 skips appear to be consistent with the same grouped pytest summary reason:

tests/test_examples.py:33: pathfinder.find_nvidia_header_directory("cudart") returned None

I did not see any skips for this example caused by its own runtime gates
(Unified Memory, Cooperative Kernel Launch, platform gating, etc.).

At a high level, this looks as expected:

- every local job I found for this example passed
- every wheels job I found for this example skipped
- I did not find any failures or errors for this example

Passed in

Test_linux-64___py3.13__13.2.0__local__rtxpro6000.txt
Test_win-64___py3.10__13.0.2__local__rtxpro6000__TCC_.txt
Test_linux-aarch64___py3.14__13.0.2__local__a100.txt
Test_linux-aarch64___py3.13__13.0.2__local__a100.txt
Test_linux-aarch64___py3.14t__13.2.0__local__a100.txt
Test_linux-aarch64___py3.11__13.2.0__local__a100.txt
Test_linux-aarch64___py3.14__13.2.0__local__a100.txt
Test_linux-aarch64___py3.11__13.0.2__local__a100.txt
Test_linux-aarch64___py3.13__13.2.0__local__a100.txt
Test_win-64___py3.12__13.2.0__local__a100__TCC_.txt
Test_linux-64___py3.11__13.0.2__local__l4.txt
Test_linux-64___py3.14__13.2.0__local__l4.txt
Test_linux-64___py3.11__13.2.0__local__l4.txt
Test_linux-64___py3.13__13.2.0__local__h100.txt
Test_win-64___py3.14__13.2.0__local__l4__MCDM_.txt
Test_linux-64___py3.13__13.0.2__local__rtxpro6000.txt
Test_linux-64___py3.14t__13.2.0__local__l4.txt
Test_linux-64___py3.14t__13.2.0__local__h100_x2_.txt
Test_linux-64___py3.14t__13.0.2__local__l4.txt
Test_linux-64___py3.14__13.0.2__local__l4.txt
Test_linux-64___py3.14__13.2.0__local__t4_x2_.txt
Test_linux-64___py3.13__13.0.2__local__h100.txt
Test_win-64___py3.12__13.0.2__local__a100__TCC_.txt
Test_win-64___py3.14__13.0.2__local__l4__MCDM_.txt
Test_win-64___py3.10__13.2.0__local__rtxpro6000__TCC_.txt

Skipped in

Test_linux-aarch64___py3.12__13.2.0__wheels__a100.txt
Test_linux-aarch64___py3.10__13.0.2__wheels__a100.txt
Test_linux-aarch64___py3.12__13.0.2__wheels__a100.txt
Test_linux-aarch64___py3.14t__13.0.2__wheels__a100.txt
Test_win-64___py3.13__13.0.2__wheels__rtxpro6000__MCDM_.txt
Test_win-64___py3.13__13.2.0__wheels__rtxpro6000__MCDM_.txt
Test_win-64___py3.14t__13.2.0__wheels__a100__MCDM_.txt
Test_linux-64___py3.10__13.2.0__wheels__l4.txt
Test_linux-64___py3.10__13.0.2__wheels__l4.txt
Test_linux-64___py3.12__13.0.2__wheels__l4.txt
Test_linux-64___py3.12__13.2.0__wheels__l4.txt
Test_linux-64___py3.12__13.2.0__wheels__rtx4090__wsl.txt
Test_linux-aarch64___py3.10__13.2.0__wheels__a100.txt
Test_win-64___py3.11__13.0.2__wheels__rtx4090__WDDM_.txt
Test_win-64___py3.14t__13.0.2__wheels__a100__MCDM_.txt
Test_win-64___py3.11__13.2.0__wheels__rtx4090__WDDM_.txt

mdboom

LGTM.

The pointer arithmetic in this example should probably be removed someday, but that was pre-existing.

github-actions · 2026-04-16T19:05:31Z

Doc Preview CI
Preview removed because the pull request was closed or merged.

…A#1922)

* fix: re-enable and fix the conjugate gradient multi-block example

Remove the unconditional NVRTC waiver from the renamed example so CI can exercise its real execution path again. While re-enabling it, replace the standalone pytest.skip() checks with requirement_not_met() and simplify the platform gating.

The waived code path had been hiding several Python-side runtime bugs that required these fixes:
- replaced the invalid C-style %d f-string formatting
- fixed gen_tridiag() variable shadowing so the CSR row-offset array is actually populated
- passed the computed dynamic shared-memory size into cuLaunchCooperativeKernel() and made that size integer-valued
- stopped overwriting managed-memory pointer variables with loop indices before kernel launch and cleanup
- cached the residual before freeing dot_result, which removed the teardown segfault

Made-with: Cursor

* fix: check the QNX example gate via platform.system()

QNX is an operating system rather than a machine architecture, so checking platform.machine() can miss the requirement_not_met() path on QNX hosts. Use platform.system() so the example is waived consistently on that platform.

Made-with: Cursor

rwgk added 2 commits April 15, 2026 16:54

rwgk added this to the cuda.bindings next milestone Apr 16, 2026

rwgk self-assigned this Apr 16, 2026

rwgk added the labels bug (Something isn't working), P1 (Medium priority - Should do), and cuda.bindings (Everything related to the cuda.bindings module) Apr 16, 2026

rwgk requested a review from mdboom April 16, 2026 17:16

rwgk marked this pull request as ready for review April 16, 2026 17:17

mdboom approved these changes Apr 16, 2026

leofang merged commit d818a75 into NVIDIA:main Apr 16, 2026
252 of 273 checks passed

leofang merged 2 commits into NVIDIA:main from rwgk:fix_conjugate_gradient_multi_block_cg
