Fix cuda_bindings conjugate_gradient_multi_block_cg.py example #1922

Merged
leofang merged 2 commits into NVIDIA:main from rwgk:fix_conjugate_gradient_multi_block_cg on Apr 16, 2026

Conversation

@rwgk
Contributor

@rwgk rwgk commented Apr 16, 2026

This PR started out as a humble attempt to eliminate the only remaining pytest.skip() in our examples (similar to #1861), but it turned into a bug-fix pass once the example actually began running.

  • Enable cuda_bindings/examples/4_CUDA_Libraries/conjugate_gradient_multi_block_cg.py by removing its unconditional NVRTC waiver (which dates all the way back to the initial cuda-python commit 4a36f83) and replacing standalone pytest.skip() calls with requirement_not_met()
  • Simplify the platform gating in that example
  • Fix the Python-side/runtime issues that had been hidden behind the waiver:
    • invalid C-style %d formatting inside an f-string
    • gen_tridiag() variable shadowing that broke CSR construction
    • cooperative kernel launch not passing the computed dynamic shared memory size
    • managed-memory pointer variables being overwritten by loop indices before kernel launch and cleanup
    • residual access after freeing dot_result, which caused a teardown segfault
  • Drive-by: fix the separate QNX gate in cuda_bindings/examples/3_CUDA_Features/global_to_shmem_async_copy.py by checking platform.system() == "QNX" instead of platform.machine() == "qnx". QNX is an operating system rather than a machine architecture, so the platform.machine() check does not detect QNX hosts.
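
Two of the Python-side issues listed above (C-style formatting inside an f-string, and variable shadowing during CSR construction) can be illustrated with a minimal sketch. This is not the code from the example: the exact original expressions are not shown in this thread, and the names `n_rows`, `row_offsets`, `col_indices`, and `values`, as well as the fixed 2 / -1 matrix entries, are assumptions chosen for illustration.

```python
n_rows = 4  # hypothetical value, for illustration only

# An f-string only substitutes {...} fields; a C-style "%d" stays literal text.
wrong = f"rows = %d"
right = f"rows = {n_rows}"


def gen_tridiag(n):
    """Build CSR arrays for an n x n tridiagonal matrix.

    Illustrative stand-in: 2.0 on the diagonal, -1.0 off the diagonal.
    """
    row_offsets = [0]
    col_indices = []
    values = []
    # The loop variables are named 'row' and 'col' so they cannot shadow
    # (and silently replace) the arrays being filled.
    for row in range(n):
        for col in (row - 1, row, row + 1):
            if 0 <= col < n:
                col_indices.append(col)
                values.append(2.0 if col == row else -1.0)
        # One offset per row: the running count of stored non-zeros.
        row_offsets.append(len(col_indices))
    return row_offsets, col_indices, values


row_offsets, col_indices, values = gen_tridiag(n_rows)
```

The same naming discipline applies to the managed-memory pointers: if a loop index reuses the name of a pointer variable, the pointer is gone by the time the kernel launch or cleanup code needs it.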

rwgk added 2 commits April 15, 2026 16:54
fix: re-enable and fix the conjugate gradient multi-block example

Remove the unconditional NVRTC waiver from the renamed example so CI can exercise its real execution path again. While re-enabling it, replace the standalone pytest.skip() checks with requirement_not_met() and simplify the platform gating.

The waived code path had been hiding several Python-side runtime bugs that required these fixes:
- replaced the invalid C-style %d f-string formatting
- fixed gen_tridiag() variable shadowing so the CSR row-offset array is actually populated
- passed the computed dynamic shared-memory size into cuLaunchCooperativeKernel() and made that size integer-valued
- stopped overwriting managed-memory pointer variables with loop indices before kernel launch and cleanup
- cached the residual before freeing dot_result, which removed the teardown segfault

Made-with: Cursor
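
The shared-memory and residual fixes in the commit message above can be sketched in plain Python. This is a sketch under stated assumptions, not the example's code: `THREADS_PER_BLOCK = 512`, the warp size of 32, the size formula, and the `ManagedScalar` class (a stand-in for a managed-memory allocation) are all hypothetical. No CUDA calls are made.

```python
import ctypes

THREADS_PER_BLOCK = 512  # assumed block size, for illustration only
WARP_SIZE = 32           # assumed warp size


def dynamic_smem_bytes(threads_per_block):
    """Return a shared-memory size in bytes as an int.

    Floor division (//) keeps the result integer-valued; true division (/)
    would produce a float, which is not a valid byte count for a launch.
    """
    return ctypes.sizeof(ctypes.c_double) * (threads_per_block // WARP_SIZE + 1)


class ManagedScalar:
    """Hypothetical stand-in for a managed allocation holding one double."""

    def __init__(self, value):
        self._buf = ctypes.c_double(value)

    def read(self):
        if self._buf is None:
            raise RuntimeError("read after free")
        return self._buf.value

    def free(self):
        self._buf = None


smem_bytes = dynamic_smem_bytes(THREADS_PER_BLOCK)

dot_result = ManagedScalar(1.0e-12)
residual = dot_result.read()  # cache the value while the buffer is still valid
dot_result.free()
# From here on, only the cached 'residual' is used, never dot_result.
```

With a real allocation, a read after the free is undefined behavior rather than a clean exception, which is consistent with the teardown segfault described above.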
fix: check the QNX example gate via platform.system()

QNX is an operating system rather than a machine architecture, so checking platform.machine() can miss the requirement_not_met() path on QNX hosts. Use platform.system() so the example is waived consistently on that platform.

Made-with: Cursor
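
A minimal sketch of the corrected gate, assuming (as the commit message states) that platform.system() reports "QNX" on QNX hosts. The helper name `is_qnx` and its optional argument are hypothetical and exist only so the check can be exercised on any machine.

```python
import platform


def is_qnx(system_name=None):
    """Return True when the operating system name is "QNX".

    platform.system() reports the OS name (for example "Linux" or "Windows"),
    while platform.machine() reports the hardware architecture (for example
    "x86_64" or "aarch64"), so only the former can identify QNX.
    """
    name = platform.system() if system_name is None else system_name
    return name == "QNX"


on_qnx_host = is_qnx()  # evaluates the gate for the current host
```

In the example itself, a true result would lead to the requirement_not_met() path rather than to running the kernel.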
@rwgk rwgk added this to the cuda.bindings next milestone Apr 16, 2026
@rwgk rwgk self-assigned this Apr 16, 2026
@rwgk rwgk added the bug (Something isn't working), P1 (Medium priority - Should do), and cuda.bindings (Everything related to the cuda.bindings module) labels Apr 16, 2026
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Apr 16, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk
Contributor Author

rwgk commented Apr 16, 2026

/ok to test

@rwgk
Contributor Author

rwgk commented Apr 16, 2026

Cursor-generated analysis of CI logs for commit be43e41:


I unpacked and inspected the combined CI logs locally under /wrk/logs_24485442949_combined.

For completeness, these are the log archives the combined directory came from:

+ gh api "/repos/NVIDIA/cuda-python/actions/runs/24485442949/attempts/1/logs" > "logs_24485442949_attempt1.zip"
Saved: logs_24485442949_attempt1.zip
+ gh api "/repos/NVIDIA/cuda-python/actions/runs/24485442949/attempts/2/logs" > "logs_24485442949_attempt2.zip"
Saved: logs_24485442949_attempt2.zip
+ gh api "/repos/NVIDIA/cuda-python/actions/runs/24485442949/attempts/3/logs" > "logs_24485442949_attempt3.zip"
Saved: logs_24485442949_attempt3.zip

I then scanned all Test_*.txt files under /wrk/logs_24485442949_combined for
conjugate_gradient_multi_block_cg.

Summary

  • 25 PASSED
  • 16 SKIPPED
  • 0 FAILED / ERROR

Skip reason

All 16 skips appear to be consistent with the same grouped pytest summary reason:

tests/test_examples.py:33: pathfinder.find_nvidia_header_directory("cudart") returned None

I did not see any skips for this example caused by its own runtime gates
(Unified Memory, Cooperative Kernel Launch, platform gating, etc.).

At a high level, this looks as expected:

  • every local job I found for this example passed
  • every wheels job I found for this example skipped
  • I did not find any failures or errors for this example

Passed in

  • Test_linux-64___py3.13__13.2.0__local__rtxpro6000.txt
  • Test_win-64___py3.10__13.0.2__local__rtxpro6000__TCC_.txt
  • Test_linux-aarch64___py3.14__13.0.2__local__a100.txt
  • Test_linux-aarch64___py3.13__13.0.2__local__a100.txt
  • Test_linux-aarch64___py3.14t__13.2.0__local__a100.txt
  • Test_linux-aarch64___py3.11__13.2.0__local__a100.txt
  • Test_linux-aarch64___py3.14__13.2.0__local__a100.txt
  • Test_linux-aarch64___py3.11__13.0.2__local__a100.txt
  • Test_linux-aarch64___py3.13__13.2.0__local__a100.txt
  • Test_win-64___py3.12__13.2.0__local__a100__TCC_.txt
  • Test_linux-64___py3.11__13.0.2__local__l4.txt
  • Test_linux-64___py3.14__13.2.0__local__l4.txt
  • Test_linux-64___py3.11__13.2.0__local__l4.txt
  • Test_linux-64___py3.13__13.2.0__local__h100.txt
  • Test_win-64___py3.14__13.2.0__local__l4__MCDM_.txt
  • Test_linux-64___py3.13__13.0.2__local__rtxpro6000.txt
  • Test_linux-64___py3.14t__13.2.0__local__l4.txt
  • Test_linux-64___py3.14t__13.2.0__local__h100_x2_.txt
  • Test_linux-64___py3.14t__13.0.2__local__l4.txt
  • Test_linux-64___py3.14__13.0.2__local__l4.txt
  • Test_linux-64___py3.14__13.2.0__local__t4_x2_.txt
  • Test_linux-64___py3.13__13.0.2__local__h100.txt
  • Test_win-64___py3.12__13.0.2__local__a100__TCC_.txt
  • Test_win-64___py3.14__13.0.2__local__l4__MCDM_.txt
  • Test_win-64___py3.10__13.2.0__local__rtxpro6000__TCC_.txt

Skipped in

  • Test_linux-aarch64___py3.12__13.2.0__wheels__a100.txt
  • Test_linux-aarch64___py3.10__13.0.2__wheels__a100.txt
  • Test_linux-aarch64___py3.12__13.0.2__wheels__a100.txt
  • Test_linux-aarch64___py3.14t__13.0.2__wheels__a100.txt
  • Test_win-64___py3.13__13.0.2__wheels__rtxpro6000__MCDM_.txt
  • Test_win-64___py3.13__13.2.0__wheels__rtxpro6000__MCDM_.txt
  • Test_win-64___py3.14t__13.2.0__wheels__a100__MCDM_.txt
  • Test_linux-64___py3.10__13.2.0__wheels__l4.txt
  • Test_linux-64___py3.10__13.0.2__wheels__l4.txt
  • Test_linux-64___py3.12__13.0.2__wheels__l4.txt
  • Test_linux-64___py3.12__13.2.0__wheels__l4.txt
  • Test_linux-64___py3.12__13.2.0__wheels__rtx4090__wsl.txt
  • Test_linux-aarch64___py3.10__13.2.0__wheels__a100.txt
  • Test_win-64___py3.11__13.0.2__wheels__rtx4090__WDDM_.txt
  • Test_win-64___py3.14t__13.0.2__wheels__a100__MCDM_.txt
  • Test_win-64___py3.11__13.2.0__wheels__rtx4090__WDDM_.txt

@rwgk rwgk requested a review from mdboom April 16, 2026 17:16
@rwgk rwgk marked this pull request as ready for review April 16, 2026 17:17
Contributor

@mdboom mdboom left a comment

LGTM.

The pointer arithmetic in this example should probably be removed someday, but that was pre-existing.

@leofang leofang merged commit d818a75 into NVIDIA:main Apr 16, 2026
252 of 273 checks passed
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

mdboom pushed a commit to mdboom/cuda-python that referenced this pull request Apr 20, 2026
…A#1922)

* fix: re-enable and fix the conjugate gradient multi-block example

Remove the unconditional NVRTC waiver from the renamed example so CI can exercise its real execution path again. While re-enabling it, replace the standalone pytest.skip() checks with requirement_not_met() and simplify the platform gating.

The waived code path had been hiding several Python-side runtime bugs that required these fixes:
- replaced the invalid C-style %d f-string formatting
- fixed gen_tridiag() variable shadowing so the CSR row-offset array is actually populated
- passed the computed dynamic shared-memory size into cuLaunchCooperativeKernel() and made that size integer-valued
- stopped overwriting managed-memory pointer variables with loop indices before kernel launch and cleanup
- cached the residual before freeing dot_result, which removed the teardown segfault

Made-with: Cursor

* fix: check the QNX example gate via platform.system()

QNX is an operating system rather than a machine architecture, so checking platform.machine() can miss the requirement_not_met() path on QNX hosts. Use platform.system() so the example is waived consistently on that platform.

Made-with: Cursor