Add CUDA toolkit major version check #140

Draft
jacobtomlinson wants to merge 2 commits into rapidsai:main from jacobtomlinson:check/cuda-major-mismatch

Conversation

@jacobtomlinson
Member

@jacobtomlinson jacobtomlinson commented Mar 5, 2026

Adds a check that uses cuda.pathfinder to find your CUDA Toolkit and then compares the major version with the driver.

xref #139

@jacobtomlinson jacobtomlinson requested review from a team as code owners March 5, 2026 10:58
@jacobtomlinson jacobtomlinson requested a review from jameslamb March 5, 2026 10:58
Comment on lines +33 to +34
get_driver_cuda_major=_get_driver_cuda_major,
get_toolkit_cuda_major=_get_toolkit_cuda_major,
Member Author

I went with a dependency injection approach here after chatting about it with @mmccarty to make testing easier.

I haven't refactored other checks to reuse this to keep this PR simpler, but we could do that in the future.
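
A minimal sketch of what the injected getters make possible in tests. The function name `check_cuda_major` and its exact behavior are assumptions for illustration, not the code in this PR; only the two parameter names come from the diff above.

```python
from typing import Callable, Optional


def check_cuda_major(
    get_driver_cuda_major: Callable[[], Optional[int]],
    get_toolkit_cuda_major: Callable[[], Optional[int]],
) -> bool:
    """Hypothetical check: raise if the toolkit major is newer than the driver's."""
    driver_major = get_driver_cuda_major()
    toolkit_major = get_toolkit_cuda_major()
    if driver_major is None or toolkit_major is None:
        # Nothing to compare if either version could not be detected.
        return True
    if toolkit_major > driver_major:
        raise ValueError(
            f"CUDA toolkit major version ({toolkit_major}) is newer than "
            f"what the installed driver supports ({driver_major})."
        )
    return True


# In a test, plain lambdas stand in for the real driver and toolkit lookups,
# so no GPU or CUDA install is needed.
assert check_cuda_major(lambda: 12, lambda: 12)
```

Because the lookups are passed in, each branch can be exercised by swapping the lambdas rather than patching module globals.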

Comment on lines +24 to +28
version_file = Path(header_dir) / "cuda_runtime_version.h"
if not version_file.exists():
    return None
match = re.search(r"#define\s+CUDA_VERSION\s+(\d+)", version_file.read_text())
return int(match.group(1)) // 1000 if match else None
Member Author

I'm curious if this is the best way to get the CUDA Toolkit version.
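
For reference, a small sketch of what the parsing in the diff does, run against an in-memory string instead of a real header. The helper name `parse_cuda_major` and the sample text are made up for illustration; the regex and the integer division are the ones from the lines under review.

```python
import re
from typing import Optional


def parse_cuda_major(header_text: str) -> Optional[int]:
    """Return the CUDA major version from a `#define CUDA_VERSION N` line, if any."""
    match = re.search(r"#define\s+CUDA_VERSION\s+(\d+)", header_text)
    # CUDA_VERSION is encoded as 1000 * major + 10 * minor, e.g. 12080 for 12.8,
    # so integer division by 1000 yields the major version.
    return int(match.group(1)) // 1000 if match else None


sample = "#define CUDA_VERSION 12080\n"  # made-up header contents
print(parse_cuda_major(sample))
```

Keeping the parsing separate from the file lookup would also let it be tested without a toolkit installed.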

Contributor

I was doing some digging to see if we could pull it from cudart via the Python API, if it was available, because cudaRuntimeGetVersion exists:

https://docs.nvidia.com/cuda/archive/9.0/cuda-runtime-api/group__CUDART____VERSION.html#group__CUDART____VERSION_1g0e3952c7802fd730432180f1f4a6cdc6

but I wasn't able to do something like

from cuda import cudart
cudart.cudaRuntimeGetVersion()

But with the help of Perplexity, I was able to get the version using ctypes and accessing libcudart.

I don't know if it's cleaner, though, but it would be something like this:

import ctypes
from ctypes import byref, c_int

libcudart = ctypes.cdll.LoadLibrary("libcudart.so")  # conda cuda-cudart provides this

cudaRuntimeGetVersion = libcudart.cudaRuntimeGetVersion
cudaRuntimeGetVersion.argtypes = [ctypes.POINTER(c_int)]
cudaRuntimeGetVersion.restype = c_int

ver = c_int()
err = cudaRuntimeGetVersion(byref(ver))
if err != 0:
    raise RuntimeError(f"cudaRuntimeGetVersion failed with error code {err}")

ver_int = ver.value
major = ver_int // 1000
minor = (ver_int % 1000) // 10
print("CUDA runtime version:", ver_int, f"({major}.{minor})")

Comment on lines +63 to +65
f"CUDA toolkit major version ({toolkit_major}) is newer than what the installed driver supports "
f"({driver_major}). Update your NVIDIA driver to one that supports CUDA {toolkit_major} or "
f"downgrade your CUDA toolkit to CUDA {driver_major}."
Member Author

I think we could improve these errors. It would be nice to detect how CUDA Toolkit has been installed (system, conda, pip, etc) and provide more nuanced advice for the user.

Contributor

We can do that via Python. For example, I'm in a conda environment that has cudf and cuml, and you can access that info via:

>>> from cuda import pathfinder
>>> loaded = pathfinder.load_nvidia_dynamic_lib("cudart")
>>> loaded.abs_path
'/raid/myuser/conda/envs/ray-cuml/lib/libcudart.so'
>>> loaded.found_via
'conda'

And in a different conda env that only has cuda-python, without cuda-runtime installed, I get this:

>>> from cuda import pathfinder
>>> loaded = pathfinder.load_nvidia_dynamic_lib("cudart")
>>> loaded.abs_path
'/usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13'
>>> loaded.found_via
'system-search'
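
A rough sketch of how `found_via` could feed into the error message. The helper `advice_for` and its wording are hypothetical; the only `found_via` values handled explicitly are the two observed in the sessions above (`'conda'` and `'system-search'`), and anything else falls back to generic advice.

```python
def advice_for(found_via: str, toolkit_major: int, driver_major: int) -> str:
    """Hypothetical helper: tailor the fix-it hint to how the toolkit was found."""
    if found_via == "conda":
        fix = f"install a CUDA {driver_major} toolkit in this conda environment"
    elif found_via == "system-search":
        fix = f"install a system CUDA {driver_major} toolkit"
    else:
        # Any other value is not handled in this sketch; give generic advice.
        fix = f"downgrade your CUDA toolkit to CUDA {driver_major}"
    return (
        f"CUDA toolkit major version ({toolkit_major}) is newer than what the "
        f"installed driver supports ({driver_major}). "
        f"Update your NVIDIA driver or {fix}."
    )


print(advice_for("conda", toolkit_major=13, driver_major=12))
```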

@jacobtomlinson jacobtomlinson marked this pull request as draft March 5, 2026 11:05
@jacobtomlinson
Member Author

@jayavenkatesh19 I just pushed this draft up to share more broadly, but if you want to take over this I'd be more than happy.

Comment on lines +63 to +68
if toolkit_major < driver_major:
    raise ValueError(
        f"CUDA toolkit major version ({toolkit_major}) is older than the driver's supported CUDA major version "
        f"({driver_major}). Upgrade your CUDA toolkit to CUDA {driver_major} or "
        f"downgrade your NVIDIA driver to one that supports CUDA {toolkit_major}."
    )
Member Author

This shouldn't necessarily be an error. A newer driver is OK as long as the CTK major version matches all the packages. The problem would be when you have a CUDA 13 driver with CTK 12 but a foo-cu13 Python package, e.g. rapidsai/deployment#516.
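
One way to read the comment above as a rule, sketched with a hypothetical `find_mismatches` helper (the name, signature, and messages are illustrative, not part of this PR): a toolkit newer than the driver is a problem, an older toolkit is fine, and any package built for a different CUDA major than the toolkit is flagged.

```python
from typing import Iterable, List


def find_mismatches(
    driver_major: int, toolkit_major: int, package_majors: Iterable[int]
) -> List[str]:
    """Hypothetical rule set: return a list of human-readable problems."""
    problems = []
    if toolkit_major > driver_major:
        # The driver cannot run a toolkit from a newer major version.
        problems.append(
            f"toolkit CUDA {toolkit_major} is newer than driver CUDA {driver_major}"
        )
    # An older toolkit on a newer driver is fine, so there is no check for it.
    for pkg_major in sorted(set(package_majors)):
        if pkg_major != toolkit_major:
            problems.append(
                f"package built for CUDA {pkg_major} but toolkit is CUDA {toolkit_major}"
            )
    return problems


# Driver 13 with CTK 12 and cu12 packages: no problems reported.
print(find_mismatches(13, 12, [12]))
# Driver 13 with CTK 12 but a cu13 package: the package mismatch is reported.
print(find_mismatches(13, 12, [13]))
```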

jayavenkatesh19 added a commit that referenced this pull request Apr 2, 2026
Adds a new `rapids doctor` check that verifies that the CUDA toolkit
(will refer to this as CTK from here on) is findable and
version-compatible with the GPU driver.

These are the things the check does:

- **Library discoverability**: Use `cuda-pathfinder` to verify that CUDA
libraries can be loaded at runtime. The CTK itself has many libraries,
some of which are not necessary for every RAPIDS operation. For now,
this check verifies that `libcudart.so`, `libnvrtc.so` and `libnvvm.so`
can be loaded. These 3 were chosen because they are more commonly used (cudart is
required for all CUDA operations, while nvrtc and nvvm are used in JIT
compilation). This can be extended to add other libraries of interest in
the CTK, but to keep it universal and based on frequency of usage, I am
checking for these 3 currently.

- **Toolkit vs driver version**: Detects when the CTK major version is newer
than the one the driver supports. An older CTK on a newer driver is allowed,
since drivers are backward compatible. Version detection tries header parsing
first (taken from #140, thanks @jacobtomlinson) and falls back to
`cudaRuntimeGetVersion` (using the snippet from @ncclementi's comment on that
PR) for conda/pip environments, as they do not ship dev headers.

- **System installation checks**: When the CTK is not installed via
conda/pip, it checks the `/usr/local/cuda` symlink and the
`CUDA_HOME`/`CUDA_PATH` environment variables for version mismatches.

I based the order and the checks themselves on the
`load_nvidia_dynamic_lib` [documentation
page](https://nvidia.github.io/cuda-python/cuda-pathfinder/latest/generated/cuda.pathfinder.load_nvidia_dynamic_lib.html)
for `cuda-pathfinder`, where the search order is specified as
site-packages (pip) -> conda -> OS defaults -> CUDA_HOME.

One scenario which isn't covered by these tests is described in this
[comment](#140 (comment)).
This check was originally only meant to test compatibility and
discoverability between the CTK and the GPU driver, not whether the Python
packages match the CTK. For `pip` packages, reading the suffixes
seems like an easy enough way to do it, but I'm not sure how we would
do that for `conda` packages.
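
A possible sketch of the suffix-reading idea for `pip` packages. It assumes package names end in a `-cuNN` suffix as in the `foo-cu13` example; the helper names are hypothetical, and names that follow other conventions would be missed.

```python
import re
from importlib import metadata
from typing import Dict, Optional

# Assumption: the CUDA major is the trailing "-cuNN" (or "_cuNN") of the name.
_CU_SUFFIX = re.compile(r"[-_]cu(\d+)$")


def cuda_major_from_name(dist_name: str) -> Optional[int]:
    """Return the CUDA major encoded in a package name suffix, if present."""
    match = _CU_SUFFIX.search(dist_name.lower())
    return int(match.group(1)) if match else None


def installed_cuda_suffixes() -> Dict[str, int]:
    """Map each installed distribution with a -cuNN suffix to its CUDA major."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"] or ""  # guard against broken metadata
        major = cuda_major_from_name(name)
        if major is not None:
            found[name] = major
    return found


print(cuda_major_from_name("foo-cu13"))
print(installed_cuda_suffixes())
```

The resulting majors could then be compared against the detected CTK major. This does not address the `conda` case raised above.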

---------

Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>