Add CUDA toolkit major version check #140

jacobtomlinson wants to merge 2 commits into rapidsai:main
Conversation
```python
get_driver_cuda_major=_get_driver_cuda_major,
get_toolkit_cuda_major=_get_toolkit_cuda_major,
```
I went with a dependency injection approach here after chatting about it with @mmccarty to make testing easier.
I haven't refactored other checks to reuse this to keep this PR simpler, but we could do that in the future.
```python
version_file = Path(header_dir) / "cuda_runtime_version.h"
if not version_file.exists():
    return None
match = re.search(r"#define\s+CUDA_VERSION\s+(\d+)", version_file.read_text())
return int(match.group(1)) // 1000 if match else None
```
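For reference, the parsing step from the diff can be exercised in isolation against an inline sample. The header filename and `CUDA_VERSION` macro name are taken from the diff above; the `1000 * major + 10 * minor` encoding is the standard CUDA version convention the `// 1000` relies on:

```python
import re

def parse_cuda_major(header_text):
    # Same regex as the diff: capture the integer after CUDA_VERSION
    match = re.search(r"#define\s+CUDA_VERSION\s+(\d+)", header_text)
    # CUDA encodes versions as 1000 * major + 10 * minor, e.g. 12040 -> 12.4
    return int(match.group(1)) // 1000 if match else None

print(parse_cuda_major("#define CUDA_VERSION 12040"))  # 12
print(parse_cuda_major("// no version macro here"))    # None
```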
I'm curious if this is the best way to get the CUDA Toolkit version.
I was doing some digging to see if we could pull it from cudart via the Python API, since `cudaRuntimeGetVersion` exists, but I wasn't able to do something like

```python
from cuda import cudart
cudart.cudaRuntimeGetVersion()
```

With the help of Perplexity, I was able to get the version using ctypes and accessing libcudart. I don't know if it's cleaner though, but it would be something like this:
```python
import ctypes
from ctypes import byref, c_int

libcudart = ctypes.cdll.LoadLibrary("libcudart.so")  # conda cuda-cudart provides this
cudaRuntimeGetVersion = libcudart.cudaRuntimeGetVersion
cudaRuntimeGetVersion.argtypes = [ctypes.POINTER(c_int)]
cudaRuntimeGetVersion.restype = c_int

ver = c_int()
err = cudaRuntimeGetVersion(byref(ver))
if err != 0:
    raise RuntimeError(f"cudaRuntimeGetVersion failed with error code {err}")

ver_int = ver.value
major = ver_int // 1000
minor = (ver_int % 1000) // 10
print("CUDA runtime version:", ver_int, f"({major}.{minor})")
```

```python
f"CUDA toolkit major version ({toolkit_major}) is newer than what the installed driver supports "
f"({driver_major}). Update your NVIDIA driver to one that supports CUDA {toolkit_major} or "
f"downgrade your CUDA toolkit to CUDA {driver_major}."
```
I think we could improve these errors. It would be nice to detect how the CUDA Toolkit has been installed (system, conda, pip, etc.) and provide more nuanced advice for the user.
We can do that via Python. For example, I'm in a conda environment that has cudf and cuml, and you can access that info via

```python
>>> from cuda import pathfinder
>>> loaded = pathfinder.load_nvidia_dynamic_lib("cudart")
>>> loaded.abs_path
'/raid/myuser/conda/envs/ray-cuml/lib/libcudart.so'
>>> loaded.found_via
'conda'
```

and on a different conda env that only has cuda-python but doesn't have cuda-runtime installed, I get this:

```python
>>> from cuda import pathfinder
>>> loaded = pathfinder.load_nvidia_dynamic_lib("cudart")
>>> loaded.abs_path
'/usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13'
>>> loaded.found_via
'system-search'
```
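Sketching how `found_via` could feed into more tailored advice: the mapping below is hypothetical. Only the `'conda'` and `'system-search'` values were observed above, and the advice strings are placeholders, not the check's real wording:

```python
# Hypothetical mapping from cuda.pathfinder's found_via value to advice.
# Only 'conda' and 'system-search' appear in the sessions above; other
# values fall through to a generic hint.
_ADVICE = {
    "conda": "update your conda-installed CUDA toolkit packages",
    "system-search": "update the system CUDA installation (e.g. under /usr/local/cuda)",
}

def remediation_hint(found_via):
    generic = "install a CUDA toolkit matching your driver"
    return _ADVICE.get(found_via, generic)

print(remediation_hint("conda"))
print(remediation_hint("something-else"))
```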
@jayavenkatesh19 I just pushed this draft up to share more broadly, but if you want to take this over I'd be more than happy.
```python
if toolkit_major < driver_major:
    raise ValueError(
        f"CUDA toolkit major version ({toolkit_major}) is older than the driver's supported CUDA major version "
        f"({driver_major}). Upgrade your CUDA toolkit to CUDA {driver_major} or "
        f"downgrade your NVIDIA driver to one that supports CUDA {toolkit_major}."
    )
```
This shouldn't necessarily be an error; a newer driver is OK as long as the CTK major version matches all the packages. The problem would be when you have a driver supporting CUDA 13, with CTK 12, but a foo-cu13 Python package. E.g. rapidsai/deployment#516
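Following that logic, this branch could report a warning rather than an error. A hypothetical severity classification, not the PR's code:

```python
def classify_version_gap(toolkit_major, driver_major):
    # Toolkit newer than the driver supports: a real error, since
    # forward compatibility isn't guaranteed.
    if toolkit_major > driver_major:
        return "error"
    # Toolkit older than the driver: fine on its own thanks to driver
    # backward compatibility, but worth a warning because a foo-cu13
    # package could still mismatch a CTK 12 install.
    if toolkit_major < driver_major:
        return "warning"
    return "ok"

print(classify_version_gap(13, 12))  # error
print(classify_version_gap(12, 13))  # warning
print(classify_version_gap(12, 12))  # ok
```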
Adds a new `rapids doctor` check that verifies that the CUDA toolkit (referred to as CTK from here on) is findable and version-compatible with the GPU driver. These are the things the check does:

- **Library discoverability**: Uses `cuda-pathfinder` to verify that CUDA libraries can be loaded at runtime. The CTK itself has many libraries, some of which are not necessary for every RAPIDS operation. For now, this check verifies that `libcudart.so`, `libnvrtc.so`, and `libnvvm.so` can be loaded. These three were chosen because they are the most commonly used (cudart is required for all CUDA operations, while nvrtc and nvvm are used in JIT compilation). This can be extended to cover other libraries of interest in the CTK, but to keep it universal and based on frequency of usage, I am checking for these three currently.
- **Toolkit vs driver version**: Detects when the CTK major version is newer than the driver's. Backward compatibility is supported. Version detection tries header parsing first (got this from #140, thanks @jacobtomlinson), and falls back to `cudaRuntimeGetVersion` (got the snippet from @ncclementi's comment on the PR above) for conda/pip environments, as they do not ship dev headers.
- **System installation checks**: When the CTK is not installed via conda/pip, it checks the `/usr/local/cuda` symlink and the `CUDA_HOME`/`CUDA_PATH` variables for version mismatches.

I based the order and the checks themselves on the `load_nvidia_dynamic_lib` [documentation page](https://nvidia.github.io/cuda-python/cuda-pathfinder/latest/generated/cuda.pathfinder.load_nvidia_dynamic_lib.html) for `cuda-pathfinder`, where the search order is specified as site-packages (pip) -> conda -> OS defaults -> CUDA_HOME.

One scenario which isn't covered by these tests is described in this [comment](#140 (comment)). This check was originally only meant to test compatibility and discoverability between the CTK and the GPU driver, not whether the Python packages match the CTK. For `pip` packages, reading the suffixes seems like an easy enough way to do it, but I'm not sure how we would do that for `conda` packages.

---------

Signed-off-by: Jaya Venkatesh <jjayabaskar@nvidia.com>
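Reading the `-cuNN` suffix for pip packages could look something like this. A hypothetical helper, assuming the suffix convention used by wheels such as `cudf-cu12`:

```python
import re

def cuda_suffix_major(package_name):
    # Hypothetical: pull N out of a '-cuN' distribution-name suffix,
    # as used by wheels like cudf-cu12 or cuml-cu12.
    match = re.search(r"-cu(\d+)$", package_name)
    return int(match.group(1)) if match else None

print(cuda_suffix_major("cudf-cu12"))  # 12
print(cuda_suffix_major("numpy"))      # None
```

Comparing the extracted suffix against the detected CTK major would cover the uncovered scenario for pip installs.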
Adds a check that uses `cuda.pathfinder` to find your CUDA Toolkit and then compares the major version with the driver's.

xref #139