Closed

Commits (30)
f16dc1b
Move all files from cuda/core/experimental/ to cuda/core/
rwgk Dec 2, 2025
77445b3
Update all internal imports from cuda.core.experimental.* to cuda.core.*
rwgk Dec 2, 2025
3c79ca6
Update cuda/core/__init__.py to export all moved symbols
rwgk Dec 2, 2025
b54bf6a
Create forwarding stubs in experimental/__init__.py with deprecation …
rwgk Dec 2, 2025
411bbd7
Update test imports to use new cuda.core.* paths
rwgk Dec 2, 2025
29e31df
Update Cython test file to use new import paths
rwgk Dec 2, 2025
b7cb72b
Add backward compatibility tests for experimental stubs
rwgk Dec 2, 2025
55a50e6
Update example files to use new cuda.core.* imports
rwgk Dec 2, 2025
b2f28e9
Fix build_hooks.py to use new cuda.core.* paths
rwgk Dec 2, 2025
45c92f5
Fix _context.pyx to use new import path
rwgk Dec 2, 2025
b7c0cc0
Fix _memory/_legacy.py to use new import path
rwgk Dec 2, 2025
d47b29d
Fix _memory/_virtual_memory_resource.py to use new import path
rwgk Dec 2, 2025
2e0da66
Fix test helpers to use new cuda.core.* import paths
rwgk Dec 2, 2025
c414b7d
Fix test files to use new cuda.core._* import paths
rwgk Dec 2, 2025
d582448
Fix experimental.utils submodule access and test_device_id
rwgk Dec 2, 2025
3cc287c
Fix test_experimental_direct_imports to handle module caching
rwgk Dec 2, 2025
a040505
Merge main into cuda_core_experimental_deprecation_v0
rwgk Dec 2, 2025
4f34fdd
Address review comments: use pytest.deprecated_call and clarify warni…
rwgk Dec 2, 2025
6718a70
pre-commit fixes (automatic, trivial)
rwgk Dec 2, 2025
506b912
Fix import path in test_memory_peer_access.py
rwgk Dec 2, 2025
dfa0f0e
Remove experimental namespace from remaining test files
rwgk Dec 2, 2025
64a2b52
Merge main into cuda_core_experimental_deprecation_v0
rwgk Dec 9, 2025
5328873
Merge main into cuda_core_experimental_deprecation_v0
rwgk Dec 10, 2025
11f52c7
Merge main into cuda_core_experimental_deprecation_v0
rwgk Dec 15, 2025
4307b5d
Migrate StridedLayout from experimental to main namespace
rwgk Dec 15, 2025
4b9796b
Sort __getattr__ name list alphabetically and remove one duplicate (_…
rwgk Dec 15, 2025
798ce3d
Remove redundant TYPE_CHECKING block from experimental/__init__.py
rwgk Dec 15, 2025
09a78cc
Update references from cuda.core.experimental to cuda.core
rwgk Dec 15, 2025
516e088
Fix merge script to preserve Python modules in cuda/core
rwgk Dec 16, 2025
475a1df
Fix recursion in merge script by excluding versioned directories
rwgk Dec 16, 2025
6 changes: 3 additions & 3 deletions .github/ISSUE_TEMPLATE/bug_report.yml
@@ -52,7 +52,7 @@ body:
attributes:
label: Describe the bug
description: A clear and concise description of what problem you are running into.
placeholder: "Attempting to compile a program via `cuda.core.experimental.Program.compile` throws a `ValueError`."
placeholder: "Attempting to compile a program via `cuda.core.Program.compile` throws a `ValueError`."
validations:
required: true

@@ -62,7 +62,7 @@ body:
label: How to Reproduce
description: Steps used to reproduce the bug.
placeholder: |
0. Construct a `cuda.core.experimental.Program` instance
0. Construct a `cuda.core.Program` instance
1. Call the `.compile(...)` method of the instance
2. The call throws a `ValueError` with the following:
```
@@ -76,7 +76,7 @@
attributes:
label: Expected behavior
description: A clear and concise description of what you expected to happen.
placeholder: "Using `cuda.core.experimental.Program.compile(...)` should run successfully and not throw a `ValueError`"
placeholder: "Using `cuda.core.Program.compile(...)` should run successfully and not throw a `ValueError`"
validations:
required: true

6 changes: 3 additions & 3 deletions .github/ISSUE_TEMPLATE/feature_request.yml
@@ -36,7 +36,7 @@ body:
attributes:
label: Is your feature request related to a problem? Please describe.
description: A clear and concise description of what the problem is, e.g., "I would like to be able to..."
placeholder: I would like to be able to use the equivalent of `cuda.core.experimental.Program.compile(...)` to compile my code to PTX.
placeholder: I would like to be able to use the equivalent of `cuda.core.Program.compile(...)` to compile my code to PTX.
validations:
required: true

@@ -46,7 +46,7 @@ body:
label: Describe the solution you'd like
description: A clear and concise description of what you want to happen.
placeholder: |
Support a `ptx` target_type in the `cuda.core.experimental.Program.compile(...)` function.
Support a `ptx` target_type in the `cuda.core.Program.compile(...)` function.
validations:
required: true

@@ -57,7 +57,7 @@
description:
If applicable, please add a clear and concise description of any alternative solutions or features you've
considered.
placeholder: The alternatives to using `cuda.core.experimental.Program.compile(...)` are unappealing. They usually involve using lower level bindings to something like nvRTC or invoking the nvcc executable.
placeholder: The alternatives to using `cuda.core.Program.compile(...)` are unappealing. They usually involve using lower level bindings to something like nvRTC or invoking the nvcc executable.
validations:
required: false

2 changes: 1 addition & 1 deletion .spdx-ignore
@@ -10,6 +10,6 @@ requirements*.txt
cuda_bindings/examples/*

# Vendored
cuda_core/cuda/core/experimental/include/dlpack.h
cuda_core/cuda/core/include/dlpack.h

qa/ctk-next.drawio.svg
53 changes: 32 additions & 21 deletions ci/tools/merge_cuda_core_wheels.py
@@ -12,8 +12,8 @@

In particular, each wheel contains a CUDA-specific build of the `cuda.core` library
and the associated bindings. This script merges these directories into a single wheel
that supports both CUDA versions, i.e., containing both `cuda/core/experimental/cu12`
and `cuda/core/experimental/cu13`. At runtime, the code in `cuda/core/experimental/__init__.py`
that supports both CUDA versions, i.e., containing both `cuda/core/cu12`
and `cuda/core/cu13`. At runtime, the code in `cuda/core/__init__.py`
is used to import the appropriate CUDA-specific bindings.

This script is based on the one in NVIDIA/CCCL.
@@ -94,27 +94,38 @@ def merge_wheels(wheels: List[Path], output_dir: Path) -> Path:
# Use the first wheel as the base and merge binaries from others
base_wheel = extracted_wheels[0]

# now copy the version-specific directory from other wheels
# into the appropriate place in the base wheel
# Copy version-specific binaries from each wheel into versioned subdirectories
# Note: Python modules stay in cuda/core/, only binaries go into cu12/cu13/
base_dir = Path("cuda") / "core"

for i, wheel_dir in enumerate(extracted_wheels):
cuda_version = wheels[i].name.split(".cu")[1].split(".")[0]
base_dir = Path("cuda") / "core" / "experimental"
# Copy from other wheels
print(f" Copying {wheel_dir} to {base_wheel}", file=sys.stderr)
shutil.copytree(wheel_dir / base_dir, base_wheel / base_dir / f"cu{cuda_version}")

# Overwrite the __init__.py in versioned dirs
os.truncate(base_wheel / base_dir / f"cu{cuda_version}" / "__init__.py", 0)

# The base dir should only contain __init__.py, the include dir, and the versioned dirs
files_to_remove = os.scandir(base_wheel / base_dir)
for f in files_to_remove:
f_abspath = f.path
if f.name not in ("__init__.py", "cu12", "cu13", "include"):
if f.is_dir():
shutil.rmtree(f_abspath)
else:
os.remove(f_abspath)
versioned_dir = base_wheel / base_dir / f"cu{cuda_version}"

# Create versioned directory
versioned_dir.mkdir(parents=True, exist_ok=True)

# Copy only version-specific binaries (.so, .pyd, .dll files) from the source wheel
# Python modules (.py, .pyx, .pxd) remain in cuda/core/
# Exclude versioned directories (cu12/, cu13/) to avoid recursion
source_dir = wheel_dir / base_dir
for item in source_dir.rglob("*"):
if item.is_dir():
continue

# Skip files in versioned directories to avoid recursion
rel_path = item.relative_to(source_dir)
if any(part in ("cu12", "cu13") for part in rel_path.parts):
continue

# Only copy binary files, not Python source files
if item.suffix in (".so", ".pyd", ".dll"):
dest_item = versioned_dir / rel_path
dest_item.parent.mkdir(parents=True, exist_ok=True)
shutil.copy2(item, dest_item)

# Create empty __init__.py in versioned dirs
(versioned_dir / "__init__.py").touch()

# Repack the merged wheel
output_dir.mkdir(parents=True, exist_ok=True)
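The module docstring above states that the merged wheel must end up with both cuda/core/cu12 and cuda/core/cu13 directories while the shared Python modules stay in cuda/core/. A quick way to sanity-check a merged wheel locally is to inspect its file list; the following is a minimal sketch under that assumption (the helper name and CLI usage are hypothetical, not part of this PR):

```python
# Hypothetical post-merge check: confirm the merged wheel carries binary
# directories for both CUDA major versions under cuda/core/.
import sys
import zipfile


def check_merged_wheel(wheel_path: str) -> None:
    with zipfile.ZipFile(wheel_path) as whl:
        names = whl.namelist()
    for subdir in ("cuda/core/cu12/", "cuda/core/cu13/"):
        if not any(n.startswith(subdir) for n in names):
            raise SystemExit(f"missing {subdir} in {wheel_path}")
    print(f"{wheel_path}: contains both cu12 and cu13 directories")


if __name__ == "__main__":
    check_merged_wheel(sys.argv[1])
```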
13 changes: 9 additions & 4 deletions cuda_core/build_hooks.py
@@ -66,7 +66,7 @@ def _build_cuda_core():

# It seems setuptools' wildcard support has problems for namespace packages,
# so we explicitly spell out all Extension instances.
root_module = "cuda.core.experimental"
root_module = "cuda.core"
root_path = f"{os.path.sep}".join(root_module.split(".")) + os.path.sep
ext_files = glob.glob(f"{root_path}/**/*.pyx", recursive=True)

@@ -84,11 +84,16 @@ def get_cuda_paths():
print("CUDA paths:", CUDA_PATH)
return CUDA_PATH

# Add local include directory for cuda/core/include
local_include_dirs = ["cuda/core"]

Comment on lines +87 to +88 (Member):
Q: Why is this needed? We do not keep any headers under cuda/core or cuda/core/experimental. If Cython runs into compilation errors without this, I suspect it has to do with the __init_experimental.pxd file below.

cuda_include_dirs = list(os.path.join(root, "include") for root in get_cuda_paths())
all_include_dirs = local_include_dirs + cuda_include_dirs

ext_modules = tuple(
Extension(
f"cuda.core.experimental.{mod.replace(os.path.sep, '.')}",
sources=[f"cuda/core/experimental/{mod}.pyx"],
include_dirs=list(os.path.join(root, "include") for root in get_cuda_paths()),
f"cuda.core.{mod.replace(os.path.sep, '.')}",
sources=[f"cuda/core/{mod}.pyx"],
include_dirs=all_include_dirs,
language="c++",
)
for mod in module_names
56 changes: 56 additions & 0 deletions cuda_core/cuda/core/__init__.py
@@ -3,3 +3,59 @@
# SPDX-License-Identifier: Apache-2.0

from cuda.core._version import __version__

try:
from cuda import bindings
except ImportError:
raise ImportError("cuda.bindings 12.x or 13.x must be installed") from None
else:
cuda_major, cuda_minor = bindings.__version__.split(".")[:2]
if cuda_major not in ("12", "13"):
raise ImportError("cuda.bindings 12.x or 13.x must be installed")

import importlib

subdir = f"cu{cuda_major}"
try:
versioned_mod = importlib.import_module(f".{subdir}", __package__)
# Import all symbols from the module
globals().update(versioned_mod.__dict__)
except ImportError:
# This is not a wheel build, but a conda or local build, do nothing
pass
else:
del versioned_mod
finally:
del bindings, importlib, subdir, cuda_major, cuda_minor

from cuda.core import utils # noqa: E402
from cuda.core._device import Device # noqa: E402
from cuda.core._event import Event, EventOptions # noqa: E402
from cuda.core._graph import ( # noqa: E402
Graph,
GraphBuilder,
GraphCompleteOptions,
GraphDebugPrintOptions,
)
from cuda.core._launch_config import LaunchConfig # noqa: E402
from cuda.core._launcher import launch # noqa: E402
from cuda.core._layout import StridedLayout # noqa: E402
from cuda.core._linker import Linker, LinkerOptions # noqa: E402
from cuda.core._memory import ( # noqa: E402
Buffer,
DeviceMemoryResource,
DeviceMemoryResourceOptions,
GraphMemoryResource,
LegacyPinnedMemoryResource,
MemoryResource,
VirtualMemoryResource,
VirtualMemoryResourceOptions,
)

Comment on lines +44 to +53 (Member):
This list needs to be updated to capture a few new things.

from cuda.core._module import Kernel, ObjectCode # noqa: E402
from cuda.core._program import Program, ProgramOptions # noqa: E402
from cuda.core._stream import Stream, StreamOptions # noqa: E402
from cuda.core._system import System # noqa: E402

system = System()
__import__("sys").modules[__spec__.name + ".system"] = system
del System

Member:
Cython needs an exact filename __init__.pxd so as to find the current modules. This renaming does not make sense to me. What does it do?

Collaborator Author:
Probably an accident.

I had to git merge main already several times, a likely cause of accidents. The merges are messy. I'm using Cursor and it seems to miss some details sometimes.

This PR has become much more complex than it looked like initially (tests passed a couple weeks ago).

E.g. I don't understand why ci/tools/merge_cuda_core_wheels.py (last changed in Oct) worked before but now needs a change (it makes sense that it needs a change). Testing it locally isn't easy because the input wheels only live on the runners and disappear when the jobs finish. So I'd have to figure out how to build them locally (currently I'm not trying to).

I'll keep beating on it until I get all tests to pass again. Maybe I'll start over fresh. LLMs seem to be less successful when working incrementally.

Thanks for the hints.

Member:
Ah! Leaving a quick note before signing off...

The merge script can easily be tested locally. Create two conda envs, one with CUDA 13 and another with CUDA 12, and build cuda.core wheels in each env via `pip wheel --no-deps --no-build-isolation -v .`. The cuda-bindings package can just be conda-installed together with the CTK for simplicity; no need to build it from source. Then take the two wheels and rename them before running the merge script. This logic is relevant:

# Rename wheel to include CUDA version suffix
mkdir -p "${{ env.CUDA_CORE_ARTIFACTS_DIR }}/cu${BUILD_PREV_CUDA_MAJOR}"
for wheel in ${{ env.CUDA_CORE_ARTIFACTS_DIR }}/*.whl; do
if [[ -f "${wheel}" ]]; then
base_name=$(basename "${wheel}" .whl)
new_name="${base_name}.cu${BUILD_PREV_CUDA_MAJOR}.whl"
mv "${wheel}" "${{ env.CUDA_CORE_ARTIFACTS_DIR }}/cu${BUILD_PREV_CUDA_MAJOR}/${new_name}"
echo "Renamed wheel to: ${new_name}"
fi
done

Once this PR (#1085) is revived/merged, we will be able to build cuda-core in a pure-wheel env, with no need to use conda.
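
Stepping back from the merge-script details, the user-visible effect of this PR is the flattened namespace: symbols import from cuda.core directly, and the experimental namespace forwards with a warning. A hedged sketch of the intended behavior, assuming the forwarding stubs warn on attribute access with DeprecationWarning (as the commit messages and the pytest.deprecated_call review note suggest):

```python
# Illustrative test sketch, not the actual backward-compatibility test from this PR.
import pytest

from cuda.core import Device  # new, preferred import path


def test_experimental_import_still_works_but_warns():
    with pytest.deprecated_call():
        from cuda.core.experimental import Device as LegacyDevice  # deprecated path
    assert LegacyDevice is Device  # stubs are expected to forward, not wrap
```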

File renamed without changes.
@@ -4,7 +4,7 @@

from dataclasses import dataclass

from cuda.core.experimental._utils.cuda_utils import driver
from cuda.core._utils.cuda_utils import driver


@dataclass
@@ -6,27 +6,27 @@ cimport cpython
from libc.stdint cimport uintptr_t

from cuda.bindings cimport cydriver
from cuda.core.experimental._utils.cuda_utils cimport HANDLE_RETURN
from cuda.core._utils.cuda_utils cimport HANDLE_RETURN

import threading
from typing import Optional, TYPE_CHECKING, Union

from cuda.core.experimental._context import Context, ContextOptions
from cuda.core.experimental._event import Event, EventOptions
from cuda.core.experimental._graph import GraphBuilder
from cuda.core.experimental._stream import IsStreamT, Stream, StreamOptions
from cuda.core.experimental._utils.clear_error_support import assert_type
from cuda.core.experimental._utils.cuda_utils import (
from cuda.core._context import Context, ContextOptions
from cuda.core._event import Event, EventOptions
from cuda.core._graph import GraphBuilder
from cuda.core._stream import IsStreamT, Stream, StreamOptions
from cuda.core._utils.clear_error_support import assert_type
from cuda.core._utils.cuda_utils import (
ComputeCapability,
CUDAError,
driver,
handle_return,
runtime,
)
from cuda.core.experimental._stream cimport default_stream
from cuda.core._stream cimport default_stream

if TYPE_CHECKING:
from cuda.core.experimental._memory import Buffer, MemoryResource
from cuda.core._memory import Buffer, MemoryResource

# TODO: I prefer to type these as "cdef object" and avoid accessing them from within Python,
# but it seems it is very convenient to expose them for testing purposes...
@@ -1154,17 +1154,17 @@ class Device:
)
)
if attr == 1:
from cuda.core.experimental._memory import DeviceMemoryResource
from cuda.core._memory import DeviceMemoryResource
self._memory_resource = DeviceMemoryResource(self._id)
else:
from cuda.core.experimental._memory import _SynchronousMemoryResource
from cuda.core._memory import _SynchronousMemoryResource
self._memory_resource = _SynchronousMemoryResource(self._id)

return self._memory_resource

@memory_resource.setter
def memory_resource(self, mr):
from cuda.core.experimental._memory import MemoryResource
from cuda.core._memory import MemoryResource
assert_type(mr, MemoryResource)
self._memory_resource = mr

@@ -1223,7 +1223,7 @@ class Device:
Acts as an entry point of this object. Users always start a code by
calling this method, e.g.

>>> from cuda.core.experimental import Device
>>> from cuda.core import Device
>>> dev0 = Device(0)
>>> dev0.set_current()
>>> # ... do work on device 0 ...
@@ -8,7 +8,7 @@ cimport cpython
from libc.stdint cimport uintptr_t
from libc.string cimport memcpy
from cuda.bindings cimport cydriver
from cuda.core.experimental._utils.cuda_utils cimport (
from cuda.core._utils.cuda_utils cimport (
check_or_create_options,
HANDLE_RETURN
)
@@ -18,8 +18,8 @@ from dataclasses import dataclass
import multiprocessing
from typing import TYPE_CHECKING, Optional

from cuda.core.experimental._context import Context
from cuda.core.experimental._utils.cuda_utils import (
from cuda.core._context import Context
from cuda.core._utils.cuda_utils import (
CUDAError,
check_multiprocessing_start_method,
driver,
@@ -9,8 +9,8 @@
from typing import TYPE_CHECKING

if TYPE_CHECKING:
from cuda.core.experimental._stream import Stream
from cuda.core.experimental._utils.cuda_utils import (
from cuda.core._stream import Stream
from cuda.core._utils.cuda_utils import (
driver,
get_binding_version,
handle_return,
@@ -15,8 +15,8 @@ import ctypes

import numpy

from cuda.core.experimental._memory import Buffer
from cuda.core.experimental._utils.cuda_utils import driver
from cuda.core._memory import Buffer
from cuda.core._utils.cuda_utils import driver
from cuda.bindings cimport cydriver


@@ -2,8 +2,8 @@
#
# SPDX-License-Identifier: Apache-2.0

from cuda.core.experimental._device import Device
from cuda.core.experimental._utils.cuda_utils import (
from cuda.core._device import Device
from cuda.core._utils.cuda_utils import (
CUDAError,
cast_to_3_tuple,
driver,
@@ -1,15 +1,15 @@
# SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# SPDX-License-Identifier: Apache-2.0
from cuda.core.experimental._launch_config cimport LaunchConfig, _to_native_launch_config
from cuda.core.experimental._stream cimport Stream_accept
from cuda.core._launch_config cimport LaunchConfig, _to_native_launch_config
from cuda.core._stream cimport Stream_accept


from cuda.core.experimental._kernel_arg_handler import ParamHolder
from cuda.core.experimental._module import Kernel
from cuda.core.experimental._stream import Stream
from cuda.core.experimental._utils.clear_error_support import assert_type
from cuda.core.experimental._utils.cuda_utils import (
from cuda.core._kernel_arg_handler import ParamHolder
from cuda.core._module import Kernel
from cuda.core._stream import Stream
from cuda.core._utils.clear_error_support import assert_type
from cuda.core._utils.cuda_utils import (
_reduce_3_tuple,
check_or_create_options,
driver,
@@ -18,7 +18,7 @@ ctypedef uint32_t property_mask_t
ctypedef vector.vector[stride_t] extents_strides_t
ctypedef vector.vector[axis_t] axis_vec_t

from cuda.core.experimental._utils cimport cuda_utils
from cuda.core._utils cimport cuda_utils


ctypedef fused integer_t: