1 change: 1 addition & 0 deletions cuda_core/build_hooks.py
@@ -105,6 +105,7 @@ def get_sources(mod_name):
     sources = [f"cuda/core/{mod_name}.pyx"]

     # Add module-specific .cpp file from _cpp/ directory if it exists
+    # Example: _resource_handles.pyx finds _cpp/resource_handles.cpp.
     cpp_file = f"cuda/core/_cpp/{mod_name.lstrip('_')}.cpp"
     if os.path.exists(cpp_file):
         sources.append(cpp_file)
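The pairing convention in the hunk above can be sketched as a standalone function. This is a minimal sketch mirroring the diff; the `exists` parameter is a test seam added here for illustration and is not part of the real build hook:

```python
import os

def get_sources(mod_name, exists=os.path.exists):
    # Each Cython module compiles from its .pyx file.
    sources = [f"cuda/core/{mod_name}.pyx"]
    # A matching C++ file drops the module's leading underscore:
    # _resource_handles.pyx pairs with _cpp/resource_handles.cpp.
    cpp_file = f"cuda/core/_cpp/{mod_name.lstrip('_')}.cpp"
    if exists(cpp_file):
        sources.append(cpp_file)
    return sources

# A module with a C++ counterpart gets both sources; one without gets only the .pyx.
assert get_sources("_resource_handles", exists=lambda p: True) == [
    "cuda/core/_resource_handles.pyx",
    "cuda/core/_cpp/resource_handles.cpp",
]
assert get_sources("_event", exists=lambda p: False) == ["cuda/core/_event.pyx"]
```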
80 changes: 30 additions & 50 deletions cuda_core/cuda/core/_cpp/DESIGN.md
@@ -101,25 +101,20 @@ return as_py(h_stream)  # cuda.bindings.driver.CUstream
 ```
 cuda/core/
 ├── _resource_handles.pyx             # Cython module (compiles resource_handles.cpp)
-├── _resource_handles.pxd             # Cython declarations and dispatch wrappers
+├── _resource_handles.pxd             # Cython declarations for consumer modules
 └── _cpp/
     ├── resource_handles.hpp          # C++ API declarations
-    ├── resource_handles.cpp          # C++ implementation
-    └── resource_handles_cxx_api.hpp  # Capsule struct definition
+    └── resource_handles.cpp          # C++ implementation
 ```

 ### Build Implications

 The `_cpp/` subdirectory contains C++ source files that are compiled into the
 `_resource_handles` extension module. Other Cython modules in cuda.core do **not**
-link against this code directly—they access it through a capsule mechanism
-(explained below).
+link against this code directly—they `cimport` functions from
+`_resource_handles.pxd`, and calls go through `_resource_handles.so` at runtime.

-## Capsule Architecture
-
-The implementation uses **two separate capsule mechanisms** for different purposes:
-
-### Capsule 1: C++ API Table (`_CXX_API`)
+## Cross-Module Function Sharing

 **Problem**: Cython extension modules compile independently. If multiple modules
 (`_memory.pyx`, `_ipc.pyx`, etc.) each linked `resource_handles.cpp`, they would
@@ -129,38 +124,32 @@ each have their own copies of:
 - Thread-local error state
 - Other static data, including global caches

-**Solution**: Only `_resource_handles.so` links the C++ code. It exports a capsule
-containing function pointers:
-
-```cpp
-struct ResourceHandlesCxxApiV1 {
-    uint32_t abi_version;
-    uint32_t struct_size;
-
-    // Thread-local error handling
-    CUresult (*get_last_error)() noexcept;
-    CUresult (*peek_last_error)() noexcept;
-    void (*clear_last_error)() noexcept;
-
-    // Handle creation functions
-    ContextHandle (*get_primary_context)(int device_id) noexcept;
-    StreamHandle (*create_stream_handle)(...) noexcept;
-    // ... etc
-};
-```
+**Solution**: Only `_resource_handles.so` links the C++ code. The `.pyx` file
+uses `cdef extern from` to declare C++ functions with Cython-accessible names:
+
+```cython
+# In _resource_handles.pyx
+cdef extern from "_cpp/resource_handles.hpp" namespace "cuda_core":
+    StreamHandle create_stream_handle "cuda_core::create_stream_handle" (
+        ContextHandle h_ctx, unsigned int flags, int priority) nogil
+    # ... other functions
+```

-Other Cython modules import this capsule at runtime and call through the function
-pointers. The `.pxd` file provides inline wrappers that hide this indirection:
+The `.pxd` file declares these same functions so other modules can `cimport` them:

 ```cython
-cdef inline StreamHandle create_stream_handle(...) except * nogil:
-    return _handles_table.create_stream_handle(...)
+# In _resource_handles.pxd
+cdef StreamHandle create_stream_handle(
+    ContextHandle h_ctx, unsigned int flags, int priority) noexcept nogil
 ```

-Importing modules are expected to call `_init_handles_table()` prior to calling
-any wrapper functions.
+The `cdef extern from` declaration in the `.pyx` satisfies the `.pxd` declaration
+directly—no wrapper functions are needed. When consumer modules `cimport` these
+functions, Cython generates calls through `_resource_handles.so` at runtime.
+This ensures all static and thread-local state lives in a single shared library,
+avoiding the duplicate state problem.

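The duplicate-state hazard that motivates keeping all statics in one shared library can be illustrated with a small Python analogy (pure illustration; no real CUDA state or cuda.core API is involved):

```python
class StaticallyLinkedModule:
    """Models an extension module that linked its own copy of
    resource_handles.cpp: the handle cache is per-copy, not shared."""

    def __init__(self):
        self._primary_ctx_cache = {}  # stands in for duplicated static data

    def get_primary_context(self, device_id):
        # Cache a stand-in handle per device, as a per-copy static would.
        return self._primary_ctx_cache.setdefault(device_id, object())

# Two modules, two caches: the "same" primary context resolves to two
# different handle objects, which is exactly the bug the single-.so design avoids.
mod_a, mod_b = StaticallyLinkedModule(), StaticallyLinkedModule()
assert mod_a.get_primary_context(0) is not mod_b.get_primary_context(0)

# With one shared module (the _resource_handles.so approach), every caller
# goes through the same cache and sees the same handle.
shared = StaticallyLinkedModule()
assert shared.get_primary_context(0) is shared.get_primary_context(0)
```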
-### Capsule 2: CUDA Driver API (`_CUDA_DRIVER_API_V1`)
+## CUDA Driver API Capsule (`_CUDA_DRIVER_API_V1`)

 **Problem**: cuda.core cannot directly call CUDA driver functions because:

@@ -186,13 +175,6 @@ struct CudaDriverApiV1 {
 The C++ code retrieves this capsule once (via `load_driver_api()`) and caches the
 function pointers for subsequent use.

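The exchange itself follows CPython's standard `PyCapsule` pattern: the producer wraps a pointer to a function-pointer table in a named capsule, and the consumer unwraps it once and caches the result. A hedged sketch using `ctypes` (the struct layout here is illustrative, not the real `CudaDriverApiV1` definition, and no CUDA is involved):

```python
import ctypes

# Declare the CPython capsule C API via ctypes.
pycapi = ctypes.pythonapi
pycapi.PyCapsule_New.restype = ctypes.py_object
pycapi.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
pycapi.PyCapsule_GetPointer.restype = ctypes.c_void_p
pycapi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

class DriverApiV1(ctypes.Structure):
    # Illustrative table header; the real layout lives in the C++ header.
    _fields_ = [("abi_version", ctypes.c_uint32),
                ("struct_size", ctypes.c_uint32)]

# Producer side: wrap a pointer to the table in a named capsule.
# `table` and `name` stay referenced at module level so the pointer stays valid.
table = DriverApiV1(1, ctypes.sizeof(DriverApiV1))
name = b"_CUDA_DRIVER_API_V1"
capsule = pycapi.PyCapsule_New(ctypes.addressof(table), name, None)

# Consumer side: unwrap once and cache (what load_driver_api() does in C++).
addr = pycapi.PyCapsule_GetPointer(capsule, name)
cached = DriverApiV1.from_address(addr)
assert cached.abi_version == 1
```

Checking `abi_version` and `struct_size` before use is what lets the two sides evolve independently without silent layout mismatches.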
-### Why Two Capsules?
-
-| Capsule | Direction | Purpose |
-|---------|-----------|---------|
-| `_CXX_API` | C++ → Cython | Share handle functions across modules |
-| `_CUDA_DRIVER_API_V1` | Cython → C++ | Provide resolved driver symbols |

## Key Implementation Details

### Structural Dependencies
@@ -276,14 +258,12 @@ Related functions:
 from cuda.core._resource_handles cimport (
     StreamHandle,
     create_stream_handle,
-    cu,
-    intptr,
+    as_cu,
+    as_intptr,
     as_py,
     get_last_error,
-    _init_handles_table,
 )

-_init_handles_table()  # prerequisite before calling handle API functions
-
 # Create a stream
 cdef StreamHandle h_stream = create_stream_handle(h_ctx, flags, priority)
 if not h_stream:
@@ -302,10 +282,10 @@ The resource handle design:

 1. **Separates resource management** into its own layer, independent of Python objects.
 2. **Encodes lifetimes structurally** via embedded handle dependencies.
-3. **Uses capsules** to solve two distinct problems:
-   - Sharing C++ code across Cython modules without duplicate statics.
-   - Resolving CUDA driver symbols dynamically through cuda-bindings.
-4. **Provides overloaded accessors** (`cu`, `intptr`, `py`) since handles cannot
+3. **Uses Cython's `cimport` mechanism** to share C++ code across modules without
+   duplicate static/thread-local state.
+4. **Uses a capsule** to resolve CUDA driver symbols dynamically through cuda-bindings.
+5. **Provides overloaded accessors** (`as_cu`, `as_intptr`, `as_py`) since handles cannot
    have attributes without unnecessary Python object wrappers.

 This architecture ensures CUDA resources are managed correctly regardless of Python
59 changes: 1 addition & 58 deletions cuda_core/cuda/core/_cpp/resource_handles.cpp
@@ -5,7 +5,6 @@
 #include <Python.h>

 #include "resource_handles.hpp"
-#include "resource_handles_cxx_api.hpp"
 #include <cuda.h>
 #include <cstdint>
 #include <cstring>
@@ -470,7 +469,7 @@ EventHandle create_event_handle(ContextHandle h_ctx, unsigned int flags) {
     return EventHandle(box, &box->resource);
 }

-EventHandle create_event_handle(unsigned int flags) {
+EventHandle create_event_handle_noctx(unsigned int flags) {
     return create_event_handle(ContextHandle{}, flags);
 }

@@ -857,60 +856,4 @@ DevicePtrHandle deviceptr_import_ipc(MemoryPoolHandle h_pool, const void* export
     }
 }

-// ============================================================================
-// Capsule C++ API table
-// ============================================================================
-
-const ResourceHandlesCxxApiV1* get_resource_handles_cxx_api_v1() noexcept {
-    static const ResourceHandlesCxxApiV1 table = []() {
-        ResourceHandlesCxxApiV1 t{};
-        t.abi_version = RESOURCE_HANDLES_CXX_API_VERSION;
-        t.struct_size = static_cast<std::uint32_t>(sizeof(ResourceHandlesCxxApiV1));
-
-        // Error handling
-        t.get_last_error = &get_last_error;
-        t.peek_last_error = &peek_last_error;
-        t.clear_last_error = &clear_last_error;
-
-        // Context
-        t.create_context_handle_ref = &create_context_handle_ref;
-        t.get_primary_context = &get_primary_context;
-        t.get_current_context = &get_current_context;
-
-        // Stream
-        t.create_stream_handle = &create_stream_handle;
-        t.create_stream_handle_ref = &create_stream_handle_ref;
-        t.create_stream_handle_with_owner = &create_stream_handle_with_owner;
-        t.get_legacy_stream = &get_legacy_stream;
-        t.get_per_thread_stream = &get_per_thread_stream;
-
-        // Event (resolve overloads explicitly)
-        t.create_event_handle =
-            static_cast<EventHandle (*)(ContextHandle, unsigned int)>(&create_event_handle);
-        t.create_event_handle_noctx =
-            static_cast<EventHandle (*)(unsigned int)>(&create_event_handle);
-        t.create_event_handle_ipc = &create_event_handle_ipc;
-
-        // Memory pool
-        t.create_mempool_handle = &create_mempool_handle;
-        t.create_mempool_handle_ref = &create_mempool_handle_ref;
-        t.get_device_mempool = &get_device_mempool;
-        t.create_mempool_handle_ipc = &create_mempool_handle_ipc;
-
-        // Device pointer
-        t.deviceptr_alloc_from_pool = &deviceptr_alloc_from_pool;
-        t.deviceptr_alloc_async = &deviceptr_alloc_async;
-        t.deviceptr_alloc = &deviceptr_alloc;
-        t.deviceptr_alloc_host = &deviceptr_alloc_host;
-        t.deviceptr_create_ref = &deviceptr_create_ref;
-        t.deviceptr_create_with_owner = &deviceptr_create_with_owner;
-        t.deviceptr_import_ipc = &deviceptr_import_ipc;
-        t.deallocation_stream = &deallocation_stream;
-        t.set_deallocation_stream = &set_deallocation_stream;
-
-        return t;
-    }();
-    return &table;
-}
-
 } // namespace cuda_core
2 changes: 1 addition & 1 deletion cuda_core/cuda/core/_cpp/resource_handles.hpp
@@ -91,7 +91,7 @@ EventHandle create_event_handle(ContextHandle h_ctx, unsigned int flags);
 // Use for temporary events that are created and destroyed in the same scope.
 // When the last reference is released, cuEventDestroy is called automatically.
 // Returns empty handle on error (caller must check).
-EventHandle create_event_handle(unsigned int flags);
+EventHandle create_event_handle_noctx(unsigned int flags);

 // Create an owning event handle from an IPC handle.
 // The originating process owns the event and its context.
79 changes: 0 additions & 79 deletions cuda_core/cuda/core/_cpp/resource_handles_cxx_api.hpp

This file was deleted.

4 changes: 0 additions & 4 deletions cuda_core/cuda/core/_device.pyx
@@ -17,15 +17,11 @@ from cuda.core._event cimport Event as cyEvent
 from cuda.core._event import Event, EventOptions
 from cuda.core._resource_handles cimport (
     ContextHandle,
-    _init_handles_table,
     create_context_handle_ref,
     get_primary_context,
     as_cu,
 )

-# Prerequisite before calling handle API functions (see _cpp/DESIGN.md)
-_init_handles_table()
-
 from cuda.core._graph import GraphBuilder
 from cuda.core._stream import IsStreamT, Stream, StreamOptions
 from cuda.core._utils.clear_error_support import assert_type
4 changes: 0 additions & 4 deletions cuda_core/cuda/core/_event.pyx
@@ -11,17 +11,13 @@ from cuda.core._context cimport Context
 from cuda.core._resource_handles cimport (
     ContextHandle,
     EventHandle,
-    _init_handles_table,
     create_event_handle,
     create_event_handle_ipc,
     as_intptr,
     as_cu,
     as_py,
 )

-# Prerequisite before calling handle API functions (see _cpp/DESIGN.md)
-_init_handles_table()
-
 from cuda.core._utils.cuda_utils cimport (
     check_or_create_options,
     HANDLE_RETURN
4 changes: 0 additions & 4 deletions cuda_core/cuda/core/_memory/_buffer.pyx
@@ -16,16 +16,12 @@ from cuda.core._memory cimport _ipc
 from cuda.core._resource_handles cimport (
     DevicePtrHandle,
     StreamHandle,
-    _init_handles_table,
     deviceptr_create_with_owner,
     as_intptr,
     as_cu,
     set_deallocation_stream,
 )

-# Prerequisite before calling handle API functions (see _cpp/DESIGN.md)
-_init_handles_table()
-
 from cuda.core._stream cimport Stream_accept, Stream
 from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
4 changes: 0 additions & 4 deletions cuda_core/cuda/core/_memory/_graph_memory_resource.pyx
@@ -10,14 +10,10 @@ from cuda.bindings cimport cydriver
 from cuda.core._memory._buffer cimport Buffer, Buffer_from_deviceptr_handle, MemoryResource
 from cuda.core._resource_handles cimport (
     DevicePtrHandle,
-    _init_handles_table,
     deviceptr_alloc_async,
     as_cu,
 )

-# Prerequisite before calling handle API functions (see _cpp/DESIGN.md)
-_init_handles_table()
-
 from cuda.core._stream cimport default_stream, Stream_accept, Stream
 from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
4 changes: 0 additions & 4 deletions cuda_core/cuda/core/_memory/_ipc.pyx
@@ -10,16 +10,12 @@ from cuda.core._memory._memory_pool cimport _MemPool
 from cuda.core._stream cimport Stream
 from cuda.core._resource_handles cimport (
     DevicePtrHandle,
-    _init_handles_table,
     create_mempool_handle_ipc,
     deviceptr_import_ipc,
     get_last_error,
     as_cu,
 )

-# Prerequisite before calling handle API functions (see _cpp/DESIGN.md)
-_init_handles_table()
-
 from cuda.core._stream cimport default_stream
 from cuda.core._utils.cuda_utils cimport HANDLE_RETURN
 from cuda.core._utils.cuda_utils import check_multiprocessing_start_method