Conversation
leofang
left a comment
Thanks, @oleksandr-pavlyk. One very high-level (and perhaps naive, since I am not familiar with oneAPI) question: my understanding is that oneAPI can target non-Intel devices, such as NVIDIA or AMD GPUs. What happens, then, when a oneAPI-based library targeting CUDA/HIP/OpenCL exchanges data with another library implemented natively on one of these devices?
Say I have kDLONEAPI_GPU --> kDLCUDA. Does an importer handle this in their DLPack implementation? Or is there a runtime flag somewhere that we can check and see if two libraries are actually running on an NVIDIA GPU? Or do we only allow exchanges between two oneAPI libraries (maybe context sharing is not trivial in oneAPI)?
include/dlpack/dlpack.h
Outdated
```c++
kDLONEAPI_GPU = 14,
kDLONEAPI_CPU = 15,
kDLONEAPI_ACCELERATOR = 16,
```
Could you add short comments for each entry for better clarity?
Addressed.
I do mean to try this scenario out, hopefully later this week, so for now I will be speaking hypothetically. The implementation of the SYCL API is provided by a backend; the open-source LLVM SYCL compiler has such a CUDA backend. In principle, SYCL entities (such as a SYCL device or SYCL context) allow one to retrieve the corresponding native object stored by the backend, using backend interoperability. This means, however, that for an importer, say cupy, to be able to import CUDA memory exported with a oneAPI device type, one would also need the oneAPI runtime to check the type of USM allocation, since DLPack uses different device codes for USM-device allocations (`kDLCUDA`) and USM-shared allocations (`kDLCUDAManaged`).
apps/from_usm_ndarray/main.py
Outdated
```python
def main():
    usm_ary = dpt.usm_ndarray((3, 1, 30), dtype="f4", buffer="device")
    usm_ary[:] = np.random.rand(3, 1, 30)
```
Assuming `f4` indicates a 4-byte float, you'll want:

```diff
-usm_ary[:] = np.random.rand(3, 1, 30)
+usm_ary[:] = np.random.rand(3, 1, 30).astype(np.float32)
```
Also, is this a valid access of a USM device allocation? A brief reading of Table 99 had me expecting host access to a USM device buffer to be forbidden.
One can cast on the host if it makes the code clearer, but it is up to the Python code in `__setitem__` to do the right thing, such as copying while coercing in a kernel executed on the device associated with the USM allocation.
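As a side note, the host-side cast being discussed can be sanity-checked with plain NumPy (a minimal sketch; variable names are illustrative):

```python
import numpy as np

# np.random.rand always produces float64; writing it into an "f4"
# (4-byte float) buffer needs a cast, either explicitly on the host
# as suggested above, or inside the container's __setitem__.
src = np.random.rand(3, 1, 30)
assert src.dtype == np.float64

cast = src.astype(np.float32)
assert cast.dtype == np.float32
assert cast.shape == (3, 1, 30)
```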
```c++
sycl::context ctxt = q.get_context();
delete d_ptr;

sycl::usm::alloc kind = sycl::get_pointer_type(tensor.data, ctxt);
```
In a zero-copy exchange of a USM allocation between two applications, is there additional handling that generated code needs to employ when the allocation kind is shared vs device? Said differently, if a backend consumes a dltensor with kDLONEAPI_GPU device type, will it be sufficient for the runtime to consume allocations of these two kinds interchangeably?
Once the importer figures out the SYCL context associated with a USM allocation, any SYCL kernel submitted to a queue associated with that context can access the USM memory irrespective of the USM allocation type. The runtime takes care of that.
Once a USM allocation is made, there is no way to change its allocation type. The type of a USM allocation is queried from the pointer and the SYCL context it is associated with. If the query is done against a different SYCL context, the returned type is "unknown".
In this particular application, line 68 serves to demonstrate that the query worked correctly, and thus that the correct context was reconstructed.
```c++
case sycl::usm::alloc::host:
    std::cout << "USM-host allocation-based array" << std::endl;
    break;
```
Can an allocation made outside of sycl::usm, e.g. by numpy, be zero copied into a DLTensor of type kDLONEAPI_CPU for use in a sycl program?
I wonder the same about CUDA/OpenCL buffer allocations, but limit the question to CPU allocations for simplicity.
Yes. Host memory can be accessed from within SYCL kernels through the use of `sycl::buffer`.
For example, see https://github.com/IntelPython/dpctl/blob/master/examples/pybind11/use_dpctl_syclqueue/pybind11_example.cpp#L61
The function `offloaded_array_mod` works on a regular NumPy array allocated using NumPy's host allocator.
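For comparison, a plain NumPy host allocation already advertises itself through DLPack as kDLCPU (device type 1), which is part of why the kDLCPU vs kDLONEAPI_CPU distinction above matters. A quick check, assuming NumPy >= 1.22:

```python
import numpy as np

a = np.arange(10)  # regular host allocation via NumPy's allocator
device_type, device_id = a.__dlpack_device__()

assert device_type == 1  # kDLCPU in DLPack's DLDeviceType enum
assert device_id == 0
```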
@tqchen The real purpose of this PR is to extend `DLDeviceType`. The app was only added to demonstrate feasibility. The POC code relies on the platform default context extension, which is implemented in open-source intel/llvm, but is not part of oneAPI 2021.3 or oneAPI 2021.4 yet.
Please let me know if I should remove the code from `apps` for this PR to move ahead.
@oleksandr-pavlyk indeed it would be helpful to separate it out so folks can focus on the standard itself.
Add kDLONEAPI_GPU, kDLONEAPI_CPU, kDLONEAPI_ACCELERATOR
Adds DLPack support for OneAPI SYCL root-devices, addressable with filter-selector,
e.g. ``sycl::ext::filter_selector("cpu:device_id")`` for kDLONEAPI_CPU devices.
Two parties wishing to zero-copy exchange USM allocations using DLPack need to bind their
allocations to the default platform context (implicitly used by the ``sycl::queue(dev)``
constructor), ensuring that both parties use the same context associated with the
agreed-upon SYCL device, and thus that USM allocations made by one party are accessible
to the other.
```c++
/* Routines to construct SYCL device from DLPack's device_id */
sycl::device
get_cpu_device(size_t dev_id) {
sycl::ext::oneapi::filter_selector fs("cpu:" + std::to_string(dev_id));
return sycl::device{fs};
}
sycl::device
get_gpu_device(size_t dev_id) {
sycl::ext::oneapi::filter_selector fs("gpu:" + std::to_string(dev_id));
return sycl::device{fs};
}
sycl::device
get_accelerator_device(size_t dev_id) {
sycl::ext::oneapi::filter_selector fs("accelerator:" + std::to_string(dev_id));
return sycl::device{fs};
}
sycl::context
get_default_context(const sycl::device &dev) {
auto p = dev.get_platform();
// feature-test macro defined by the default platform context extension
#ifdef SYCL_EXT_ONEAPI_DEFAULT_CONTEXT
return p.ext_oneapi_get_default_context();
#else
#error Required default platform context extension is not available, see https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/PlatformContext/PlatformContext.adoc
#endif
}
```
For a given root SYCL device, its device ID and device type enum can be constructed
as follows:
```c++
DLDeviceType
get_dlpack_device(const sycl::device &d, size_t &dev_id) {
DLDeviceType dev_type;
sycl::info::device_type sycl_dev_type =
d.get_info<sycl::info::device::device_type>();
switch(sycl_dev_type) {
case sycl::info::device_type::cpu:
dev_type = kDLONEAPI_CPU;
break;
case sycl::info::device_type::gpu:
dev_type = kDLONEAPI_GPU;
break;
case sycl::info::device_type::accelerator:
dev_type = kDLONEAPI_ACCELERATOR;
break;
default:
throw std::runtime_error(
"Custom SYCL devices are not supported by DLPack protocol"
);
}
constexpr int not_found = -1;
const auto &root_devices = sycl::device::get_devices();
sycl::default_selector mRanker;
int index = not_found;
for (const auto &root_device : root_devices) {
// rejected devices have negative score
if (mRanker(root_device) < 0)
continue;
if (sycl_dev_type == root_device.get_info<sycl::info::device::device_type>()) {
++index;
if (root_device == d)
break;
}
}
dev_id = index;
return dev_type;
}
```
Added comments next to kDLONEAPI_* enum entries per feedback
Force-pushed: 5cd08e9 to 473f0a6
I force pushed, removing additions to `apps`. Note that SYCL host devices are not represented in the proposed `DLDeviceType` entries.
cc @leofang @csullivan please take another look.
I see that the device, and therefore the device aspect (type: cpu/gpu/accelerator), can be queried from a USM allocation. @oleksandr-pavlyk, could it be sufficient, then, to only introduce a single entry? I started wondering along this track based on your above comment about kDLCPU, as it does seem strange to have both it and kDLOneAPI_CPU. If the only distinction comes down to whether data was allocated via a usm_allocator, then perhaps a single entry in the DLDeviceType could suffice.
I pushed the removed changes to a branch in my fork (https://github.com/oleksandr-pavlyk/dlpack/tree/app-from-usm-ndarray). |
Good point. Formulating the queries requires a SYCL context, and the answer depends on which context the pointer is queried against:

```c++
#include <CL/sycl.hpp>
#include <cassert>

int main(void) {
    sycl::device d( sycl::default_selector{} ); // create device
    sycl::context ctx1(d);
    sycl::context ctx2(d);
    double *p = sycl::malloc_device<double>(1024, d, ctx2);
    // query against a context different from the one used for allocation
    sycl::usm::alloc allocation_type = sycl::get_pointer_type(p, ctx1);
    assert( allocation_type == sycl::usm::alloc::device );
    sycl::free(p, ctx2);
    return 0;
}
```

Now, compiling and running this:

```shell
$ dpcpp a.cpp -o a.out
$ SYCL_DEVICE_FILTER=opencl:gpu ./a.out
a.out: a.cpp:13: int main(): Assertion `allocation_type == sycl::usm::alloc::device' failed.
Aborted (core dumped)
```

Identifying the common context is possible with oneAPI's SYCL extensions. The first relevant extension is the filter selector, to map a DLPack device ID to a SYCL device. The next ingredient is the platform default context, which provides a canonical context to associate with any unpartitioned device. If both the DLPack exporter and the DLPack importer use this context, the USM allocation created by the exporter can be accessed by the importer.

Now, answering your specific question: one can in fact use just one additional enum entry. Mapping from a root device to DLPack cannot be used to hand off USM allocations created on sub-devices, or bound to non-canonical contexts.
My suggestion to add 3 enum entries was aiming to keep the device type explicit, but since DLPack is not being directly used by end users, I am fine with using only one enum.
Interesting, thanks for the detailed explanation @oleksandr-pavlyk. So, IIUC, say I can access a CPU, an Intel GPU, an NVIDIA GPU, and an FPGA in the same process. Another question with regard to how oneAPI is intended to use DLPack: do you require
@oleksandr-pavlyk Thank you for the detailed follow-up on this. Given the flexibility to derive the sycl::device from the device id as you've shown, my preference would be to introduce the single DLDeviceType::kDLOneAPI as the initial OneAPI support in DLPack.
One more question for @oleksandr-pavlyk: any chance you have verified the statements in #78 (comment)? I guess ultimately, to decide whether we want to keep 1 enum vs 3 enums, we need to know:

The same question also applies to the unified shared memory. Based on my reading, it seems easiest if we don't consider the interaction between oneAPI and other non-oneAPI-based frameworks (because the oneAPI runtime is needed for the lookup). If that's the case, then indeed a single enum would suffice.
The upside of keeping just `kDLOneAPI` is simplicity. The downside is that the oneAPI runtime is required to query the device type (is it a CPU device, or a GPU device). USM allocations made on … Allocations made on … If the oneAPI device has a CUDA backend …
csullivan
left a comment
Thanks @oleksandr-pavlyk!
The upside of keeping just DLDeviceType::kDLOneAPI is simplicity. The downside is that OneAPI run-time is required to query a device type (is it a CPU device, or a GPU device).
True. If there are notable cases in which this is overly burdensome in the future, we certainly can consider extending the DLDeviceTypes to include others.
Looks good to me, approved sans one nitpick typo below.
Co-authored-by: Chris Sullivan <csullivan@octoml.ai>
leofang
left a comment
Thanks, @oleksandr-pavlyk. My only nitpick is that we need to expand the inline comment to mention that the oneAPI runtime is expected to be used to retrieve the device type. Once it's added, I think it's good to go!
@leofang I have expanded the comment in two ways: 1. to indicate that DLPack is sharing a USM allocation, 2. to note that a oneAPI runtime call is required to learn more about the device type as well as the USM allocation type.
leofang
left a comment
LGTM, thanks @oleksandr-pavlyk!
Just for my own curiosity, if we store a
I do not think so. The runtime is still needed to retrieve the native object.
@tqchen I believe we're good to go? 🙂
Thanks everyone, this is merged!
This PR proposes to extend the ``DLDeviceType`` enum with 3 new entries: ``kDLONEAPI_GPU``, ``kDLONEAPI_CPU``, and ``kDLONEAPI_ACCELERATOR``.

This adds DLPack support for OneAPI SYCL root-devices, addressable with filter-selector, e.g. ``sycl::ext::oneapi::filter_selector("gpu:device_id")`` for ``kDLONEAPI_GPU`` devices.

Two parties wishing to zero-copy exchange USM allocations using DLPack need to bind their allocations to the default platform context (implicitly used by the ``sycl::queue(dev)`` constructor), ensuring that both parties use the same ``sycl::context`` associated with the agreed-upon SYCL device, thus ensuring that USM allocations made by one party are accessible to the other.

An application ``apps/from_usm_ndarray`` is included in this PR to demonstrate a working prototype, compiled with DPC++ 2021.3, or the open-source LLVM SYCL compiler release 2021-07.