Add oneapi support#78

Merged
tqchen merged 5 commits into dmlc:main from oleksandr-pavlyk:add-oneapi-support
Oct 28, 2021

Conversation

@oleksandr-pavlyk
Collaborator

This PR proposes to extend DLDeviceType enum with 3 new entries:
kDLONEAPI_GPU, kDLONEAPI_CPU, and kDLONEAPI_ACCELERATOR.

This adds DLPack support for OneAPI SYCL root-devices, addressable with
filter-selector, e.g. sycl::ext::oneapi::filter_selector("gpu:device_id") for
kDLONEAPI_GPU devices.

Two parties wishing to zero-copy exchange USM allocations using DLPack need to bind their
allocations to the default platform context (implicitly used by the sycl::queue(dev)
constructor). This ensures that both parties use the same sycl::context associated with the
agreed-upon SYCL device, so that USM allocations made by one party are accessible
to the other.

An application, apps/from_usm_ndarray, is included in this PR to demonstrate a working prototype,
compiled with DPC++ 2021.3 or the open-source LLVM SYCL compiler release 2021-07.

@oleksandr-pavlyk
Collaborator Author

@tqchen @leofang @rgommers Please provide feedback

Collaborator

@leofang leofang left a comment


Thanks, @oleksandr-pavlyk. One very high level (and perhaps naive, since I am not familiar with oneAPI) question: My understanding is oneAPI can target non-Intel devices, such as NVIDIA or AMD GPUs. What happens, then, when a oneAPI-based library targeting CUDA/HIP/OpenCL exchanges with another library implemented natively on one of these devices?

Say I have kDLONEAPI_GPU --> kDLCUDA. Does an importer handle this in their DLPack implementation? Or is there a runtime flag somewhere that we can check and see if two libraries are actually running on an NVIDIA GPU? Or do we only allow exchanges between two oneAPI libraries (maybe context sharing is not trivial in oneAPI)?

Comment on lines +71 to +85
kDLONEAPI_GPU = 14,
kDLONEAPI_CPU = 15,
kDLONEAPI_ACCELERATOR = 16,
Collaborator


Could you add short comments for each entry for better clarity?

Collaborator Author


Addressed.

@oleksandr-pavlyk
Collaborator Author

oleksandr-pavlyk commented Sep 27, 2021

> Thanks, @oleksandr-pavlyk. One very high level (and perhaps naive, since I am not familiar with oneAPI) question: My understanding is oneAPI can target non-Intel devices, such as NVIDIA or AMD GPUs. What happens, then, when a oneAPI-based library targeting CUDA/HIP/OpenCL exchanges with another library implemented natively on one of these devices?
>
> Say I have kDLONEAPI_GPU --> kDLCUDA. Does an importer handle this in their DLPack implementation? Or is there a runtime flag somewhere that we can check and see if two libraries are actually running on an NVIDIA GPU? Or do we only allow exchanges between two oneAPI libraries (maybe context sharing is not trivial in oneAPI)?

I do mean to try this scenario out, hopefully later this week, so for now I will be speaking hypothetically.

The implementation of the SYCL API is provided by a backend. The open-source LLVM SYCL compiler has such a CUDA backend.

In principle, SYCL entities (such as a SYCL device or SYCL context) allow one to retrieve the corresponding native object stored by the backend, using backend interoperability. This means, however, that for an importer, say CuPy, to be able to import CUDA memory exported with device_type=kDLONEAPI_GPU, one needs the oneAPI runtime to check whether the backend is sycl::backend::cuda, and to get the appropriate CUDART objects.

One would also need the oneAPI runtime to check the type of the USM allocation, since DLPack uses different device codes for USM-device allocation (kDLGPU) vs. USM-shared allocation (I surmise this should correspond to kDLCUDAManaged).

@tqchen tqchen self-assigned this Sep 27, 2021

```python
def main():
    usm_ary = dpt.usm_ndarray((3, 1, 30), dtype="f4", buffer="device")
    usm_ary[:] = np.random.rand(3, 1, 30)
```
Contributor


Assuming f4 indicates a 4-byte float, you'll want

Suggested change:

```diff
- usm_ary[:] = np.random.rand(3, 1, 30)
+ usm_ary[:] = np.random.rand(3, 1, 30).astype(np.float32)
```

Contributor


Also... is this a valid access of a USM device allocation? A brief reading of Table 99 had me expecting host access to a USM device buffer to be forbidden.

Collaborator Author


One can cast on the host if it makes the code clearer, but it is up to the Python code in __setitem__ to do the right thing, such as copying while coercing in a kernel executed on the device associated with the USM allocation.

```c++
sycl::context ctxt = q.get_context();
delete d_ptr;

sycl::usm::alloc kind = sycl::get_pointer_type(tensor.data, ctxt);
```
Contributor


In a zero-copy exchange of a USM allocation between two applications, is there additional handling that generated code needs to employ when the allocation kind is shared vs device? Said differently, if a backend consumes a dltensor with kDLONEAPI_GPU device type, will it be sufficient for the runtime to consume allocations of these two kinds interchangeably?

Collaborator Author


Once the importer figures out the SYCL context associated with the USM allocation, any SYCL kernel submitted to a queue associated with that context can access the USM memory, irrespective of the USM allocation type. The runtime takes care of that.

Once a USM allocation is made, there is no way to change its allocation type. The type of a USM allocation is queried from the pointer and the SYCL context it is associated with. If the query is made against a different SYCL context, the returned type is "unknown".

In this particular application, line 68 serves to demonstrate that the query worked correctly, and thus that the correct context was reconstructed.

Comment on lines +76 to +78
```c++
case sycl::usm::alloc::host:
    std::cout << "USM-host allocation-based array" << std::endl;
    break;
```
Contributor


Can an allocation made outside of sycl::usm, e.g. by NumPy, be zero-copied into a DLTensor of type kDLONEAPI_CPU for use in a SYCL program?

I wonder the same about CUDA/OpenCL buffer allocations, but limit the question to CPU allocations for simplicity.

Collaborator Author


Yes. Host memory can be accessed from within SYCL kernels through the use of sycl::buffer.

For example, see https://github.com/IntelPython/dpctl/blob/master/examples/pybind11/use_dpctl_syclqueue/pybind11_example.cpp#L61

The function offloaded_array_mod works on a regular NumPy array allocated using NumPy's host allocator:

https://github.com/IntelPython/dpctl/blob/master/examples/pybind11/use_dpctl_syclqueue/example.py#L39

@oleksandr-pavlyk
Collaborator Author

@tqchen The real purpose of this PR is to extend DLDeviceType enum in dlpack/include/dlpack.h.

The app was only added to demonstrate feasibility. The POC code relies on the platform default context extension, which is implemented in open-source intel/llvm, but is not yet part of oneAPI 2021.3 or oneAPI 2021.4.

An example of building dpctl with the open-source LLVM SYCL bundle can be found here.

dpctl has just enabled use of this default context in IntelPython/dpctl#627 and so the POC should work now.

Please let me know if I should remove the code from apps for this PR to move ahead.

@tqchen
Member

tqchen commented Oct 14, 2021

@oleksandr-pavlyk indeed it would be helpful to separate it out so folks can focus on the standard itself

…I_ACCELERATOR

Adds DLPack support for oneAPI SYCL root devices, addressable with a filter selector,
e.g. ``sycl::ext::oneapi::filter_selector("cpu:device_id")`` for kDLONEAPI_CPU devices.

Two parties wishing to zero-copy exchange USM allocations using DLPack need to bind their
allocations to the default platform context (implicitly used by the ``sycl::queue(dev)``
constructor). This ensures that both parties use the same context associated with the
agreed-upon SYCL device, so that USM allocations made by one party are accessible
to the other.

```c++
/* Routines to construct SYCL device from DLPack's device_id */

sycl::device
get_cpu_device(size_t dev_id) {
    sycl::ext::oneapi::filter_selector fs("cpu:" + std::to_string(dev_id));
    return sycl::device{fs};
}

sycl::device
get_gpu_device(size_t dev_id) {
    sycl::ext::oneapi::filter_selector fs("gpu:" + std::to_string(dev_id));
    return sycl::device{fs};
}

sycl::device
get_accelerator_device(size_t dev_id) {
    sycl::ext::oneapi::filter_selector fs("accelerator:" + std::to_string(dev_id));
    return sycl::device{fs};
}

sycl::context
get_default_context(const sycl::device &dev) {
    auto p = dev.get_platform();
    // Requires the default platform context extension; without it this call
    // is unavailable, see
    // https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/PlatformContext/PlatformContext.adoc
    return p.ext_oneapi_get_default_context();
}
```

For a given root SYCL device, its device ID and device type enum can be constructed
as follows:

```c++
DLDeviceType
get_dlpack_device(const sycl::device &d, size_t &dev_id) {
    DLDeviceType dev_type;
    sycl::info::device_type sycl_dev_type =
        d.get_info<sycl::info::device::device_type>();
    switch (sycl_dev_type) {
    case sycl::info::device_type::cpu:
        dev_type = kDLONEAPI_CPU;
        break;
    case sycl::info::device_type::gpu:
        dev_type = kDLONEAPI_GPU;
        break;
    case sycl::info::device_type::accelerator:
        dev_type = kDLONEAPI_ACCELERATOR;
        break;
    default:
        throw std::runtime_error(
            "Custom SYCL devices are not supported by DLPack protocol");
    }
    constexpr int not_found = -1;
    const auto &root_devices = sycl::device::get_devices();
    sycl::default_selector mRanker;
    int index = not_found;
    for (const auto &root_device : root_devices) {
        // rejected devices have negative score
        if (mRanker(root_device) < 0)
            continue;
        if (sycl_dev_type == root_device.get_info<sycl::info::device::device_type>()) {
            ++index;
            if (root_device == d)
                break;
        }
    }
    dev_id = index;
    return dev_type;
}
```

Added comments next to kDLONEAPI_* enum entries per feedback
@oleksandr-pavlyk
Collaborator Author

> @oleksandr-pavlyk indeed it would be helpful to separate it out so folks can focus on the standard itself

I force pushed, removing additions to apps/ folder.

Note that SYCL host devices are not represented in the DLDeviceType enum, but kDLCPU can be safely used for host devices.

@tqchen
Copy link
Member

tqchen commented Oct 18, 2021

cc @leofang @csullivan please take another look

@csullivan
Contributor

csullivan commented Oct 19, 2021

I see that the device, and therefore the device aspect (type: cpu/gpu/accelerator), can be queried from a USM allocation. @oleksandr-pavlyk, could it then be sufficient to only introduce kDLOneAPI?

I started wondering along this track based on your above comment about kDLCPU, as it does seem strange to have both it and kDLOneAPI_CPU. If the only distinction comes down to whether data was allocated via a usm_allocator, then perhaps a single entry in DLDeviceType could suffice.

@oleksandr-pavlyk
Collaborator Author

> I force pushed, removing additions to apps/ folder.

I pushed the removed changes to a branch in my fork (https://github.com/oleksandr-pavlyk/dlpack/tree/app-from-usm-ndarray).

@oleksandr-pavlyk
Collaborator Author

oleksandr-pavlyk commented Oct 19, 2021

> I see that the device, and therefore the device aspect, (type: cpu/gpu/accelerator) can be queried from a USM allocation. @oleksandr-pavlyk, could it be sufficient to then only introduce kDLOneAPI?

Good point. Formulating the queries requires a sycl::context. The DLPack importer must reconstruct this context to get a copy of the same context that the exporter used. Queries made against a different context (even one addressing the same device) may not return the expected results.

```c++
#include <CL/sycl.hpp>

int main(void) {
   sycl::device d( sycl::default_selector{} );  // create device

   sycl::context ctx1(d);
   sycl::context ctx2(d);

   double *p = sycl::malloc_device<double>(1024, d, ctx2);

   sycl::usm::alloc allocation_type = sycl::get_pointer_type(p, ctx1);

   assert( allocation_type == sycl::usm::alloc::device);
   sycl::free(p, ctx2);
   return 0;
}
```

Now, compiling and running this:

```shell
$ dpcpp a.cpp -o a.out
$ SYCL_DEVICE_FILTER=opencl:gpu ./a.out
a.out: a.cpp:13: int main(): Assertion `allocation_type == sycl::usm::alloc::device' failed.
Aborted (core dumped)
```

Identifying the common context is possible with oneAPI's SYCL extensions. The first relevant extension is the filter selector, which maps a device_id to an actual root (unpartitioned) sycl::device. The root devices are created by the DPC++ runtime; all sycl::device instances are references to these singletons.

The next ingredient is the platform default context, which provides a canonical context to associate with any unpartitioned device. If both the DLPack exporter and the DLPack importer use this context, the USM allocation created by the exporter can be accessed by the importer.

Now, answering your specific question: one can in fact use just one additional enum, kDLOneAPI.

```c++
inline sycl::device get_sycl_device_by_device_id(unsigned int device_id) {
    return sycl::device( sycl::ext::oneapi::filter_selector{ std::to_string(device_id) } );
}
```

Mapping from a root device to a device_id is also well-defined, since sycl::get_devices() provides a stable ordering on the same platform and, as relevant to DLPack, in the same process.

```python
In [1]: import dpctl

In [2]: dpctl.SyclDevice().get_filter_string()
Out[2]: 'level_zero:gpu:0'

In [3]: dpctl.SyclDevice().get_filter_string(include_backend=False, include_device_type=False)
Out[3]: '6'

In [4]: dpctl.get_devices()[6] == dpctl.SyclDevice()
Out[4]: True
```

DLPack cannot be used to hand off USM allocations created on sub-devices, or bound to non-canonical contexts.

> I started wondering along this track based on your above comment about kDLCPU as it does seem strange to have both it and kDLOneAPI_CPU. If the only distinction comes down to whether data was allocated via a usm_allocator, then perhaps a single entry to the DLDeviceType could suffice.

My suggestion to add 3 enum entries was aiming to keep the device type explicit, but since DLPack is not used directly by users, I am fine with using only 1 enum.

@leofang
Collaborator

leofang commented Oct 19, 2021

Interesting, thanks for the detailed explanation @oleksandr-pavlyk. So, IIUC, if I can access a CPU, an Intel GPU, an NVIDIA GPU, and an FPGA in the same process, sycl::get_devices() can assign a unique device ID to each of these devices?

Another question with regard to how it's intended for oneAPI to use DLPack: Do you require manager_ctx to hold a pointer to sycl::context, or the look-up via extension as you suggested is sufficient?

@csullivan
Contributor

@oleksandr-pavlyk Thank you for the detailed follow up on this. Given the flexibility to derive the sycl::device from the device id as you've shown, my preference would be to introduce the single DLDeviceType::kDLOneAPI as the initial OneAPI support in DLPack.

@leofang
Collaborator

leofang commented Oct 19, 2021

One more question for @oleksandr-pavlyk: Any chance you have verified the statements in #78 (comment)?

I guess ultimately to decide whether we wanna keep 1 enum vs 3 enums we need to know:

  • How does kDLONEAPI_CPU interact with kDLCPU?
  • How does kDLONEAPI_GPU interact with kDLCUDA, kDLROCM and their host/managed counterparts, if any of them is used as the SYCL backend?
  • How does kDLONEAPI_ACCELERATOR interact with kDLOpenCL?

The same question also applies to the unified kDLONEAPI.

Based on my reading, it seems easiest not to consider the interaction between oneAPI and other non-oneAPI-based frameworks (because the oneAPI runtime is needed for the look-up). If that is the case then indeed a single kDLONEAPI would be sufficient. But it's best to consider all potential possibilities before moving forward.

@oleksandr-pavlyk
Collaborator Author

The upside of keeping just DLDeviceType::kDLOneAPI is simplicity. The downside is that the oneAPI runtime is required to query the device type (is it a CPU device or a GPU device?).

USM allocations made on a kDLONEAPI_CPU SYCL device are accessible to any host application, similar to kDLCPU, but also allow for synchronization if the exporter and importer both use the oneAPI runtime.

Allocations made on DLDeviceType::kDLOneAPI devices are USM-based, hence they do not interoperate with kDLOpenCL allocations, which pass cl_mem objects (akin to sycl::buffer) rather than pointers.

If the oneAPI device has CUDA backend (sycl::device::get_backend() == sycl::backend::cuda), one can use the USM pointer in CUDA RT functions provided that the CUDA context stored in the sycl::context to which the USM allocation is bound, retrievable via sycl::get_native<sycl::backend::cuda>(ctx), is the CUDA device's primary context for this device (the one returned by cuCtxGetCurrent). If not, the importer may need to use cuCtxPushCurrent to make it current.

Contributor

@csullivan csullivan left a comment


Thanks @oleksandr-pavlyk!

> The upside of keeping just DLDeviceType::kDLOneAPI is simplicity. The downside is that OneAPI run-time is required to query a device type (is it a CPU device, or a GPU device).

True. If there are notable cases in which this is overly burdensome in the future, we certainly can consider extending the DLDeviceTypes to include others.

Looks good to me, approved sans one nitpick typo below.

Co-authored-by: Chris Sullivan <csullivan@octoml.ai>
Collaborator

@leofang leofang left a comment


Thanks, @oleksandr-pavlyk. My only nitpick is that we need to expand the inline comment to mention that the oneAPI runtime is expected to be used to retrieve the device type. Once it's added I think it's good to go!

@oleksandr-pavlyk
Collaborator Author

@leofang I have expanded the comment in two ways: 1. to indicate that DLPack is sharing a USM allocation, 2. to note that a oneAPI runtime call is required to learn more about the device type as well as the USM allocation type.

Collaborator

@leofang leofang left a comment


LGTM, thanks @oleksandr-pavlyk!

@leofang
Collaborator

leofang commented Oct 25, 2021

> If the oneAPI device has CUDA backend (sycl::device::get_backend() == sycl::backend::cuda), one can use the USM pointer in CUDA RT functions provided that the CUDA context stored in the sycl::context to which the USM allocation is bound, retrievable via sycl::get_native<sycl::backend::cuda>(ctx), is the CUDA device's primary context for this device (the one returned by cuCtxGetCurrent). If not, the importer may need to use cuCtxPushCurrent to make it current.

Just for my own curiosity, if we store a sycl::context pointer in the DLManagedTensor.manager_ctx field, would it allow us to bypass oneAPI runtime calls when interfacing with CUDA/HIP?

@oleksandr-pavlyk
Collaborator Author

> If the oneAPI device has CUDA backend (sycl::device::get_backend() == sycl::backend::cuda), one can use the USM pointer in CUDA RT functions provided that the CUDA context stored in the sycl::context to which the USM allocation is bound, retrievable via sycl::get_native<sycl::backend::cuda>(ctx), is the CUDA device's primary context for this device (the one returned by cuCtxGetCurrent). If not, the importer may need to use cuCtxPushCurrent to make it current.
>
> Just for my own curiosity, if we store a sycl::context pointer in the DLManagedTensor.manager_ctx field, would it allow us to bypass oneAPI runtime calls when interfacing with CUDA/HIP?

I do not think so. The runtime is still needed to retrieve the native object via sycl::get_native<sycl::backend::cuda>(sycl_ctx).

@leofang
Collaborator

leofang commented Oct 28, 2021

@tqchen I believe we're good to go? 🙂

@tqchen tqchen merged commit 173fe96 into dmlc:main Oct 28, 2021
@tqchen
Member

tqchen commented Oct 28, 2021

Thanks everyone this is merged!
