Make GPU kernel compilation caching consistent across GPU backends. #5546
…all by removing Comdat IR annotations in runtime on Mac OS and iOS.
cache for kernels. Introduces a finalization routine for kernel compilation that indicates when kernels are no longer strictly required to be defined, so that they can be unloaded or discarded at that point, but not while they are still needed.
Quick fix for syntax error in C codegen. Tab fixes. Makefile fixes.
steven-johnson
left a comment
LGTM pending green
Updated to master just to tickle the buildbots.
https://buildbot.halide-lang.org/master/#/builders/25/builds/15 CMake Error at cmake/AddCudaToTarget.cmake:3 (target_link_libraries):
Make Metal context creation test API consistent with CUDA and OpenCL by having it return a success/fail indication instead of asserting internally.
the device context when compiling a kernel.
Looks like
Looks like we now have only one failure: correctness_gpu_many_kernels for D3D12Compute
At this point I assume we want to bring in the D3D12 experts to help figure out the last gotcha here?
Going to take a look at the
I think I can give it a shot tomorrow!
Status report: So far, it's still inconclusive... :( Curiously, I can build the project with msbuild from the command line and run the executable.
Status report: OK, the issue seems to be related to releasing the device. I'll investigate further next time.
OK, I think I found the issue (a very silly one).
As for the new failure case:
correctness_interpreter is clearly a flake of some sort -- it's due to be investigated after SIGGRAPH deadlines pass. It shouldn't block landing this.
Ready to land? |
Fine by me! |
This is a continuation of #5474.
Move to using common code for kernel compilation caching in the CUDA, OpenCL, Metal, and D3D12 GPU runtimes. The new cache is designed never to use a kernel compiled for one context on another, and it uses a hash table to avoid small allocations spread across multiple pages of virtual memory. OpenCL was particularly broken: code using two contexts was almost guaranteed to fail. This PR also opens the door to better client control of caching, such as setting a size limit or evicting specific kernels. It is also pretty close to allowing runtime overloads of the kernel compilation itself, which would enable persistent caching across process invocations for GPU APIs that support it. (The compile_kernel function in multiple files needs to be promoted to a client-visible runtime overload for each GPU API.)
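As a rough illustration of the "never use a kernel compiled for one context on another" property, the sketch below keys each cache entry on both the context and a module id. Everything in it is hypothetical: `ContextHandle`, `CompiledKernel`, `KernelCache`, and the use of `std::unordered_map` are stand-ins chosen for brevity and are not the runtime's actual types or its hash table. It is also not thread safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for a backend context and a compiled module.
using ContextHandle = const void *;
struct CompiledKernel {
    std::string binary;
};

// The key includes the context, so an entry compiled for one context
// can never be returned for a different one.
struct CacheKey {
    ContextHandle context;
    uint32_t module_id;
    bool operator==(const CacheKey &other) const {
        return context == other.context && module_id == other.module_id;
    }
};

struct CacheKeyHash {
    size_t operator()(const CacheKey &key) const {
        size_t h = std::hash<const void *>()(key.context);
        size_t m = std::hash<uint32_t>()(key.module_id);
        return h ^ (m + 0x9e3779b9 + (h << 6) + (h >> 2));
    }
};

class KernelCache {
public:
    // Returns the cached kernel for (ctx, module_id), compiling it only
    // on the first request for that pair.
    template<typename CompileFn>
    const CompiledKernel &lookup_or_compile(ContextHandle ctx, uint32_t module_id,
                                            CompileFn compile) {
        CacheKey key{ctx, module_id};
        auto it = entries.find(key);
        if (it == entries.end()) {
            it = entries.emplace(key, compile(ctx)).first;
        }
        return it->second;
    }

    // Drops every entry belonging to a context that is being released.
    void release_context(ContextHandle ctx) {
        for (auto it = entries.begin(); it != entries.end();) {
            if (it->first.context == ctx) {
                it = entries.erase(it);
            } else {
                ++it;
            }
        }
    }

    size_t size() const {
        return entries.size();
    }

private:
    std::unordered_map<CacheKey, CompiledKernel, CacheKeyHash> entries;
};
```

With this shape, requesting the same module on a second context triggers a second compilation instead of reusing the first context's result, and releasing a context removes only that context's entries.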
Tests are added to cover many kernels and more than one context. A test using multiple contexts across multiple threads both tests things that didn't necessarily work before and provides an example for a common use case.
Two small fixes to CUDA prevent a crash in a very rare error case and make device release work if the CUDA library is linked directly into the app. (The latter would have shown up as a crash due to allocation caching for static linking as the code to release allocations when freeing a context did not run.)
OpenGL and OpenGLCompute were not addressed in this PR due to both time limitations and because there are more significant issues in these runtimes around this area. OpenGL is basically a Superfund site at this point and should be deleted. OpenGLCompute may or may not be worth preserving, though similar work is needed re: how kernels are communicated to the runtime and compiled.
Kernel compilations are now reference counted: they are marked as held when the initialize kernels call is made for a filter, and released by a new finalization call made in the destructor section of a filter invocation. This is required to get both object lifetime and multiple-context cache releasing to work with per-device cached APIs such as Metal and D3D12.
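The hold and release pairing can be sketched as follows. This is a minimal illustration under assumptions, not the runtime's code: `KernelHolds`, `ScopedKernelHold`, and `evict_unheld` are hypothetical names, and the RAII guard merely mimics the idea of releasing the hold in the destructor section of a filter invocation.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical bookkeeping for one cached kernel compilation.
struct HeldKernel {
    int use_count = 0;
};

class KernelHolds {
public:
    // Corresponds to the "initialize kernels" step: the entry is created
    // if needed and marked as in use.
    void hold(uint32_t module_id) {
        kernels[module_id].use_count++;
    }

    // Corresponds to the finalization step: the entry stays cached but
    // is no longer required to be defined.
    void release_hold(uint32_t module_id) {
        auto it = kernels.find(module_id);
        if (it != kernels.end() && it->second.use_count > 0) {
            it->second.use_count--;
        }
    }

    // Discards only entries that no running filter holds.
    // Returns the number of entries discarded.
    int evict_unheld() {
        int evicted = 0;
        for (auto it = kernels.begin(); it != kernels.end();) {
            if (it->second.use_count == 0) {
                it = kernels.erase(it);
                evicted++;
            } else {
                ++it;
            }
        }
        return evicted;
    }

    bool is_cached(uint32_t module_id) const {
        return kernels.count(module_id) != 0;
    }

private:
    std::unordered_map<uint32_t, HeldKernel> kernels;
};

// Holds a kernel for the duration of a scope, standing in for one
// filter invocation.
class ScopedKernelHold {
public:
    ScopedKernelHold(KernelHolds &holds_in, uint32_t module_id_in)
        : holds(holds_in), module_id(module_id_in) {
        holds.hold(module_id);
    }
    ~ScopedKernelHold() {
        holds.release_hold(module_id);
    }
    ScopedKernelHold(const ScopedKernelHold &) = delete;
    ScopedKernelHold &operator=(const ScopedKernelHold &) = delete;

private:
    KernelHolds &holds;
    uint32_t module_id;
};
```

While a `ScopedKernelHold` is alive, `evict_unheld` leaves that kernel alone. Once the scope ends the kernel remains cached for reuse but becomes eligible for eviction.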
Opportunistic fix to a syntax error in the output of the C++ codegen back end.