vulkan: improve partial offloading performance on AMD #19976
Conversation
Would this help with --cpu-moe (weight offloading to CPU)?

It should, yes. Any partial offload.

I checked out the pull request, and it was indeed faster with --n-cpu-moe as well (on an 890M integrated GPU).
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

Disabling the Intel GPU avoids the crash.
@characharm Thanks for testing it, there was a check missing. Can you try again?
Yeah, looks like it's not crashing anymore. |
The transfer queue is supported since Mesa 26, but hidden behind a flag. It is also supported on Intel, but only available when using the Xe DRM driver (rather than i915) on gen > 12.5 hardware.
I know, I've tested it. It's slightly faster than using the graphics queue. The majority of the improvement comes from any second queue.
That's good to know, thank you! I'll give it a try with Xe.

I get mixed results with Xe: slightly improved pp (prompt processing), slightly worse tg (token generation). I'll leave it off there, for now.
So do we have to upgrade to Mesa 26, or is the RADV_PERFTEST flag also available on older versions?

It's available on older versions as well. I'm not sure why it was reported again for 26.0.
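For illustration, here is a minimal Vulkan-Hpp sketch (not code from this PR; the function name is hypothetical) of how an application can check whether the driver exposes a dedicated transfer-only queue family. On RADV, to my understanding, this family only appears when the corresponding RADV_PERFTEST option (transfer_queue) is set.

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>
#include <optional>

// Look for a queue family that supports transfer but neither graphics nor
// compute, i.e. a dedicated DMA queue. On RADV this family is hidden unless
// RADV_PERFTEST=transfer_queue is set (assumption based on the discussion above).
std::optional<uint32_t> find_dedicated_transfer_family(vk::PhysicalDevice dev) {
    auto families = dev.getQueueFamilyProperties();
    for (uint32_t i = 0; i < families.size(); i++) {
        auto flags = families[i].queueFlags;
        if ((flags & vk::QueueFlagBits::eTransfer) &&
            !(flags & vk::QueueFlagBits::eGraphics) &&
            !(flags & vk::QueueFlagBits::eCompute)) {
            return i;
        }
    }
    return std::nullopt; // fall back to a second graphics/compute queue
}
```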
* vulkan: fix and enable cpy_tensor_async function
* use transfer_queue for async transfers on AMD, synchronize with timeline semaphore
* update offload_op logic
* fix missing transfer submission
* disable async transfer queue on AMD GCN
* revert op batch size change
* fix cpy_tensor_async checks
This merged PR seems to be causing gibberish output on the Vulkan backend with flash attention enabled (-fa on) on Intel GPUs.

I'm also getting gibberish on my setup with this: qwen3.5-35b on 2 Radeon Pro V340Ls. Reverting this fixed it; I haven't done any other debugging.

Please see if #20233 helped.

Seems like that fixed it!
I saw a big difference between Vulkan and ROCm performance in partial offloads. I narrowed it down to the speed of CPU-to-GPU weight transfers for offloaded ops. One possible explanation was that the dedicated transfer queue on AMD may be faster than a compute queue, so I implemented async transfers on the transfer queue, synchronized with a timeline semaphore. This does improve performance.
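To illustrate the synchronization pattern, here is a condensed Vulkan-Hpp sketch, not the PR's actual code: the transfer queue signals a timeline semaphore value when the copy finishes, and the compute submission waits on that value before it may read the uploaded weights. All names are placeholders, and setup is omitted (upload_sem must be created with a vk::SemaphoreTypeCreateInfo specifying vk::SemaphoreType::eTimeline).

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>

void submit_async_upload(vk::Queue transfer_queue, vk::Queue compute_queue,
                         vk::CommandBuffer copy_cmd, vk::CommandBuffer compute_cmd,
                         vk::Semaphore upload_sem, uint64_t signal_value) {
    // Transfer submission: signals `signal_value` when the copy completes.
    vk::TimelineSemaphoreSubmitInfo signal_info{};
    signal_info.signalSemaphoreValueCount = 1;
    signal_info.pSignalSemaphoreValues    = &signal_value;

    vk::SubmitInfo copy_submit{};
    copy_submit.pNext                = &signal_info;
    copy_submit.commandBufferCount   = 1;
    copy_submit.pCommandBuffers      = &copy_cmd;   // contains the buffer copy
    copy_submit.signalSemaphoreCount = 1;
    copy_submit.pSignalSemaphores    = &upload_sem;
    transfer_queue.submit(copy_submit);

    // Compute submission: blocks at the compute-shader stage until the
    // semaphore reaches `signal_value`, i.e. until the weights are uploaded.
    vk::TimelineSemaphoreSubmitInfo wait_info{};
    wait_info.waitSemaphoreValueCount = 1;
    wait_info.pWaitSemaphoreValues    = &signal_value;

    vk::PipelineStageFlags wait_stage = vk::PipelineStageFlagBits::eComputeShader;
    vk::SubmitInfo compute_submit{};
    compute_submit.pNext              = &wait_info;
    compute_submit.waitSemaphoreCount = 1;
    compute_submit.pWaitSemaphores    = &upload_sem;
    compute_submit.pWaitDstStageMask  = &wait_stage;
    compute_submit.commandBufferCount = 1;
    compute_submit.pCommandBuffers    = &compute_cmd;
    compute_queue.submit(compute_submit);
}
```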
Then I checked and found that the dedicated transfer queue on AMD is not exposed by the Linux driver by default, so it's not actually being used. The difference comes from using a second queue (the graphics queue) for transfers, so I assume the issue was the compute queue being congested with other work.
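A sketch of the resulting fallback, again a hypothetical helper rather than the PR's code: when no dedicated transfer family is exposed, pick any other transfer-capable family so uploads run beside, rather than behind, the congested compute queue.

```cpp
#include <vulkan/vulkan.hpp>
#include <cstdint>
#include <vector>

// Prefer any transfer-capable family other than the compute family; on
// RADV this ends up being the graphics family, which is the second queue
// that produced the speedup described above.
uint32_t pick_second_queue_family(const std::vector<vk::QueueFamilyProperties> & fams,
                                  uint32_t compute_family) {
    for (uint32_t i = 0; i < fams.size(); i++) {
        if (i != compute_family && (fams[i].queueFlags & vk::QueueFlagBits::eTransfer)) {
            return i; // e.g. the graphics family
        }
    }
    return compute_family; // single-queue device: transfers share the compute queue
}
```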
This helps on AMD RDNA4, but not on GCN and not on Nvidia. I couldn't test Intel because the Linux driver only exposes a single queue.
I also changed the Vulkan backend's offload_op function to work like the CUDA backend's, to cover some more cases, and moved some redundant context code into a shared function to reduce duplication.
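For reference, the CUDA backend's offload heuristic is essentially a batch-size threshold. A simplified sketch of that logic follows (the real ggml code has more cases, and the function names here are placeholders): offloading a weight upload only pays off when the op processes enough rows to amortize the transfer cost.

```cpp
#include "ggml.h" // assumed include path within the llama.cpp tree

// Batch size of an op: for MUL_MAT_ID the batch dimension is ne[2],
// otherwise ne[1].
static int64_t op_batch_size(const ggml_tensor * op) {
    return op->op == GGML_OP_MUL_MAT_ID ? op->ne[2] : op->ne[1];
}

// Offload the op to the GPU only when the batch is large enough to
// amortize the CPU-to-GPU weight transfer.
static bool device_offload_op(const ggml_tensor * op) {
    const int64_t min_batch_size = 32; // threshold used by the CUDA backend
    return op_batch_size(op) >= min_batch_size;
}
```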