vulkan: Fix data races in coopmat1 mul_mat(_id) #20084
Conversation
Add barriers between coopmat store and regular loads. We sort of got away with this because it was the same subgroup accessing the values, but it's still a race and may not work.
I do see small performance regressions from this change on AMD. Considering it seems to work even without barriers, since the accesses are limited to within a subgroup, we don't need full barriers here. A perfect place for subgroupMemoryBarrierShared.
subgroupMemoryBarrierShared wouldn't be sufficient, but I'm surprised this shows up as a real perf hit since this is just in the epilogue of the shader.
Why isn't it enough? It's about shared memory that's only used by the subgroup.
subgroupMemoryBarrierShared is only an OpMemoryBarrier. Synchronization requires a release on the storing thread, an acquire on the loading thread, and an "edge" between those that can be provided either by a control barrier or an atomic store and an atomic load that sees the stored value (this is in the "Synchronizes-With" section of the memory model appendix). When using a control barrier, the release and acquire can be performed by the same control barrier (similarly, the release and acquire can optionally be folded into the corresponding atomics). A subgroup control barrier ought to be relatively cheap compared to a workgroup control barrier.
Yeah, then let's use the minimal subgroup control barrier.
OK, changed to subgroup control barriers.
This seems to be causing CI failures.
Damn, I missed that. But I don't see these failures locally, with llvmpipe. That's odd.
Do you have a recent version of llvmpipe that supports coopmat1? I think we should just revert this for now. I wonder if llvmpipe has some bug in its handling of this unusual barrier instruction.
Ah yeah, on Arch I can trigger it with a newer llvmpipe version. But I don't think llvmpipe coopmat makes much sense anyways, so we can also just disable coopmat in the CI. |
* vulkan: Fix data races in coopmat1 mul_mat(_id)

  Add barriers between coopmat store and regular loads. We sort of got away with this because it was the same subgroup accessing the values, but it's still a race and may not work.

* switch to subgroup control barriers
I added shared memory data race detection for coopmat1 (KhronosGroup/Vulkan-ValidationLayers#11780) and this fixes the issues it found. No performance regressions on my system.