[ET-VK] Serialize prepack dispatches on PowerVR GPUs #17468
abdelaziz-mahdy wants to merge 11 commits into pytorch:main
Conversation
Add PowerVR GPU type detection to the Vulkan backend device enumeration, PowerVR-specific workgroup size tuning for convolution operators, and correctness fixes for PowerVR's TBDR architecture.

Changes:
- Add POWERVR to DeviceType enum with string detection
- Add device_is_powervr() convenience method on ComputeGraph
- Add PowerVR-specific workgroup sizes (32 instead of 64) for convolution dispatch to match PowerVR execution unit configuration
- Force optimal tiling on PowerVR (linear tiling may produce incorrect results in compute shaders on TBDR architecture)
- Enable robustBufferAccess on PowerVR for well-defined OOB behavior

Tested on Pixel 10 Pro (PowerVR D-Series DXT-48-1536 MC1):
- FP32 convolution passes all tests
- Non-conv FP16 ops (add, multiply) pass correctly
- FP16 conv has known bias texture initialization issue (pytorch#17299)

Related: pytorch#17299
set_staging_zeros() and cast_and_copy_from() write to staging buffers without flushing, unlike copy_from() which correctly calls vmaFlushAllocation(). On GPUs where VMA staging memory is not host-coherent (e.g. PowerVR), CPU writes stay in cache and the GPU reads garbage, causing incorrect inference results. This fixes FP16 convolution producing wrong outputs on PowerVR GPUs where the implicit zero-bias texture reads uninitialized memory.
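The coherence bug described above can be illustrated with a toy model (this is not the real VMA API or the actual ExecuTorch code): on a host-visible but non-coherent memory type, CPU writes sit in the host's cache until an explicit flush, so a staging path that skips the flush leaves the device reading stale bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a host-visible, NON-coherent staging allocation: CPU writes
// land in a "CPU side" buffer and only become visible to the device after an
// explicit flush, mirroring what vmaFlushAllocation() guarantees.
struct StagingAlloc {
  std::vector<uint8_t> cpu_side;    // what the host wrote (cached)
  std::vector<uint8_t> gpu_visible; // what the device actually reads

  explicit StagingAlloc(std::size_t n)
      : cpu_side(n, 0xAA), gpu_visible(n, 0xAA) {} // 0xAA = "garbage"

  void host_write(std::size_t i, uint8_t v) { cpu_side[i] = v; }
  void flush() { gpu_visible = cpu_side; } // models vmaFlushAllocation()
  uint8_t device_read(std::size_t i) const { return gpu_visible[i]; }
};

// Hypothetical stand-in for the buggy set_staging_zeros(): writes, no flush.
void set_staging_zeros_buggy(StagingAlloc& a) {
  for (std::size_t i = 0; i < a.cpu_side.size(); ++i) a.host_write(i, 0);
}

// Stand-in for the fixed path (what copy_from() already does): write + flush.
void set_staging_zeros_fixed(StagingAlloc& a) {
  for (std::size_t i = 0; i < a.cpu_side.size(); ++i) a.host_write(i, 0);
  a.flush();
}
```

On a coherent memory type both paths would behave identically, which is why the missing flush only surfaces on devices like PowerVR where the staging memory VMA hands back is not host-coherent.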
Remove PowerVR-specific diagnostic cerr logging and unused iostream include that were used during development.
This reverts commit 9509064.
The local_wg_size variable was computed but never used since DynamicDispatchNode uses the conv2d_local_wg_size callback which already contains the PowerVR-specific logic.
- Remove unused wg_size variable left behind after removing inline workgroup size calculation (DynamicDispatchNode uses callbacks)
- Fix robustBufferAccess comment to accurately describe buffer-only scope
- Query device feature support before enabling robustBufferAccess
PowerVR corrupts prepacked constant data when multiple prepack compute dispatches are batched in a single command buffer. Only the first constant is correct; subsequent constants read as zero. This caused MobileNet to produce NaN (division-by-zero in Hardswish decomposition) and FP16 convolution to show a +0.5 bias offset. Submit and wait after each prepack node on PowerVR to ensure each constant is fully consumed before the next staging buffer is created.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17468
Note: Links to docs will display an error until the docs builds have been completed.

❌ 13 New Failures, 1 Unrelated Failure

As of commit dc11701 with merge base 429925d:

NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull request overview
This PR fixes all Vulkan backend failures on PowerVR GPUs (Pixel 10 Pro) by implementing a workaround for a PowerVR driver bug where multiple prepack compute dispatches batched in a single command buffer cause data corruption. The fix serializes prepack dispatches by submitting and waiting after each prepack node, ensuring constants are fully processed before the next staging buffer is created.
Changes:
- Add PowerVR device type detection and helper methods
- Serialize prepack dispatches on PowerVR to work around driver data corruption
- Implement PowerVR-specific workgroup sizes (32 instead of 64) for better hardware compatibility
- Force optimal tiling on PowerVR and enable robustBufferAccess feature
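The workgroup-size change in the list above amounts to a per-device dispatch parameter. A minimal sketch of that selection logic (hypothetical helper; the real code lives in Convolution.cpp's dispatch callback and its signature may differ):

```cpp
#include <cassert>
#include <cstdint>

// Device classes the Vulkan backend distinguishes; POWERVR is the value
// this PR adds to the DeviceType enum.
enum class DeviceType { ADRENO, MALI, POWERVR, OTHER };

// Pick the convolution local workgroup size: PowerVR execution units are
// tuned for 32 invocations per group, while other devices keep the
// default of 64. (Illustrative sketch, not the actual dispatch code.)
uint32_t conv_local_wg_size(DeviceType device) {
  return device == DeviceType::POWERVR ? 32u : 64u;
}
```

The same pattern extends naturally if other vendors later need their own tuning: the enum grows and the selector gains a branch, without touching the shader itself.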
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| Device.h | Adds POWERVR to DeviceType enum for device identification |
| Device.cpp | Implements PowerVR device name detection (case-insensitive string matching) |
| Adapter.cpp | Enables robustBufferAccess feature on PowerVR for well-defined out-of-bounds behavior |
| Convolution.cpp | Adds PowerVR-specific workgroup sizes (32 vs 64) and removes duplicated inline workgroup computation logic |
| ComputeGraph.h | Adds device_is_powervr() helper method for device-specific logic |
| ComputeGraph.cpp | Implements core fix: serializes prepack dispatches on PowerVR by submitting and waiting after each prepack node |
| Context.cpp | Forces optimal tiling on PowerVR to avoid linear tiling compute shader issues |
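The Device.cpp detection described in the table is a case-insensitive match against the Vulkan device name string. A self-contained sketch of how that classification might look (the helper name here is illustrative, not necessarily the one in the PR):

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Case-insensitive substring check against the name reported in
// VkPhysicalDeviceProperties::deviceName, mirroring the kind of
// string detection Device.cpp uses to classify PowerVR GPUs.
bool device_name_is_powervr(const std::string& device_name) {
  std::string lower = device_name;
  std::transform(lower.begin(), lower.end(), lower.begin(),
                 [](unsigned char c) { return std::tolower(c); });
  return lower.find("powervr") != std::string::npos;
}
```

Matching on a lowercased copy keeps the check robust to vendor-specific capitalization (e.g. "PowerVR" vs "POWERVR") without needing a per-driver table.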
Load Time Benchmark (Pixel 10 Pro — PowerVR DXT-48-1536)

I ran benchmarks comparing Vulkan load times with and without the serialized prepack fix:

With serialize_prepack (this PR):
Without serialize_prepack (batched, produces wrong results on PowerVR):
The serialization itself accounts for a 40–94x increase in load time. Without it, Vulkan loads are ~90–170ms (acceptable). With it, they jump to 6–9 seconds because each prepack node requires a separate command buffer submit + GPU wait.

Would batching dispatches in small groups (e.g., 4–8 per command buffer) instead of fully serializing to 1-per-submit be worth investigating? That could find a middle ground between correctness on PowerVR and load time.

Note: In real app usage, Vulkan inference is noticeably faster than XNNPACK on this device, so the load time is a one-time cost that pays off during repeated inference.
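The proposed middle ground is easy to quantify: flushing the command buffer every N dispatches changes the submit count from one-per-node to roughly num_nodes / N. A small sketch under that assumption (a hypothetical `count_submits` helper, not code from the PR):

```cpp
#include <cassert>
#include <cstddef>

// Number of command buffer submissions a prepack pass of `num_nodes`
// dispatches would perform if the buffer is flushed every `group_size`
// dispatches. group_size == 1 is this PR's fully serialized mode;
// group_size >= num_nodes is the original fully batched mode.
std::size_t count_submits(std::size_t num_nodes, std::size_t group_size) {
  return (num_nodes + group_size - 1) / group_size; // ceiling division
}
```

If each submit + fence wait costs a roughly fixed overhead, groups of 4–8 would cut the serialization cost by the same factor, provided PowerVR tolerates small batches; whether the corruption threshold is exactly 1 dispatch per buffer or something larger would need to be measured on device.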
@abdelaziz-mahdy thanks for this fix! The load time increases are quite substantial, unfortunately. Are there any insights as to why the PowerVR GPU requires this to produce correct outputs?
@SS-JIA, sadly I don't know if there is another workaround to fix it; if you have any ideas, let me know. Below is a summary of my testing.

Overview

Extensive testing on Pixel 10 Pro (PowerVR DXT-48-1536). This matches a known issue on Imagination's developer forum where push constants get corrupted when updated multiple times within the same command buffer.

Test Results

I tested 11 different synchronization strategies between prepack dispatches, using MobileNet V3 Small (Vulkan output vs XNNPACK reference):
Findings
Impact
Summary
Fixes all Vulkan backend failures on PowerVR GPUs (Pixel 10 Pro) by serializing prepack compute shader dispatches.
PowerVR corrupts prepacked constant data when multiple prepack compute dispatches are batched in a single command buffer. Only the first constant is correct; subsequent constants read as zero. This caused MobileNet to produce NaN (via division-by-zero in Hardswish decomposition) and FP16 convolution to show a +0.5 bias offset.
The fix submits and waits after each prepack node on PowerVR, ensuring each constant is fully consumed by the GPU before the next staging buffer is created.
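The control-flow change described above can be modeled without any Vulkan dependency. This is a toy sketch, not the actual ComputeGraph.cpp code: submission is stubbed out, where the real implementation records Vulkan dispatches and waits on a fence after each submit.

```cpp
#include <cassert>

// Minimal stand-in for the Vulkan context: counts recorded dispatches and
// command buffer submissions instead of talking to a real device.
struct FakeContext {
  int pending = 0; // dispatches recorded but not yet submitted
  int submits = 0; // command buffer submissions performed

  void dispatch() { ++pending; }
  void submit_and_wait() {
    if (pending > 0) { ++submits; pending = 0; }
  }
};

// Models the prepack() loop change: with serialize_prepack set (as on
// PowerVR), every node is followed by a submit + wait; otherwise all
// dispatches are batched into a single final submission.
void prepack(FakeContext& ctx, int num_nodes, bool serialize_prepack) {
  for (int i = 0; i < num_nodes; ++i) {
    ctx.dispatch();
    if (serialize_prepack) {
      ctx.submit_and_wait(); // one submit per constant on PowerVR
    }
  }
  ctx.submit_and_wait(); // final flush covers the batched path
}
```

The per-node wait is what guarantees each staging buffer's contents are consumed before the next one is created, at the cost of one full submit round-trip per constant.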
Changes
- `ComputeGraph.cpp` — Add `serialize_prepack` flag for PowerVR that submits and waits after each prepack node in the `prepack()` loop

Test Results (Pixel 10 Pro, PowerVR D-Series DXT-48-1536 MC1)
Trade-off
Serializing prepack dispatches increases model load time on PowerVR since each constant requires a separate command buffer submission. For MobileNet V3 Small, load time increases from ~50ms to ~200ms. This only affects model loading (one-time cost), not inference latency.
Related
Test Plan
`device_is_powervr()`)

cc @SS-JIA @manuelcandales @digantdesai @cbilgin