Initial PowerVR GPU support for Vulkan backend #17323
abdelaziz-mahdy wants to merge 10 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17323
Note: Links to docs will display an error until the docs builds have been completed.
for (size_t i = 0; i < numel; ++i) {
  dst[i] = static_cast<DST_T>(src[i]);
}
vmaFlushAllocation(
Great catch. I noticed that there were a few other calls missing as well, so I created a dedicated PR to fix: #17341
Feel free to remove these changes from this PR; also, afaik this doesn't impact the model correctness issue that you are observing.
SS-JIA left a comment:
Overall, the changes LGTM. Feel free to take it out of draft mode.
Pull request overview
This PR adds initial PowerVR GPU support to the Vulkan backend, addressing device detection and correctness issues identified on the Pixel 10 Pro. The PR enables FP32 inference to work correctly on PowerVR hardware, though FP16 support has a known +0.5 offset issue that remains unresolved.
Changes:
- Added PowerVR device type detection and convenience methods
- Applied PowerVR-specific workgroup size optimizations for convolution operations
- Forced optimal tiling on PowerVR to avoid linear tiling correctness issues
- Enabled robustBufferAccess feature for PowerVR devices
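The tiling rule listed above can be pictured as a simple selection function. The enum and helper below are illustrative stand-ins for the actual Context.cpp logic, not code from this PR:

```cpp
// Illustrative stand-in for VkImageTiling; not the real Vulkan enum values.
enum class ImageTiling { OPTIMAL, LINEAR };

// Sketch of the tiling decision: PowerVR always gets optimal tiling because
// linear tiling can produce incorrect compute-shader results on its TBDR
// architecture; other devices keep whatever preference the caller had.
ImageTiling choose_image_tiling(bool device_is_powervr, bool prefer_linear) {
  if (device_is_powervr) {
    return ImageTiling::OPTIMAL;
  }
  return prefer_linear ? ImageTiling::LINEAR : ImageTiling::OPTIMAL;
}
```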
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| backends/vulkan/runtime/vk_api/Device.h | Added POWERVR to DeviceType enum |
| backends/vulkan/runtime/vk_api/Device.cpp | Added PowerVR string detection in PhysicalDevice constructor |
| backends/vulkan/runtime/graph/ComputeGraph.h | Added device_is_powervr() convenience method |
| backends/vulkan/runtime/graph/ops/impl/Convolution.cpp | Added PowerVR-specific workgroup sizes (32 instead of 64) for convolution dispatch |
| backends/vulkan/runtime/api/Context.cpp | Forced optimal tiling on PowerVR devices |
| backends/vulkan/runtime/vk_api/Adapter.cpp | Enabled robustBufferAccess on PowerVR for well-defined out-of-bounds behavior |
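The string-based detection listed for Device.cpp can be sketched as a substring match on the reported device name. This is a simplified illustration, not the PR's exact `PhysicalDevice` constructor code, which may also consult vendor IDs:

```cpp
#include <string>

// Simplified stand-in for the backend's DeviceType enum (see Device.h).
enum class DeviceType {
  UNKNOWN,
  MALI,
  ADRENO,
  SWIFTSHADER,
  POWERVR,
};

// Sketch of vendor detection from VkPhysicalDeviceProperties::deviceName.
DeviceType detect_device_type(const std::string& device_name) {
  if (device_name.find("Mali") != std::string::npos) {
    return DeviceType::MALI;
  }
  if (device_name.find("Adreno") != std::string::npos) {
    return DeviceType::ADRENO;
  }
  if (device_name.find("SwiftShader") != std::string::npos) {
    return DeviceType::SWIFTSHADER;
  }
  if (device_name.find("PowerVR") != std::string::npos) {
    return DeviceType::POWERVR;
  }
  return DeviceType::UNKNOWN;
}
```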
MALI,
ADRENO,
SWIFTSHADER,
POWERVR,
The PR description mentions StagingBuffer.h changes (vmaFlushAllocation additions), but these changes are not present in the PR diff. Upon inspection, these changes already exist in the codebase (lines 77, 91, 125 of StagingBuffer.h). Please update the PR description to clarify that the StagingBuffer correctness fixes are already present in the codebase and are not part of this PR's changes.
SS-JIA left a comment:
Overall LGTM, just a very small request.
Add PowerVR GPU type detection to the Vulkan backend device enumeration, PowerVR-specific workgroup size tuning for convolution operators, and correctness fixes for PowerVR's TBDR architecture.

Changes:
- Add POWERVR to DeviceType enum with string detection
- Add device_is_powervr() convenience method on ComputeGraph
- Add PowerVR-specific workgroup sizes (32 instead of 64) for convolution dispatch to match PowerVR execution unit configuration
- Force optimal tiling on PowerVR (linear tiling may produce incorrect results in compute shaders on TBDR architecture)
- Enable robustBufferAccess on PowerVR for well-defined OOB behavior

Tested on Pixel 10 Pro (PowerVR D-Series DXT-48-1536 MC1):
- FP32 convolution passes all tests
- Non-conv FP16 ops (add, multiply) pass correctly
- FP16 conv has known bias texture initialization issue (pytorch#17299)

Related: pytorch#17299
set_staging_zeros() and cast_and_copy_from() write to staging buffers without flushing, unlike copy_from() which correctly calls vmaFlushAllocation(). On GPUs where VMA staging memory is not host-coherent (e.g. PowerVR), CPU writes stay in cache and the GPU reads garbage, causing incorrect inference results. This fixes FP16 convolution producing wrong outputs on PowerVR GPUs where the implicit zero-bias texture reads uninitialized memory.
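The flush rule this commit message describes can be illustrated with a small predicate: after the CPU writes into a staging buffer, a vmaFlushAllocation-style call is required unless the backing memory is host-coherent. The constant below mirrors Vulkan's VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, but this is a hypothetical sketch, not the actual StagingBuffer implementation:

```cpp
#include <cstdint>

// Mirrors VK_MEMORY_PROPERTY_HOST_COHERENT_BIT (0x4); HOST_VISIBLE is 0x2.
constexpr uint32_t kMemoryPropertyHostVisibleBit = 0x2;
constexpr uint32_t kMemoryPropertyHostCoherentBit = 0x4;

// Returns true when a flush must follow a CPU write so the GPU observes the
// data (e.g. on PowerVR, where staging memory may be host-visible but not
// host-coherent, so CPU writes can linger in cache).
bool needs_flush_after_cpu_write(uint32_t memory_property_flags) {
  return (memory_property_flags & kMemoryPropertyHostCoherentBit) == 0;
}
```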
Remove PowerVR-specific diagnostic cerr logging and unused iostream include that were used during development.
This reverts commit 9509064.
The local_wg_size variable was computed but never used since DynamicDispatchNode uses the conv2d_local_wg_size callback which already contains the PowerVR-specific logic.
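The callback-based selection described above can be sketched as follows. `pick_conv2d_local_wg_size` and its `is_powervr` flag are stand-ins for the real `conv2d_local_wg_size` callback and `ComputeGraph::device_is_powervr()`; the actual callback also depends on the convolution method and tensor sizes:

```cpp
#include <array>
#include <cstdint>

// Sketch: the dispatch node asks a callback for the local workgroup size at
// dispatch time, and the callback applies the PowerVR-specific choice of
// 32 invocations instead of 64 to match PowerVR execution units.
std::array<uint32_t, 3> pick_conv2d_local_wg_size(bool is_powervr) {
  const uint32_t width = is_powervr ? 32u : 64u;
  return {width, 1u, 1u};
}
```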
force-pushed from 9e14319 to bd9b151
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
utils::uvec3 wg_size = create_conv2d_global_wg_size(
    graph, method, out, weight_data, stride_equals_dilation);

utils::uvec3 local_wg_size;
if (method == Conv2dMethod::Depthwise || method == Conv2dMethod::Pointwise) {
  wg_size = {wg_size[0] * wg_size[1], wg_size[2], 1};
}
wg_size is computed (and reshaped for depthwise/pointwise) but is no longer used after removing the local_wg_size block. This leaves dead code and can trigger unused-variable warnings; consider deleting the wg_size calculation entirely here since DynamicDispatchNode uses conv2d_global_wg_size/conv2d_local_wg_size callbacks to determine dispatch sizes.
// Enable robustBufferAccess on PowerVR devices to ensure well-defined
// behavior for out-of-bounds buffer/image accesses. Without this, PowerVR
// drivers may return zeros or undefined values for edge cases in compute
// shaders. This has a minor performance cost but improves correctness.
The comment mentions "out-of-bounds buffer/image accesses", but VkPhysicalDeviceFeatures::robustBufferAccess only applies to buffer descriptor accesses (it doesn't make out-of-bounds storage image operations well-defined). Please tighten the wording so it matches what the enabled feature actually guarantees, to avoid future debugging confusion.
Suggested change:

// Enable robustBufferAccess on PowerVR devices to provide more well-defined
// behavior for out-of-bounds buffer descriptor accesses. Without this,
// PowerVR drivers may return zeros or undefined values for some edge cases
// in compute shaders. This has a minor performance cost but improves correctness.
// shaders. This has a minor performance cost but improves correctness.
VkPhysicalDeviceFeatures enabled_features{};
if (physical_device.device_type == DeviceType::POWERVR) {
  enabled_features.robustBufferAccess = VK_TRUE;
robustBufferAccess is enabled for PowerVR without checking whether the physical device reports support for it. If an implementation were to report robustBufferAccess = VK_FALSE, this would cause vkCreateDevice to fail on PowerVR only. Consider querying vkGetPhysicalDeviceFeatures (or storing it in PhysicalDevice) and only requesting the feature when it is supported (or explicitly erroring with a clear message).
Suggested change:

VkPhysicalDeviceFeatures supported_features{};
vkGetPhysicalDeviceFeatures(physical_device.handle, &supported_features);
if (supported_features.robustBufferAccess == VK_TRUE) {
  enabled_features.robustBufferAccess = VK_TRUE;
}
- Remove unused wg_size variable left behind after removing the inline workgroup size calculation (DynamicDispatchNode uses callbacks)
- Fix robustBufferAccess comment to accurately describe buffer-only scope
- Query device feature support before enabling robustBufferAccess
Summary
Initial work to add PowerVR GPU support to the Vulkan backend. This PR adds device detection, workgroup tuning, and correctness fixes needed for PowerVR's TBDR architecture.
FP32 inference works correctly with these changes. FP16 has a known issue (see below).
Motivation
The Vulkan backend currently has no awareness of PowerVR GPUs. Testing on Pixel 10 Pro (PowerVR D-Series DXT-48-1536 MC1) showed that without explicit PowerVR handling, inference produces incorrect results. This PR adds the foundational support needed to run models on PowerVR hardware.
Changes
Device Detection (vk_api/Device.h, vk_api/Device.cpp)
- Added POWERVR to the DeviceType enum
- Added PowerVR string detection in the PhysicalDevice constructor

Compute Graph (graph/ComputeGraph.h)
- Added device_is_powervr() convenience method

Convolution Optimization (graph/ops/impl/Convolution.cpp)
- Added PowerVR-specific workgroup sizes via the conv2d_local_wg_size callback

Tiling Fix (api/Context.cpp)
- Forced optimal tiling on PowerVR devices

Buffer Access (vk_api/Adapter.cpp)
- Enabled robustBufferAccess on PowerVR for well-defined out-of-bounds behavior

Test Results (Pixel 10 Pro)
I tested 13 minimal single-operator models to isolate behavior:
Key observations:
MobileNet progressive slices all produce NaN, likely cascading from the +0.5 offset through batch normalization.
Known Issues
FP16 +0.5 Offset (Unresolved)
All FP16 convolution and linear operations on PowerVR show a constant +0.5 added to the output. This is deterministic and affects all conv variants. The root cause has not been identified yet.
What I've ruled out:
The issue likely lives somewhere in the FP16 prepack or texture path, but I haven't pinpointed it yet. Help from the Vulkan team would be appreciated.
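One way to characterize a deterministic offset like the +0.5 seen here is to check whether the element-wise difference between reference and device outputs is constant: a uniform difference points at an additive bug (e.g. a bias path) rather than a scaling or layout error. A small sketch, not taken from the PR:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Returns true if actual[i] - expected[i] is the same constant (within tol)
// for every element, and writes that constant to *offset. A uniform +0.5
// result here would suggest an additive bug such as a bias texture issue.
bool has_constant_offset(const std::vector<float>& expected,
                         const std::vector<float>& actual,
                         float* offset,
                         float tol = 1e-4f) {
  if (expected.size() != actual.size() || expected.empty()) {
    return false;
  }
  const float first = actual[0] - expected[0];
  for (std::size_t i = 1; i < expected.size(); ++i) {
    if (std::fabs((actual[i] - expected[i]) - first) > tol) {
      return false;
    }
  }
  *offset = first;
  return true;
}
```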
Test Plan
Related Issues
cc @SS-JIA @manuelcandales @digantdesai @cbilgin