[ET-VK] Use direct buffer-to-image copy for 1D tensor prepacking#17467
[ET-VK] Use direct buffer-to-image copy for 1D tensor prepacking#17467abdelaziz-mahdy wants to merge 1 commit intopytorch:mainfrom
Conversation
The nchw_to_image compute shader uses axis_map coordinate remapping designed for multi-dimensional tensors. For 1D tensors (e.g., conv2d bias) padded to 4D with three dimensions of size 1, the coordinate math produces out-of-bounds texture writes. Some GPU drivers (PowerVR) strictly validate texture bounds and silently drop these writes, causing bias values to never reach the texture. Other drivers wrap coordinates, masking the bug. For 1D width-packed tensors where the element count is divisible by 4, the staging buffer data is already in the correct layout for the target texture. This change bypasses the compute shader and uses vkCmdCopyBufferToImage to transfer the data directly. Also adds TRANSFER_SRC to staging buffer usage flags (harmless for existing paths) and a copy_buffer_to_image helper to CommandBuffer.
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17467
Note: Links to docs will display an error until the docs builds have been completed. ❌ 8 Awaiting Approval, 1 New FailureAs of commit 6974d32 with merge base 429925d ( AWAITING APPROVAL - The following workflows need approval before CI can run:
NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
This PR needs a
|
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical bug where 1D tensor prepacking (e.g., conv2d bias) produces incorrect results on PowerVR GPUs due to strict texture bounds checking. The nchw_to_image compute shader's coordinate remapping logic produces out-of-bounds texture writes for 1D tensors padded to 4D with fake dimensions, which PowerVR silently drops. The fix bypasses the buggy shader for 1D width-packed tensors by using a direct vkCmdCopyBufferToImage DMA operation instead.
Changes:
- Adds direct buffer-to-image copy path for 1D width-packed tensors with element counts divisible by 4
- Implements new
copy_buffer_to_image()helper with proper layout transitions and post-copy barriers - Updates staging buffer usage flags to include
TRANSFER_SRCfor DMA operations
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
backends/vulkan/runtime/vk_api/memory/Allocator.cpp |
Adds VK_BUFFER_USAGE_TRANSFER_SRC_BIT to staging buffer usage flags to enable direct copy operations |
backends/vulkan/runtime/vk_api/Command.h |
Declares new copy_buffer_to_image() method for buffer-to-image DMA with layout transitions |
backends/vulkan/runtime/vk_api/Command.cpp |
Implements copy_buffer_to_image() with proper synchronization via pre/post-copy barriers |
backends/vulkan/runtime/graph/ops/PrepackNode.cpp |
Detects 1D width-packed tensors and uses direct copy path instead of compute shader remapping |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Closing - Superseded by Serialize Prepack FixI'm closing this PR. The 1D bias prepacking issue on PowerVR turned out to be a symptom of a broader problem, not a coordinate math issue. What I FoundThe actual root cause is that PowerVR corrupts prepacked constant data when multiple prepack compute dispatches are batched in a single Vulkan command buffer. Only the first constant in the batch is correct; subsequent constants read as zero. For conv2d bias, this meant the bias tensor (typically the second prepacked tensor after weights) read as zero, producing the incorrect output I attributed to out-of-bounds texture writes. The Real FixA change in I tested MobileNet V3 Small on Pixel 10 Pro with only this fix (all changes from this PR reverted): FP32 and FP16 both produce correct classification results, matching XNNPACK output. |
Summary
Fixes 1D tensor prepacking (e.g., conv2d bias) producing incorrect results on GPUs with strict texture bounds checking (PowerVR).
The
nchw_to_imagecompute shader usesaxis_mapcoordinate remapping designed for multi-dimensional tensors. For 1D tensors padded to 4D with three fake dimensions of size 1, the coordinate math produces out-of-bounds texture writes. PowerVR GPUs silently drop these writes, so bias values never reach the texture. Other GPUs (Adreno, Mali) wrap coordinates, masking the bug.For 1D width-packed tensors where the element count is divisible by 4, the staging buffer data is already in the correct layout. This change bypasses the compute shader and uses
vkCmdCopyBufferToImageto copy the data directly.Changes:
PrepackNode::encode()— detect 1D width-packed tensors and use direct copy pathCommandBuffer::copy_buffer_to_image()— new helper for buffer-to-image DMA with post-copy barrierAllocator::create_staging_buffer()— addTRANSFER_SRCto staging buffer usage flags (harmless for existing paths, enables direct copy)Related: #17299
Test plan