Skip to content

[ET-VK] Use direct buffer-to-image copy for 1D tensor prepacking#17467

Closed
abdelaziz-mahdy wants to merge 1 commit intopytorch:mainfrom
abdelaziz-mahdy:fix/prepack-1d-direct-copy
Closed

[ET-VK] Use direct buffer-to-image copy for 1D tensor prepacking#17467
abdelaziz-mahdy wants to merge 1 commit intopytorch:mainfrom
abdelaziz-mahdy:fix/prepack-1d-direct-copy

Conversation

@abdelaziz-mahdy
Copy link
Contributor

Summary

Fixes 1D tensor prepacking (e.g., conv2d bias) producing incorrect results on GPUs with strict texture bounds checking (PowerVR).

The nchw_to_image compute shader uses axis_map coordinate remapping designed for multi-dimensional tensors. For 1D tensors padded to 4D with three fake dimensions of size 1, the coordinate math produces out-of-bounds texture writes. PowerVR GPUs silently drop these writes, so bias values never reach the texture. Other GPUs (Adreno, Mali) wrap coordinates, masking the bug.

For 1D width-packed tensors where the element count is divisible by 4, the staging buffer data is already in the correct layout. This change bypasses the compute shader and uses vkCmdCopyBufferToImage to copy the data directly.

Changes:

  • PrepackNode::encode() — detect 1D width-packed tensors and use direct copy path
  • CommandBuffer::copy_buffer_to_image() — new helper for buffer-to-image DMA with post-copy barrier
  • Allocator::create_staging_buffer() — add TRANSFER_SRC to staging buffer usage flags (harmless for existing paths, enables direct copy)

Related: #17299

Test plan

  • Tested on Pixel 10 Pro (PowerVR DXT-48-1536) — conv2d bias values now reach the texture correctly
  • Verified no regression on other GPUs (Adreno, Mali) since the coordinate wrapping behavior is unchanged for multi-dimensional tensors

The nchw_to_image compute shader uses axis_map coordinate remapping
designed for multi-dimensional tensors. For 1D tensors (e.g., conv2d
bias) padded to 4D with three dimensions of size 1, the coordinate
math produces out-of-bounds texture writes. Some GPU drivers (PowerVR)
strictly validate texture bounds and silently drop these writes,
causing bias values to never reach the texture. Other drivers wrap
coordinates, masking the bug.

For 1D width-packed tensors where the element count is divisible by 4,
the staging buffer data is already in the correct layout for the
target texture. This change bypasses the compute shader and uses
vkCmdCopyBufferToImage to transfer the data directly.

Also adds TRANSFER_SRC to staging buffer usage flags (harmless for
existing paths) and a copy_buffer_to_image helper to CommandBuffer.
@pytorch-bot
Copy link

pytorch-bot bot commented Feb 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17467

Note: Links to docs will display an error until the docs builds have been completed.

❌ 8 Awaiting Approval, 1 New Failure

As of commit 6974d32 with merge base 429925d (image):

AWAITING APPROVAL - The following workflows need approval before CI can run:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 14, 2026
@github-actions
Copy link

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@abdelaziz-mahdy abdelaziz-mahdy marked this pull request as ready for review February 14, 2026 23:15
Copilot AI review requested due to automatic review settings February 14, 2026 23:15
@abdelaziz-mahdy abdelaziz-mahdy marked this pull request as draft February 14, 2026 23:15
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical bug where 1D tensor prepacking (e.g., conv2d bias) produces incorrect results on PowerVR GPUs due to strict texture bounds checking. The nchw_to_image compute shader's coordinate remapping logic produces out-of-bounds texture writes for 1D tensors padded to 4D with fake dimensions, which PowerVR silently drops. The fix bypasses the buggy shader for 1D width-packed tensors by using a direct vkCmdCopyBufferToImage DMA operation instead.

Changes:

  • Adds direct buffer-to-image copy path for 1D width-packed tensors with element counts divisible by 4
  • Implements new copy_buffer_to_image() helper with proper layout transitions and post-copy barriers
  • Updates staging buffer usage flags to include TRANSFER_SRC for DMA operations

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
backends/vulkan/runtime/vk_api/memory/Allocator.cpp Adds VK_BUFFER_USAGE_TRANSFER_SRC_BIT to staging buffer usage flags to enable direct copy operations
backends/vulkan/runtime/vk_api/Command.h Declares new copy_buffer_to_image() method for buffer-to-image DMA with layout transitions
backends/vulkan/runtime/vk_api/Command.cpp Implements copy_buffer_to_image() with proper synchronization via pre/post-copy barriers
backends/vulkan/runtime/graph/ops/PrepackNode.cpp Detects 1D width-packed tensors and uses direct copy path instead of compute shader remapping

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@abdelaziz-mahdy
Copy link
Contributor Author

Closing - Superseded by Serialize Prepack Fix

I'm closing this PR. The 1D bias prepacking issue on PowerVR turned out to be a symptom of a broader problem, not a coordinate math issue.

What I Found

The actual root cause is that PowerVR corrupts prepacked constant data when multiple prepack compute dispatches are batched in a single Vulkan command buffer. Only the first constant in the batch is correct; subsequent constants read as zero.

For conv2d bias, this meant the bias tensor (typically the second prepacked tensor after weights) read as zero, producing the incorrect output I attributed to out-of-bounds texture writes.

The Real Fix

A change in ComputeGraph::prepack() that serializes prepack dispatches on PowerVR. This fixes bias prepacking, weight prepacking, scalar constant prepacking, and every other multi-constant model — without touching PrepackNode, CommandBuffer, or Allocator.

I tested MobileNet V3 Small on Pixel 10 Pro with only this fix (all changes from this PR reverted): FP32 and FP16 both produce correct classification results, matching XNNPACK output.

PR #17468 has the fix. Details in #17299.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants