
[ET-VK] Serialize prepack dispatches on PowerVR GPUs #17468

Open

abdelaziz-mahdy wants to merge 11 commits into pytorch:main from
abdelaziz-mahdy:powervr-serialize-prepack

Conversation


abdelaziz-mahdy (Contributor) commented Feb 15, 2026

Summary

Fixes all Vulkan backend failures on PowerVR GPUs (Pixel 10 Pro) by serializing prepack compute shader dispatches.

PowerVR corrupts prepacked constant data when multiple prepack compute dispatches are batched in a single command buffer. Only the first constant is correct; subsequent constants read as zero. This caused MobileNet to produce NaN (via division-by-zero in Hardswish decomposition) and FP16 convolution to show a +0.5 bias offset.

The fix submits and waits after each prepack node on PowerVR, ensuring each constant is fully consumed by the GPU before the next staging buffer is created.

Changes

  • ComputeGraph.cpp — Add serialize_prepack flag for PowerVR that submits and waits after each prepack node in the prepack() loop

Test Results (Pixel 10 Pro, PowerVR D-Series DXT-48-1536 MC1)

| Test | Status |
| --- | --- |
| 12 multi-op chain models (2+ prepacked constants) | All PASS |
| 6 Hardswish isolation models | All PASS |
| 9 single scalar-constant models | All PASS |
| MobileNet V3 Small FP32 (cat.jpg) | PASS - class 281 (37.09%), exact match with XNNPACK |
| MobileNet V3 Small FP16 (cat.jpg) | PASS - class 281 (37.19%), correct within FP16 precision |

Trade-off

Serializing prepack dispatches increases model load time on PowerVR since each constant requires a separate command buffer submission. For MobileNet V3 Small, load time increases from ~50ms to ~200ms. This only affects model loading (one-time cost), not inference latency.

Related

Test Plan

  • 12 multi-op chain models pass on Pixel 10 Pro (PowerVR)
  • 6 Hardswish isolation models pass on Pixel 10 Pro
  • MobileNet V3 Small FP32 and FP16 produce correct classification on Pixel 10 Pro
  • Verify no regression on Adreno/Mali (serialize_prepack is guarded by device_is_powervr())

cc @SS-JIA @manuelcandales @digantdesai @cbilgin

Add PowerVR GPU type detection to the Vulkan backend device enumeration,
PowerVR-specific workgroup size tuning for convolution operators, and
correctness fixes for PowerVR's TBDR architecture.

Changes:
- Add POWERVR to DeviceType enum with string detection
- Add device_is_powervr() convenience method on ComputeGraph
- Add PowerVR-specific workgroup sizes (32 instead of 64) for
  convolution dispatch to match PowerVR execution unit configuration
- Force optimal tiling on PowerVR (linear tiling may produce
  incorrect results in compute shaders on TBDR architecture)
- Enable robustBufferAccess on PowerVR for well-defined OOB behavior

Tested on Pixel 10 Pro (PowerVR D-Series DXT-48-1536 MC1):
- FP32 convolution passes all tests
- Non-conv FP16 ops (add, multiply) pass correctly
- FP16 conv has known bias texture initialization issue (pytorch#17299)

Related: pytorch#17299

set_staging_zeros() and cast_and_copy_from() write to staging buffers
without flushing, unlike copy_from() which correctly calls
vmaFlushAllocation(). On GPUs where VMA staging memory is not
host-coherent (e.g. PowerVR), CPU writes stay in cache and the GPU
reads garbage, causing incorrect inference results.

This fixes FP16 convolution producing wrong outputs on PowerVR GPUs
where the implicit zero-bias texture reads uninitialized memory.

Remove PowerVR-specific diagnostic cerr logging and the unused <iostream>
include that were used during development.
The local_wg_size variable was computed but never used since
DynamicDispatchNode uses the conv2d_local_wg_size callback
which already contains the PowerVR-specific logic.

- Remove unused wg_size variable left behind after removing inline
  workgroup size calculation (DynamicDispatchNode uses callbacks)
- Fix robustBufferAccess comment to accurately describe buffer-only scope
- Query device feature support before enabling robustBufferAccess

PowerVR corrupts prepacked constant data when multiple prepack compute
dispatches are batched in a single command buffer. Only the first
constant is correct; subsequent constants read as zero.

This caused MobileNet to produce NaN (division-by-zero in Hardswish
decomposition) and FP16 convolution to show a +0.5 bias offset.

Submit and wait after each prepack node on PowerVR to ensure each
constant is fully consumed before the next staging buffer is created.
Copilot AI review requested due to automatic review settings February 15, 2026 03:26

pytorch-bot bot commented Feb 15, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17468

Note: Links to docs will display an error until the docs builds have been completed.

❌ 13 New Failures, 1 Unrelated Failure

As of commit dc11701 with merge base 429925d:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 15, 2026
Copilot AI left a comment

Pull request overview

This PR fixes all Vulkan backend failures on PowerVR GPUs (Pixel 10 Pro) by implementing a workaround for a PowerVR driver bug where multiple prepack compute dispatches batched in a single command buffer cause data corruption. The fix serializes prepack dispatches by submitting and waiting after each prepack node, ensuring constants are fully processed before the next staging buffer is created.

Changes:

  • Add PowerVR device type detection and helper methods
  • Serialize prepack dispatches on PowerVR to work around driver data corruption
  • Implement PowerVR-specific workgroup sizes (32 instead of 64) for better hardware compatibility
  • Force optimal tiling on PowerVR and enable robustBufferAccess feature

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| Device.h | Adds POWERVR to DeviceType enum for device identification |
| Device.cpp | Implements PowerVR device name detection (case-insensitive string matching) |
| Adapter.cpp | Enables robustBufferAccess feature on PowerVR for well-defined out-of-bounds behavior |
| Convolution.cpp | Adds PowerVR-specific workgroup sizes (32 vs 64) and removes duplicated inline workgroup computation logic |
| ComputeGraph.h | Adds device_is_powervr() helper method for device-specific logic |
| ComputeGraph.cpp | Implements the core fix: serializes prepack dispatches on PowerVR by submitting and waiting after each prepack node |
| Context.cpp | Forces optimal tiling on PowerVR to avoid linear tiling compute shader issues |


@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


abdelaziz-mahdy commented Feb 15, 2026

Load Time Benchmark (Pixel 10 Pro — PowerVR DXT-48-1536)

I ran benchmarks comparing Vulkan load times with and without the serialized prepack fix:

With serialize_prepack (this PR):

| Model | XNNPACK | Vulkan | Ratio |
| --- | --- | --- | --- |
| YOLO11n | 11ms | 8,890ms | 808x |
| YOLOv8n | 14ms | 6,504ms | 465x |
| MobileNet V3 Small | 5ms | 8,230ms | 1646x |

Without serialize_prepack (batched, produces wrong results on PowerVR):

| Model | XNNPACK | Vulkan | Ratio |
| --- | --- | --- | --- |
| YOLO11n | 11ms | 171ms | 16x |
| YOLOv8n | 14ms | 161ms | 12x |
| MobileNet V3 Small | 5ms | 88ms | 18x |

The serialization itself accounts for a 40–94x increase in load time. Without it, Vulkan loads are ~90–170ms (acceptable). With it, they jump to 6–9 seconds because each prepack node requires a separate command buffer submit + GPU wait.

Would batching dispatches in small groups (e.g., 4–8 per command buffer) instead of fully serializing to 1-per-submit be worth investigating? That could find a middle ground between correctness on PowerVR and load time.

Note: In real app usage, Vulkan inference is noticeably faster than XNNPACK on this device, so the load time is a one-time cost that pays off during repeated inference.

@nil-is-all nil-is-all added the module: vulkan Issues related to the Vulkan delegate and code under backends/vulkan/ label Feb 17, 2026
@SS-JIA
Copy link
Contributor

SS-JIA commented Feb 19, 2026

@abdelaziz-mahdy thanks for this fix! The load time increases are quite substantial, unfortunately. Are there any insights as to why PowerVR GPU requires this to produce correct outputs?

@abdelaziz-mahdy
Contributor Author

@SS-JIA, sadly I don't know of another workaround; if you have any ideas, let me know. Below is a summary of my testing.

Overview

Extensive testing on Pixel 10 Pro (PowerVR D-Series DXT-48-1536 MC1) confirms this is a broader PowerVR driver bug, not limited to push constants.

This matches a known issue on Imagination's developer forum where push constants get corrupted when updated multiple times within the same command buffer.

Test Results

I tested 11 different synchronization strategies between prepack dispatches, using MobileNet V3 Small (Vulkan output vs XNNPACK reference):

| Mode | Strategy | Load Time | Max Diff | NaN | Top-1 | Result |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | No barrier (baseline) | ~100ms | N/A | 1000 | NO | FAIL |
| 1 | Execution barrier | ~100ms | N/A | 1000 | NO | FAIL |
| 2 | Memory barrier (compute->compute) | ~100ms | N/A | 1000 | NO | FAIL |
| 3 | Submit + wait per node | ~8200ms | 0.50 | 0 | YES | PASS |
| 4 | Submit, no wait (new CB) | ~7800ms | 0.50 | 0 | YES | PASS |
| 5 | Submit, no wait + flush | ~8200ms | 0.50 | 0 | YES | PASS |
| 6 | Batch every 2 nodes | ~4080ms | 50.12 | 0 | NO | FAIL |
| 7 | Batch every 4 nodes | ~2100ms | 4.27 | 0 | NO | FAIL |
| 8 | Batch every 8 nodes | ~1100ms | 4.27 | 0 | NO | FAIL |
| 9 | Batch every 16 nodes | ~600ms | 4.27 | 0 | NO | FAIL |
| 10 | Hybrid: UBO + serialize PC-only | ~7600ms | 4.27 | 0 | NO | FAIL |

Findings

  1. Barriers don't help (modes 0-2) - Neither execution-only nor full memory barriers fix it, ruling out a synchronization issue.

  2. Only new command buffers work (modes 3-5) - Submitting and starting a fresh CB after each node is the only fix. Mode 4 shows I don't even need vkQueueWaitIdle, just a CB boundary - pointing to internal CB state corruption.

  3. Even 2 nodes per CB corrupts (mode 6) - No middle ground. Strictly 1 dispatch per CB required.

  4. Not limited to push constants (mode 10) - I replaced push constants with UBOs for standard/bias prepack shaders (using the existing no_pc shader variants + sizes_ubo()). UBO-only nodes still corrupt when batched. The bug affects all dispatch types sharing a command buffer.

Impact

  • ~12 line fix, PowerVR-only, zero impact on other GPUs
  • Load time: ~100ms to ~8200ms (one-time at model load, inference unaffected)
  • Mode 4 (submit without CPU wait) could reduce overhead while still fixing the bug


Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. module: vulkan Issues related to the Vulkan delegate and code under backends/vulkan/


Development

Successfully merging this pull request may close these issues.

Vulkan backend produces all-zero outputs on PowerVR GPU (Pixel 10 Pro)

4 participants