
[ET-VK] Serialize prepack dispatches on PowerVR GPUs #17468

Open

abdelaziz-mahdy wants to merge 11 commits into pytorch:main from
abdelaziz-mahdy:powervr-serialize-prepack

Conversation


abdelaziz-mahdy (Contributor) commented Feb 15, 2026

Summary

Fixes all Vulkan backend failures on PowerVR GPUs (Pixel 10 Pro) by serializing prepack compute shader dispatches.

PowerVR corrupts prepacked constant data when multiple prepack compute dispatches are batched in a single command buffer. Only the first constant is correct; subsequent constants read as zero. This caused MobileNet to produce NaN (via division-by-zero in Hardswish decomposition) and FP16 convolution to show a +0.5 bias offset.

The fix submits and waits after each prepack node on PowerVR, ensuring each constant is fully consumed by the GPU before the next staging buffer is created.

Changes

  • ComputeGraph.cpp — Add serialize_prepack flag for PowerVR that submits and waits after each prepack node in the prepack() loop

Test Results (Pixel 10 Pro, PowerVR D-Series DXT-48-1536 MC1)

| Test | Status |
| --- | --- |
| 12 multi-op chain models (2+ prepacked constants) | All PASS |
| 6 Hardswish isolation models | All PASS |
| 9 single scalar-constant models | All PASS |
| MobileNet V3 Small FP32 (cat.jpg) | PASS - class 281 (37.09%), exact match with XNNPACK |
| MobileNet V3 Small FP16 (cat.jpg) | PASS - class 281 (37.19%), correct within FP16 precision |

Trade-off

Serializing prepack dispatches increases model load time on PowerVR since each constant requires a separate command buffer submission. For MobileNet V3 Small, load time increases from ~50ms to ~200ms. This only affects model loading (one-time cost), not inference latency.

Related

Test Plan

  • 12 multi-op chain models pass on Pixel 10 Pro (PowerVR)
  • 6 Hardswish isolation models pass on Pixel 10 Pro
  • MobileNet V3 Small FP32 and FP16 produce correct classification on Pixel 10 Pro
  • Verify no regression on Adreno/Mali (serialize_prepack is guarded by device_is_powervr())

cc @SS-JIA @manuelcandales @digantdesai @cbilgin

Add PowerVR GPU type detection to the Vulkan backend device enumeration,
PowerVR-specific workgroup size tuning for convolution operators, and
correctness fixes for PowerVR's TBDR architecture.

Changes:
- Add POWERVR to DeviceType enum with string detection
- Add device_is_powervr() convenience method on ComputeGraph
- Add PowerVR-specific workgroup sizes (32 instead of 64) for
  convolution dispatch to match PowerVR execution unit configuration
- Force optimal tiling on PowerVR (linear tiling may produce
  incorrect results in compute shaders on TBDR architecture)
- Enable robustBufferAccess on PowerVR for well-defined OOB behavior

Tested on Pixel 10 Pro (PowerVR D-Series DXT-48-1536 MC1):
- FP32 convolution passes all tests
- Non-conv FP16 ops (add, multiply) pass correctly
- FP16 conv has known bias texture initialization issue (pytorch#17299)

Related: pytorch#17299

set_staging_zeros() and cast_and_copy_from() write to staging buffers
without flushing, unlike copy_from() which correctly calls
vmaFlushAllocation(). On GPUs where VMA staging memory is not
host-coherent (e.g. PowerVR), CPU writes stay in cache and the GPU
reads garbage, causing incorrect inference results.

This fixes FP16 convolution producing wrong outputs on PowerVR GPUs
where the implicit zero-bias texture reads uninitialized memory.

Remove PowerVR-specific diagnostic cerr logging and the unused <iostream>
include that were used during development.
The local_wg_size variable was computed but never used since
DynamicDispatchNode uses the conv2d_local_wg_size callback
which already contains the PowerVR-specific logic.

- Remove unused wg_size variable left behind after removing inline
  workgroup size calculation (DynamicDispatchNode uses callbacks)
- Fix robustBufferAccess comment to accurately describe buffer-only scope
- Query device feature support before enabling robustBufferAccess

PowerVR corrupts prepacked constant data when multiple prepack compute
dispatches are batched in a single command buffer. Only the first
constant is correct; subsequent constants read as zero.

This caused MobileNet to produce NaN (division-by-zero in Hardswish
decomposition) and FP16 convolution to show a +0.5 bias offset.

Submit and wait after each prepack node on PowerVR to ensure each
constant is fully consumed before the next staging buffer is created.
Copilot AI review requested due to automatic review settings February 15, 2026 03:26

pytorch-bot bot commented Feb 15, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17468

Note: Links to docs will display an error until the docs builds have been completed.

❌ 13 New Failures, 1 Unrelated Failure

As of commit dc11701 with merge base 429925d:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 15, 2026
Copilot AI left a comment

Pull request overview

This PR fixes all Vulkan backend failures on PowerVR GPUs (Pixel 10 Pro) by implementing a workaround for a PowerVR driver bug where multiple prepack compute dispatches batched in a single command buffer cause data corruption. The fix serializes prepack dispatches by submitting and waiting after each prepack node, ensuring constants are fully processed before the next staging buffer is created.

Changes:

  • Add PowerVR device type detection and helper methods
  • Serialize prepack dispatches on PowerVR to work around driver data corruption
  • Implement PowerVR-specific workgroup sizes (32 instead of 64) for better hardware compatibility
  • Force optimal tiling on PowerVR and enable robustBufferAccess feature

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| Device.h | Adds POWERVR to DeviceType enum for device identification |
| Device.cpp | Implements PowerVR device name detection (case-insensitive string matching) |
| Adapter.cpp | Enables robustBufferAccess feature on PowerVR for well-defined out-of-bounds behavior |
| Convolution.cpp | Adds PowerVR-specific workgroup sizes (32 vs 64) and removes duplicated inline workgroup computation logic |
| ComputeGraph.h | Adds device_is_powervr() helper method for device-specific logic |
| ComputeGraph.cpp | Implements the core fix: serializes prepack dispatches on PowerVR by submitting and waiting after each prepack node |
| Context.cpp | Forces optimal tiling on PowerVR to avoid linear tiling compute shader issues |


@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


abdelaziz-mahdy commented Feb 15, 2026

Load Time Benchmark (Pixel 10 Pro — PowerVR DXT-48-1536)

I ran benchmarks comparing Vulkan load times with and without the serialized prepack fix:

With serialize_prepack (this PR):

| Model | XNNPACK | Vulkan | Ratio |
| --- | --- | --- | --- |
| YOLO11n | 11ms | 8,890ms | 808x |
| YOLOv8n | 14ms | 6,504ms | 465x |
| MobileNet V3 Small | 5ms | 8,230ms | 1646x |

Without serialize_prepack (batched, produces wrong results on PowerVR):

| Model | XNNPACK | Vulkan | Ratio |
| --- | --- | --- | --- |
| YOLO11n | 11ms | 171ms | 16x |
| YOLOv8n | 14ms | 161ms | 12x |
| MobileNet V3 Small | 5ms | 88ms | 18x |

The serialization itself accounts for a 40–94x increase in load time. Without it, Vulkan loads are ~90–170ms (acceptable). With it, they jump to 6–9 seconds because each prepack node requires a separate command buffer submit + GPU wait.

Would batching dispatches in small groups (e.g., 4–8 per command buffer) instead of fully serializing to 1-per-submit be worth investigating? That could find a middle ground between correctness on PowerVR and load time.

Note: In real app usage, Vulkan inference is noticeably faster than XNNPACK on this device, so the load time is a one-time cost that pays off during repeated inference.

@nil-is-all nil-is-all added the module: vulkan Issues related to the Vulkan delegate and code under backends/vulkan/ label Feb 17, 2026
@SS-JIA
Copy link
Contributor

SS-JIA commented Feb 19, 2026

@abdelaziz-mahdy thanks for this fix! The load time increases are quite substantial, unfortunately. Are there any insights as to why PowerVR GPU requires this to produce correct outputs?

@abdelaziz-mahdy
Contributor Author

@SS-JIA, sadly I don't know of another workaround; if you have any ideas, let me know. Below is a summary of my testing.

Overview

Extensive testing on Pixel 10 Pro (PowerVR D-Series DXT-48-1536 MC1) confirms this is a broader PowerVR driver bug, not limited to push constants.

This matches a known issue on Imagination's developer forum where push constants get corrupted when updated multiple times within the same command buffer.

Test Results

I tested 11 different synchronization strategies between prepack dispatches, using MobileNet V3 Small (Vulkan output vs XNNPACK reference):

| Mode | Strategy | Load Time | Max Diff | NaN | Top-1 | Result |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | No barrier (baseline) | ~100ms | N/A | 1000 | NO | FAIL |
| 1 | Execution barrier | ~100ms | N/A | 1000 | NO | FAIL |
| 2 | Memory barrier (compute->compute) | ~100ms | N/A | 1000 | NO | FAIL |
| 3 | Submit + wait per node | ~8200ms | 0.50 | 0 | YES | PASS |
| 4 | Submit, no wait (new CB) | ~7800ms | 0.50 | 0 | YES | PASS |
| 5 | Submit, no wait + flush | ~8200ms | 0.50 | 0 | YES | PASS |
| 6 | Batch every 2 nodes | ~4080ms | 50.12 | 0 | NO | FAIL |
| 7 | Batch every 4 nodes | ~2100ms | 4.27 | 0 | NO | FAIL |
| 8 | Batch every 8 nodes | ~1100ms | 4.27 | 0 | NO | FAIL |
| 9 | Batch every 16 nodes | ~600ms | 4.27 | 0 | NO | FAIL |
| 10 | Hybrid: UBO + serialize PC-only | ~7600ms | 4.27 | 0 | NO | FAIL |

Findings

  1. Barriers don't help (modes 0-2) - Neither execution-only nor full memory barriers fix it, ruling out a synchronization issue.

  2. Only new command buffers work (modes 3-5) - Submitting and starting a fresh CB after each node is the only fix. Mode 4 shows I don't even need vkQueueWaitIdle, just a CB boundary - pointing to internal CB state corruption.

  3. Even 2 nodes per CB corrupts (mode 6) - No middle ground. Strictly 1 dispatch per CB required.

  4. Not limited to push constants (mode 10) - I replaced push constants with UBOs for standard/bias prepack shaders (using the existing no_pc shader variants + sizes_ubo()). UBO-only nodes still corrupt when batched. The bug affects all dispatch types sharing a command buffer.

Impact

  • ~12 line fix, PowerVR-only, zero impact on other GPUs
  • Load time: ~100ms to ~8200ms (one-time at model load, inference unaffected)
  • Mode 4 (submit without CPU wait) could reduce overhead while still fixing the bug


Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. module: vulkan Issues related to the Vulkan delegate and code under backends/vulkan/


Development

Successfully merging this pull request may close these issues.

Vulkan backend produces all-zero outputs on PowerVR GPU (Pixel 10 Pro)

4 participants