Conversation
Semi-orthogonal, but related: I wonder if we should start considering a CI that runs on a better GPU. If we want CI for Genesis, we will certainly need this. AMD GPU cloud does provide such GPUs, I think (or there is packet.ai, which @v01dXYZ discovered).
Opus summary:

Summary

Fixes a SPIR-V codegen bug where plain loads/stores of float buffer elements could alias incorrectly with atomic float operations.

Root cause

In SPIR-V / Vulkan, each (buffer, element_type) pair gets its own view of the same underlying buffer, and a plain load through one view is not ordered against an atomic through another. The reverse-mode load-and-clear pattern

m.grad[i][j, k] += loss.grad
tmp = m.grad[i][j, k]
m.grad[i][j, k] = 0
n.grad += tmp * factor

hits this exactly: the load reads the stale zero initial value.

Behavior matrix

Good points

Bad points / risks
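The punning at the heart of the bug can be modelled in plain Python, outside any GPU: the CAS-emulated atomic add operates on the u32 bit pattern of the same four bytes that the float view loads. This is an illustrative sketch with invented helper names, not project code.

```python
import struct

def f32_to_u32(f):
    # Reinterpret the bits of a 32-bit float as an unsigned 32-bit int.
    return struct.unpack("<I", struct.pack("<f", f))[0]

def u32_to_f32(u):
    # The reverse reinterpretation: same bytes, float view.
    return struct.unpack("<f", struct.pack("<I", u))[0]

def cas_emulated_fadd(buffer, index, value):
    """Model of the OpAtomicCompareExchange-based float add on the u32 view.

    `buffer` is a list of u32 words; the "float view" of slot `index` is
    just the f32 reinterpretation of the same word.
    """
    while True:
        old_u32 = buffer[index]                      # OpAtomicLoad (u32 view)
        new_u32 = f32_to_u32(u32_to_f32(old_u32) + value)
        if buffer[index] == old_u32:                 # OpAtomicCompareExchange
            buffer[index] = new_u32
            return u32_to_f32(old_u32)

buf = [f32_to_u32(0.0)] * 4       # one buffer, seen through two "views"
cas_emulated_fadd(buf, 0, 1.5)    # atomic add through the u32 view
print(u32_to_f32(buf[0]))          # plain float-view load of the same word: 1.5
```

In Python the two views trivially see the same list, so the read is coherent; the bug is that SPIR-V drivers may treat the two views as distinct, non-aliasing variables unless told otherwise.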
Most of the bad points surfaced by Opus above look like AI-flation to me. But I'd like your thoughts on the following observation: "Asymmetry between load and store for u1: load uses u8, store uses i8. This matches the previous behavior, but the helper does not encode it — store_buffer still
Force-pushed cc6d8ea to 09976fc
Force-pushed ab96033 to 547a822
@claude review
My prior concern about cross-view aliasing on the CAS and non-add float atomic paths is resolved by fix #3 (the Aliased decoration on multi-view buffers in get_buffer_value) — that is the right fix. Given this now bundles 6 independent SPIR-V correctness fixes across codegen and the Vulkan device creator, it warrants a human look.
Extended reasoning...
Overview
The PR has grown from a single fix (route plain float loads/stores through the native float buffer view) to six bundled SPIR-V correctness fixes: (1) native float view for load/store; (2) ArrayStride decoration on PSB scalar/vector pointers; (3) Aliased decoration on every buffer that gets a second type view; (4) StorageBuffer{8,16}BitAccess capabilities plus SPV_KHR_{8,16}bit_storage extensions, gated on queried Vulkan feature bits; (5) route u1→i8 store widening through IRBuilder::cast (OpSelect) instead of spec-invalid OpBitcast %char %bool; (6) move validation-layer availability check above the cached-VkInstance short-circuit in create_instance. Touches spirv_codegen.cpp, spirv_ir_builder.cpp, the TaskCodegen header, rhi_constants.inc.h, and vulkan_device_creator.cpp.
Security risks
None that I can identify. These are correctness fixes in the SPIR-V code generator and the Vulkan device creator; they do not touch auth, crypto, permissions, or user-input boundaries. The narrow-type storage capabilities are only emitted when the underlying Vulkan device feature is queried true, so they cannot be forged into a shader that the driver would reject only at pipeline creation.
Level of scrutiny
High. SPIR-V codegen is production-critical — miscompiles here manifest as silent gradient corruption (fix #1 / #3), wrong ndarray reads (fix #2), pipeline-compiler crashes (fix #5), or spec-invalid shaders that validate only on lenient drivers (fix #4). The aliasing argument is subtle, and the decoration strategy in get_buffer_value relies on the invariant that every at_buffer call funnels through this path. No new tests are added — every fix leans on an existing regression. Worth a human with SPIR-V / Vulkan context walking through the aliasing argument and confirming the decoration site sweeps in every pairing (native-atomic vs CAS-emulation atomic vs plain load, per buffer, per type).
Other factors
My prior flagged bug has been directly addressed by fix #3, which is a meaningful progression. The bug hunting system found no new issues on this revision. The PR is labeled lowpri but the changes are not — they are correctness fixes with concrete reproducers documented in the description, which is exactly the kind of PR that benefits from a human signing off on the mechanism rather than shadow-approval.
checklist:
=> ok to merge
Force-pushed 4876859 to 731c547
…AccessChain scales correctly

SPV_KHR_physical_storage_buffer requires an explicit `ArrayStride` decoration on `OpTypePointer PhysicalStorageBuffer` when the pointer is used with `OpPtrAccessChain`: the `Element` index is multiplied by that stride to produce the byte offset from the base address. Without the decoration the stride is undefined, and strict drivers collapse every indexed access back to the base - every `arr[i]` read returns `arr[0]` across the whole kernel, and every indexed ndarray write lands on slot 0.

Narrow the fix to scalar/vector pointees. Struct and array pointees already carry explicit layout decorations (each member's `Offset`, array `ArrayStride`), so adding a top-level `ArrayStride` on the pointer to those is redundant; for scalars/vectors the natural stride is just the pointee's byte size. Pointers in Uniform / StorageBuffer / Input / Output / Workgroup storage classes don't use PSB arithmetic at all, so the decoration is skipped there.

This stacks on the load/store aliasing fix: `pick_buffer_access_type` unblocks reverse-mode atomics on devices with `shaderBufferFloat32AtomicAdd`; the stride decoration unblocks indexed reads/writes through PSB on every Vulkan device. The two together drop our Vulkan test failure count from 506 to 84 across `tests/python/` (excluding the adstack / ndarray suites, which go from 12 failing to fully green on their own).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
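A minimal model of the stride arithmetic (illustrative Python, not the codegen): `OpPtrAccessChain` computes the byte address as base + element * ArrayStride, so a pointer whose stride is left undefined (modelled here as 0) collapses every index onto the base element.

```python
def ptr_access_chain(base_addr, element, array_stride):
    # SPV_KHR_physical_storage_buffer: the Element operand is scaled by
    # the pointer type's ArrayStride to produce the byte offset.
    return base_addr + element * array_stride

F32_STRIDE = 4  # sizeof(float): the stride this commit decorates scalar pointers with

# Decorated pointer: each index lands on its own element.
addrs = [ptr_access_chain(0x1000, i, F32_STRIDE) for i in range(4)]
print([hex(a) for a in addrs])    # ['0x1000', '0x1004', '0x1008', '0x100c']

# Undefined stride, modelled as 0: every arr[i] collapses to arr[0].
collapsed = [ptr_access_chain(0x1000, i, 0) for i in range(4)]
print(len(set(collapsed)))         # 1
```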
…ss-view accesses are ordered

Review feedback on the native-float-view fix: that fix only closes the aliasing hazard when the atomic happens to use the native float view, which is true only for `OpAtomicFAddEXT` on devices that advertise `shaderBufferFloat32AtomicAdd` / `shaderBufferFloat64AtomicAdd` / `shaderBufferFloat16AtomicAdd`. Three other paths still emit the atomic through the u32-punned view while plain load/store go through the native float view:

- CAS emulation on devices without the EXT (NVIDIA T4, our CI runner among them): `visit(AtomicOpStmt)` routes f32-add through `at_buffer(dest, get_quadrants_uint_type(dt))` and `atomic_operation` then runs `OpAtomicLoad` / `OpAtomicCompareExchange` on the u32 view. The plain load, now on the float view, is unordered against that CAS loop.
- Non-add float atomics (min / max / mul) on every device: these never go through `OpAtomicFAddEXT`, always take the CAS path, always bind to the u32 view.
- f16 / f64 add when `spirv_has_atomic_float16_add` / `spirv_has_atomic_float64_add` is not set: same CAS-on-u32 path.

Close all of them structurally with an `Aliased` decoration on every buffer `OpVariable` that gets a second type view. `Aliased` is the SPIR-V signal that accesses through a variable may touch the same memory as accesses through another variable in the same storage class -- the driver must therefore preserve ordering across views, which is exactly what we need for the load-and-clear reverse-mode pattern to read back a freshly-atomic-added gradient.

Decorate lazily: single-view buffers stay un-decorated so the compiler can still apply cross-variable scheduling on them. The decoration is applied when (and only when) a second distinct type view is minted, and it covers the newly-minted view plus every pre-existing peer in one sweep (tracked via `buffer_views_by_buffer_` + `aliased_decorated_buffer_ids_` to avoid emitting the decoration twice on the same id).
`test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan]` still passes; the wider Vulkan sweep on `tests/python/` (with `test_adstack.py` / `test_ndarray.py` excluded for independence) stays at 83 failing / 1778 passing, the same delta as the native-float-view fix alone on this device. The decoration doesn't change the count on an AMD RX 7900 XTX because that device exposes `shaderBufferFloat32AtomicAdd` and hits the native-float-view path for every failing case in the suite; the decoration's job is to keep the CAS / non-add / f16-f64-no-native-add paths correct on devices that don't.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
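The lazy decoration bookkeeping described above can be sketched like this (a hypothetical Python rendition of the C++ tracking; the attribute names mirror `buffer_views_by_buffer_` / `aliased_decorated_buffer_ids_` but everything else is invented for illustration):

```python
class AliasedTracker:
    """Decorate a buffer's OpVariables with Aliased only once a second
    distinct type view of the same buffer is minted."""

    def __init__(self):
        self.views_by_buffer = {}   # buffer -> list of (view_type, var_id)
        self.decorated_ids = set()  # var ids already carrying Aliased
        self.decorations = []       # emitted (var_id, "Aliased") pairs

    def _decorate(self, var_id):
        if var_id not in self.decorated_ids:   # dedupe per variable id
            self.decorated_ids.add(var_id)
            self.decorations.append((var_id, "Aliased"))

    def mint_view(self, buffer, view_type, var_id):
        views = self.views_by_buffer.setdefault(buffer, [])
        views.append((view_type, var_id))
        if len(views) >= 2:
            # Second (or later) view: sweep in the new view plus every
            # pre-existing peer. Single-view buffers stay undecorated.
            for _, peer_id in views:
                self._decorate(peer_id)

t = AliasedTracker()
t.mint_view("grad_buf", "f32", 7)    # first view: no decoration yet
t.mint_view("grad_buf", "u32", 9)    # second view: both 7 and 9 decorated
t.mint_view("grad_buf", "u32v", 11)  # third view: only 11 newly decorated
print(sorted(t.decorated_ids))        # [7, 9, 11]
print(len(t.decorations))             # 3 (no duplicates)
```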
… routing to an explicit float whitelist
Review feedback on the native-float-view fix:
- Switching `f16` loads / stores to the native `f16` descriptor view
requires the `CapabilityStorageBuffer16BitAccess` SPIR-V capability
(paired with the `SPV_KHR_16bit_storage` extension). That cap was
never emitted by `spirv_ir_builder.cpp`, so routing `f16` native
produced shaders that strict Vulkan drivers reject at pipeline
creation -- and even for the pre-existing uint-punning path, every
`i16` / `u16` load through a StorageBuffer already required the
same capability on spec-strict drivers, so the gap was latent
regardless. The same story applies to `i8` / `u8` and
`CapabilityStorageBuffer8BitAccess`.
- The `is_real(dt)` predicate in `pick_buffer_access_type` is too
open-ended: a future real-like primitive (bfloat16, an fp8 variant,
anything else `is_real` eventually admits) would silently fall into
the native-view branch before its storage-capability story has been
audited.
Fix both in one commit:
1. Add `spirv_has_storage_buffer_{8,16}bit_access` device caps to
`rhi_constants.inc.h`. Query them from the existing
`VkPhysicalDevice{8,16}BitStorageFeatures` structs in
`vulkan_device_creator.cpp`, strictly gated on
`storageBuffer{8,16}BitAccess` feature bits (the Vulkan 1.2-core
`VK_KHR_{8,16}bit_storage` promotion does not imply the feature is
supported, matching the pattern already used for
`bufferDeviceAddress`).
2. Emit `CapabilityStorageBuffer{8,16}BitAccess` + the matching
`SPV_KHR_{8,16}bit_storage` extension in the SPIR-V header
whenever the device caps are set. Unconditional relative to the
current kernel's type use -- the header cost is one extra
`OpCapability` + `OpExtension` per bit width, negligible against
the benefit of spec-compliant narrow StorageBuffer access for every
kernel that declares an `i8` / `i16` / `f16` field or ndarray.
3. Replace the `is_real(dt)` branch in `pick_buffer_access_type` with
an explicit `{f16, f32, f64}` whitelist. The three primitive
floats that exist today are the ones we've audited the
storage-capability story for; anything else must be added here
deliberately.
All existing Vulkan tests that exercised the uint-punned narrow
storage access (`test_ndarray.py` dtype parametrizations on `i8` /
`i16` / `f16`, etc.) retain their current behavior because the
capability emission is additive and gated on a feature the device
already exposes. The native-view routing scope shrinks from "every
real type" to "every real type we currently support", which
eliminates the silent-regression surface for future types.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
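A sketch of the routing decision after this commit (illustrative Python; the real `pick_buffer_access_type` is C++, and the type names and mapping here are simplified stand-ins):

```python
# The three floats whose StorageBuffer capability story has been audited.
NATIVE_FLOAT_WHITELIST = {"f16", "f32", "f64"}

def pick_buffer_access_type(dt, quadrants_uint_type_of):
    """Route whitelisted floats to their native view; everything else,
    including future reals like bf16 / fp8, keeps the uint-punned view
    until it is deliberately added to the whitelist."""
    if dt in NATIVE_FLOAT_WHITELIST:
        return dt
    return quadrants_uint_type_of(dt)

# Hypothetical same-width uint mapping for the demo.
uint_of = {"f32": "u32", "i8": "u8", "u1": "u8", "bf16": "u16"}.get

print(pick_buffer_access_type("f32", uint_of))   # f32 (native view)
print(pick_buffer_access_type("bf16", uint_of))  # u16 (not audited: uint view)
```

The open-ended `is_real(dt)` predicate would have sent `bf16` down the native-view branch silently; the whitelist makes that an explicit opt-in.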
… accepts the cast chain
`TaskCodegen::store_buffer` widened a `u1` (bool) value to its backing
`i8` slot by emitting `OpBitcast %char %bool_val`. That's ill-formed
SPIR-V: `OpBitcast` requires a numerical scalar / vector or pointer
operand and rejects booleans (`OpTypeBool` has no defined bit pattern,
so a bitcast can't be defined either). `spirv-val` flagged the exact
message:
Expected input to be a pointer or int or float vector or scalar: Bitcast
%46 = OpBitcast %char %tmp3_u1
Most drivers just crash in the pipeline compiler on the ill-formed
input rather than surface a validation error. On Mesa RADV (AMD RX
7900 XTX) the symptom was a hard `SIGSEGV` deep inside
`libvulkan_radeon.so::create_compute_pipeline` the moment any kernel
storing to a `u1` field / ndarray / struct member was registered.
Python-side, that looked like `timeout: the monitored command dumped
core` on every one of the `test_tensor_consistency`,
`test_pickle[vulkan-u1]`, `test_matrixfree_{cg,bicgstab}`,
`test_struct_field_with_bool`, `test_dual_return_spirv`, and
`test_offload_cross` tests.
Route the `u1 -> i8` widening through `IRBuilder::cast`, which lowers
`bool -> int` to the canonical `OpSelect(cond, 1, 0)` with the 1 / 0
constants produced at the target integer type. That's the same route
`load_buffer`'s reverse path already uses on the read side, matches
what `IRBuilder::cast` does in every other `u1` context in the
codegen, and preserves the "`u1` serialises as 0 / 1" behaviour every
`to_numpy()` / `from_numpy()` user depends on. Non-`u1` stores keep
the existing `OpBitcast` path unchanged, so no other type path
changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
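The difference between the two lowerings, modelled in Python (a sketch; the real code emits SPIR-V instructions, not Python):

```python
def widen_u1_via_select(flag: bool) -> int:
    """Spec-valid bool -> i8 widening: the moral equivalent of
    OpSelect(cond, 1, 0) with the constants produced at the target
    integer type. Preserves the 'u1 serialises as 0 / 1' convention."""
    return 1 if flag else 0

# The removed path was the equivalent of reinterpreting the bool's bits
# (OpBitcast %char %bool). OpTypeBool has no defined bit pattern, so there
# is nothing valid to reinterpret: spirv-val rejects it, and RADV's
# pipeline compiler crashed on the ill-formed input.

print(widen_u1_via_select(True), widen_u1_via_select(False))  # 1 0
```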
…or init, not just the first

`VulkanDeviceCreator::create_instance` normalises `params_.enable_validation_layer` against `check_validation_layer_support()` - flipping it to `false` when the host has no `VK_LAYER_KHRONOS_validation` - and then later gates the `spirv_has_non_semantic_info` cap and the logical-device layer array on that flag.

The normalisation used to live below the cached-`VkInstance` short-circuit that the same function takes to work around an NVIDIA driver bug around repeated `vkDestroyInstance`/`vkCreateInstance` cycles (the instance is kept alive in the `VulkanLoader` singleton for process lifetime). Every re-init after the first one then read `params_.enable_validation_layer` as `true` (whatever the caller passed in, usually `config.debug`), jumped straight to the cached-instance return, and skipped the flip entirely - even on a host where the first `create_instance` had flipped it to `false`.

The visible symptom in `test_overflow.py` / `test_print.py` was the first parametrisation passing (correct cap = 0) and every subsequent one in the same pytest session failing (stale cap = 1). Each test's `@test_utils.test(... debug=True)` wrapper calls `qd.reset()` + `qd.init(...)`, which reconstructs the `VulkanDeviceCreator` with fresh params, so the first cycle flipped to `false`, but the second / third / ... cycles preserved `true`. The `spirv_has_non_semantic_info` cap then got set, the `NonSemantic.DebugPrintf` extinst was emitted in the shader, the validation layer wasn't actually loaded to intercept the output, and the `capfd`-based assertion against the `Addition overflow detected` string failed with an empty capture buffer.

Move the layer-availability check above the cached-instance return. Re-running `vkEnumerateInstanceLayerProperties` on every cycle is microseconds and keeps the flag consistent whether the instance was freshly created or reused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
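The fixed control flow can be sketched in Python (hypothetical names and shapes; the real function is C++ in `vulkan_device_creator.cpp`):

```python
def create_instance(params, cached_instance, layer_available):
    """Fixed ordering: normalise the validation flag on *every* init,
    then take the cached-instance short-circuit (the NVIDIA
    repeated-create workaround)."""
    # Stand-in for re-running vkEnumerateInstanceLayerProperties:
    # microseconds per init, and the flag stays consistent whether the
    # instance is fresh or reused.
    if params["enable_validation_layer"] and not layer_available():
        params["enable_validation_layer"] = False
    if cached_instance is not None:
        return cached_instance       # reuse; flag already normalised
    return object()                  # stands in for a fresh VkInstance

# Second init in the same process, on a host without the layer: the
# caller passed debug=True, but the flag is still flipped before the
# cached-instance return. The bug was taking that return first.
params = {"enable_validation_layer": True}
create_instance(params, cached_instance="cached", layer_available=lambda: False)
print(params["enable_validation_layer"])  # False
```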
Force-pushed 731c547 to 9fa89c1
Vulkan SPIR-V correctness: atomic-view aliasing, PSB pointer stride, narrow-type storage caps, u1 storage cast, per-init validation-layer re-check

TL;DR

Six independent correctness gaps, all with observable silent-corruption failure modes on Vulkan:

1. Plain float loads / stores went through the u32-punned view of the buffer instead of the native f32 view

`OpAtomicFAddEXT` has always used the native `f32` view. Plain `OpLoad` / `OpStore` went through the `u32` view. Each `(buffer, element_type)` pair gets its own `OpVariable` with a fresh `DescriptorSet` / `Binding`, so those two views are different SPIR-V variables pointing at the same `VkBuffer`. Without an `Aliased` decoration, a driver is free to assume they don't alias, so the plain load is not memory-ordered against the preceding atomic. Reverse-mode AD's load-and-clear pattern (`x.grad[i] += delta; tmp = x.grad[i]; x.grad[i] = 0; y.grad += tmp * factor`) reads the stale zero: silent gradient drop.

Originally discovered via `tests/python/test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan]` asserting `0.0 == 1.0`.

2. `OpTypePointer` in `PhysicalStorageBuffer` was missing `ArrayStride`

`OpPtrAccessChain` on a PSB pointer multiplies `Element` by `ArrayStride` to produce the byte offset. We were emitting bare pointers; strict drivers collapsed indexed access back to the base. Symptom: every `arr[i]` ndarray read returns `arr[0]`.

3. Type-punned views of the same buffer weren't flagged as mutually `Aliased`

Fix 1 closes the aliasing hazard only for the native-atomic-add hot path. CAS-emulation atomic on devices without `shaderBufferFloat32AtomicAdd`, non-add float atomics (min/max/mul), and `f16`/`f64`-without-native-add each route the atomic through `u32` while plain load / store are now on the native float view -- same cross-view aliasing, different pairing. Close all of them with an `Aliased` decoration on every buffer `OpVariable` that gets a second type view. Lazy: single-view buffers stay undecorated.

4. Narrow-type `StorageBuffer` access requires `StorageBuffer{8,16}BitAccess` capabilities that weren't emitted

The uint-punning path has always relied on `OpLoad` / `OpStore` through `u16` / `u8` descriptor-bound pointers. Per `SPV_KHR_16bit_storage` / `SPV_KHR_8bit_storage`, that needs `CapabilityStorageBuffer16BitAccess` / `CapabilityStorageBuffer8BitAccess` in the header. Neither capability nor its extension was emitted, so strict drivers were within their rights to reject every Quadrants shader touching an `i8` / `i16` / `f16` / `u8` / `u16` field. Also narrow the `is_real(dt)` predicate in `pick_buffer_access_type` to an explicit `{f16, f32, f64}` whitelist so a future `bfloat16` / `fp8` doesn't silently fall through.

5. Widening a `u1` value to its `i8` storage slot used `OpBitcast`, which is ill-formed on booleans

`TaskCodegen::store_buffer` widened `u1` through `OpBitcast %char %bool_val`. `spirv-val`: `Expected input to be a pointer or int or float vector or scalar: Bitcast`. Mesa RADV (AMD RX 7900 XTX) crashed deep inside `libvulkan_radeon.so::create_compute_pipeline` with a raw `SIGSEGV` the moment a `u1` field / ndarray / struct member store was registered. Surfaced as `timeout: the monitored command dumped core` across `test_tensor_consistency`, `test_pickle[vulkan-u1]`, `test_matrixfree_{cg,bicgstab}`, `test_struct_field_with_bool`, `test_dual_return_spirv`, and `test_offload_cross`.

Route the `u1 -> i8` widening through `IRBuilder::cast`, which lowers `bool -> int` to the canonical `OpSelect(cond, 1, 0)` at the target integer type. That's what the load side already does, and it preserves "`u1` serialises as 0 / 1" for `to_numpy()` / `from_numpy()`.

6. `VulkanDeviceCreator::create_instance` skipped the validation-layer re-check on re-init

The instance-reuse short-circuit (kept around an NVIDIA driver bug with repeated `vkDestroyInstance` / `vkCreateInstance`) used to return before `check_validation_layer_support()` ran, so every re-init after the first one read `params_.enable_validation_layer` as whatever the caller passed (`True` for `debug=True`) instead of the flipped-`False` reflecting the host's actual layer availability. Downstream, the `spirv_has_non_semantic_info` cap followed the stale `True`, the shader emitted `NonSemantic.DebugPrintf` extinsts with no validation layer loaded to route them, and every `capfd`-based assertion after the first parametrisation in the same pytest session failed with an empty capture buffer. `test_overflow.py::test_shl_overflow[arch=vulkan-ty0-6]` passing + `test_shl_overflow[arch=vulkan-ty1-7]` failing in the same `-n 1` session was the fingerprint.

Move the check above the cached-instance return. Re-running `vkEnumerateInstanceLayerProperties` on every cycle costs microseconds and keeps the flag consistent.

Why
Every fix is a spec-compliance gap, not a driver workaround:

- #1: the aliasing hazard is visible only on devices advertising `shaderBufferFloat32AtomicAdd`. CI's T4 falls back to CAS emulation, which matched the old `u32` plain-load path, so the aliasing gap didn't exist there and the bug has been latent since native-atomic lowering landed on non-T4 drivers.
- #5: dozens of existing tests were observed to segfault; the only reason this wasn't caught earlier is that Quadrants' other `u1` code paths (loads, casts, atomics) already went through `IRBuilder::cast` and didn't exercise the broken store widening.
- #6: every `debug=True` test that expects `DebugPrintf` output in `capfd` is at risk.

None of the six regresses any other backend. Metal's MoltenVK path goes through SPIRV-Cross -> MSL, which ignores the aliasing question, the `ArrayStride`, the narrow-storage caps, and the device-creator-path ordering. AMDGPU / CUDA / CPU don't use the SPIR-V codegen. Fix 5 is a bit-identical refactor on every other type path (only `u1` changes); fix 6 runs only inside `VulkanDeviceCreator`.

Mechanism
Fix 1: `pick_buffer_access_type` routes whitelisted float types through the native view

`quadrants/codegen/spirv/spirv_codegen.cpp`.

Fix 2: `get_pointer_type` emits `ArrayStride` for PSB scalar / vector pointees

`quadrants/codegen/spirv/spirv_ir_builder.cpp::get_pointer_type`. When `storage_class == PhysicalStorageBuffer` and the pointee is a primitive scalar / vector, decorate with `ArrayStride = sizeof(pointee)`. Struct / array pointees already carry full per-member layout, so no double-decoration. Non-PSB storage classes skipped.

Fix 3: `get_buffer_value` decorates multi-view buffer variables with `Aliased`

`quadrants/codegen/spirv/spirv_codegen.cpp::get_buffer_value`. Track a per-`BufferInfo` list of existing type views. When a second (or later) view is minted, retroactively decorate every peer with `Aliased`; dedupe via an id-set. Single-view buffers stay undecorated.

Fix 4: narrow-type storage caps gated on Vulkan-queried feature bits

Three-part change:
- `quadrants/inc/rhi_constants.inc.h`: add `spirv_has_storage_buffer_{8,16}bit_access` to the `DeviceCapability` enum.
- `quadrants/rhi/vulkan/vulkan_device_creator.cpp`: query `VkPhysicalDevice{8,16}BitStorageFeatures::storageBuffer{8,16}BitAccess` and set the device caps. Strictly gated on the feature bit (the Vulkan 1.2-core `VK_KHR_{8,16}bit_storage` promotion doesn't imply the feature is supported).
- `quadrants/codegen/spirv/spirv_ir_builder.cpp`: emit `CapabilityStorageBuffer{8,16}BitAccess` + the matching `SPV_KHR_{8,16}bit_storage` extension when the device caps are set.

Fix 5: `store_buffer` routes `u1 -> i8` through `IRBuilder::cast`

`quadrants/codegen/spirv/spirv_codegen.cpp::TaskCodegen::store_buffer`. Widened the three-way logic: `IRBuilder::cast(int, bool)` already emits `OpSelect(cond, int_immediate(1), int_immediate(0))` at the target type -- matches the spec-compliant `bool -> int` lowering, matches what `load_buffer` does on the reverse path, and keeps the serialisation convention.

Fix 6: `create_instance` runs the layer-availability check before the cached-instance return

`quadrants/rhi/vulkan/vulkan_device_creator.cpp::create_instance`. Moved the check from after the cached-`VkInstance` short-circuit to before it. Dropped the duplicate check that used to live lower in the function.

Per-backend coverage matrix
Observable failure mode per fix (load-side `u1` cast unaffected):
- #1: plain float load unordered against a preceding atomic on devices advertising `shaderBufferFloat32AtomicAdd`.
- #2: `arr[i] -> arr[0]` collapse on strict-PSB drivers.
- #4: shader rejection on strict `SPV_KHR_{8,16}bit_storage` drivers.
- #5: `u1` store pipeline crash (RADV `SIGSEGV`; spec-invalid `OpBitcast %bool`).
- #6: stale `spirv_has_non_semantic_info` on re-init across a pytest session.
- Load side: the `bool -> int` cast was already correct, so `u1` reads are unchanged.

Tests
No new tests added in this PR. Every fix is pinned by an existing regression whose assertion already covered the failure mode:
- #1: `tests/python/test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan]`.
- #2: `tests/python/test_ndarray.py::test_ndarray_1d[arch=vulkan]` + every cross-suite test that does `arr[i]` indexed access.
- #4: `test_ndarray.py`'s dtype-parametrised suite on narrow primitives.
- #5: the segfaults in `test_tensor_consistency`, `test_pickle[vulkan-u1]`, `test_matrixfree_{cg,bicgstab}`, `test_struct_field_with_bool`, `test_dual_return_spirv`, `test_offload_cross` all disappear post-fix.
- #6: the `test_overflow.py` / `test_print.py` first-passes-then-fails pattern across a single pytest session.
shaderBufferFloat32AtomicAdd, strict PSB,storageBuffer{8,16}BitAccess), acrosstests/python/withtest_adstack.pyandtest_ndarray.pyexcluded for measurement independence: 506 failing pre-series, 22 failing post-series. Ontest_adstack.pyspecifically this takes Vulkan from 12 failing -> fully green; ontest_ndarray.py, from 3 failing -> fully green. The remaining 22 are all intest_scan.py-- a pre-series parallel-scan correctness issue unrelated to any of the six gaps here, left for a follow-up PR.Side-effect audit
pick_buffer_access_typereturnsget_quadrants_uint_type(dt)as beforeat_buffer(#1)ptr_val.stype.dt == u64branch preserved{f16, f32, f64}whitelist replacesis_real(dt)-- unknown reals fall into uint viewaliased_decorated_buffer_ids_id-setstorageBuffer{8,16}BitAccess(#4)u1stores (#5)OpBitcastpath as beforeu1loads (#5)IRBuilder::castpre-fixquadrants/codegen/spirv/,quadrants/rhi/vulkan/,quadrants/inc/untouched