
[SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck#513

Merged
duburcqa merged 6 commits into main from duburcqa/fix_spirv_float_atomic_aliasing
Apr 24, 2026

Conversation

@duburcqa
Contributor

@duburcqa commented Apr 19, 2026

Vulkan SPIR-V correctness: atomic-view aliasing, PSB pointer stride, narrow-type storage caps, u1 storage cast, per-init validation-layer re-check

Six small SPIR-V codegen / RHI changes, each fixing either silent corruption or a latent validation failure on Vulkan, stacked so each builds on the last. None regresses any other backend.

TL;DR

Six independent correctness gaps, all with observable silent-corruption failure modes on Vulkan:

1. Plain float loads / stores went through the u32-punned view of the buffer instead of the native f32 view

OpAtomicFAddEXT has always used the native f32 view. Plain OpLoad / OpStore went through the u32 view. Each (buffer, element_type) pair gets its own OpVariable with a fresh DescriptorSet / Binding, so those two views are different SPIR-V variables pointing at the same VkBuffer. Without an Aliased decoration, a driver is free to assume they don't alias, so the plain load is not memory-ordered against the preceding atomic. Reverse-mode AD's load-and-clear pattern (x.grad[i] += delta; tmp = x.grad[i]; x.grad[i] = 0; y.grad += tmp * factor) reads the stale zero -- a silent gradient drop.

Originally discovered via tests/python/test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan] asserting 0.0 == 1.0.

2. OpTypePointer in PhysicalStorageBuffer was missing ArrayStride

OpPtrAccessChain on a PSB pointer multiplies Element by ArrayStride to produce the byte offset. We were emitting bare pointers; strict drivers collapsed indexed access back to the base. Symptom: every arr[i] ndarray read returns arr[0].

@qd.kernel
def kern(arr_a: qd.types.ndarray(dtype=qd.i32, ndim=1),
         out: qd.types.ndarray(dtype=qd.i32, ndim=1)):
    for _d in range(1):
        out[0] = arr_a[0]; out[1] = arr_a[1]; out[2] = arr_a[2]
# arr_a=[10,20,30] -> out=[10,10,10] pre-fix, [10,20,30] post-fix.

3. Type-punned views of the same buffer weren't flagged as mutually Aliased

Fix 1 closes the aliasing hazard only for the native-atomic-add hot path. CAS-emulation atomic on devices without shaderBufferFloat32AtomicAdd, non-add float atomics (min/max/mul), and f16 / f64-without-native-add each route the atomic through u32 while plain load / store are now on the native float view -- same cross-view aliasing, different pairing. Close all of them with an Aliased decoration on every buffer OpVariable that gets a second type view. Lazy: single-view buffers stay undecorated.

4. Narrow-type StorageBuffer access requires StorageBuffer{8,16}BitAccess capabilities that weren't emitted

The uint-punning path has always relied on OpLoad / OpStore through u16 / u8 descriptor-bound pointers. Per SPV_KHR_16bit_storage / SPV_KHR_8bit_storage, that needs CapabilityStorageBuffer16BitAccess / CapabilityStorageBuffer8BitAccess in the header. Neither capability nor its extension was emitted, so strict drivers were within their rights to reject every Quadrants shader touching an i8 / i16 / f16 / u8 / u16 field. Also narrow the is_real(dt) predicate in pick_buffer_access_type to an explicit {f16, f32, f64} whitelist so a future bfloat16 / fp8 doesn't silently fall through.

5. Widening a u1 value to its i8 storage slot used OpBitcast, which is ill-formed on booleans

TaskCodegen::store_buffer widened u1 through OpBitcast %char %bool_val. spirv-val: Expected input to be a pointer or int or float vector or scalar: Bitcast. Mesa RADV (AMD RX 7900 XTX) crashed deep inside libvulkan_radeon.so::create_compute_pipeline with a raw SIGSEGV the moment a u1 field / ndarray / struct member store was registered. Surfaced as timeout: the monitored command dumped core across test_tensor_consistency, test_pickle[vulkan-u1], test_matrixfree_{cg,bicgstab}, test_struct_field_with_bool, test_dual_return_spirv, and test_offload_cross.

Route the u1 -> i8 widening through IRBuilder::cast, which lowers bool -> int to the canonical OpSelect(cond, 1, 0) at the target integer type. That's what the load side already does, and it preserves "u1 serialises as 0 / 1" for to_numpy() / from_numpy().

6. VulkanDeviceCreator::create_instance skipped the validation-layer re-check on re-init

The instance-reuse short-circuit (kept around an NVIDIA driver bug with repeated vkDestroyInstance / vkCreateInstance) used to return before check_validation_layer_support() ran, so every re-init after the first one read params_.enable_validation_layer as whatever the caller passed (True for debug=True) instead of the flipped-False reflecting the host's actual layer availability. Downstream, the spirv_has_non_semantic_info cap followed the stale True, the shader emitted NonSemantic.DebugPrintf extinsts with no validation layer loaded to route them, and every capfd-based assertion after the first parametrisation in the same pytest session failed with an empty capture buffer. test_overflow.py::test_shl_overflow[arch=vulkan-ty0-6] passing + test_shl_overflow[arch=vulkan-ty1-7] failing in the same -n 1 session was the fingerprint.

Move the check above the cached-instance return. Re-running vkEnumerateInstanceLayerProperties on every cycle costs microseconds and keeps the flag consistent.

Why

Every fix closes a spec-compliance gap rather than papering over a driver quirk:

None of the six regresses any other backend. Metal's MoltenVK path goes through SPIRV-Cross -> MSL, which ignores the aliasing question, the ArrayStride, the narrow-storage caps, and the device-creator-path ordering. AMDGPU / CUDA / CPU don't use the SPIR-V codegen. Fix 5 is a bit-identical refactor on every other type path (only u1 changes); fix 6 runs only inside VulkanDeviceCreator.

Mechanism

Fix 1: pick_buffer_access_type routes whitelisted float types through the native view

quadrants/codegen/spirv/spirv_codegen.cpp:

static DataType pick_buffer_access_type(DataType dt, const spirv::Value &ptr_val, spirv::IRBuilder &ir) {
  if (dt->is_primitive(PrimitiveTypeID::u1)) return PrimitiveType::u8;
  if (ptr_val.stype.dt == PrimitiveType::u64) return dt;
  if (dt->is_primitive(PrimitiveTypeID::f16) || dt->is_primitive(PrimitiveTypeID::f32) ||
      dt->is_primitive(PrimitiveTypeID::f64)) return dt;
  return ir.get_quadrants_uint_type(dt);
}
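As a hedged illustration (a standalone Python model of the rule above, not project code), the routing collapses to a whitelist plus two carve-outs:

```python
# Illustrative Python model of pick_buffer_access_type (not project code).
# Types are modelled as strings; the real helper operates on DataType objects.
FLOAT_WHITELIST = {"f16", "f32", "f64"}  # explicit, so a future bf16/fp8 won't slip through

def pick_buffer_access_type(dt, ptr_type):
    if dt == "u1":
        return "u8"        # bool is accessed through its u8 storage slot
    if ptr_type == "u64":
        return dt          # 64-bit (PSB) pointer path keeps dt directly
    if dt in FLOAT_WHITELIST:
        return dt          # native float view, memory-ordered against float atomics
    return "u" + dt[1:]    # integers keep the uint-punned view of the same width
```

A quick check against the behavior matrix below: `pick_buffer_access_type("f32", "ssbo")` yields the native `"f32"` view, while `"i16"` still routes to `"u16"`.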

Fix 2: get_pointer_type emits ArrayStride for PSB scalar / vector pointees

quadrants/codegen/spirv/spirv_ir_builder.cpp::get_pointer_type. When storage_class == PhysicalStorageBuffer and the pointee is a primitive scalar / vector, decorate with ArrayStride = sizeof(pointee). Struct / array pointees already carry full per-member layout, so no double-decoration. Non-PSB storage classes skipped.
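A minimal sketch of the stride rule (hypothetical helper; the byte sizes are the natural Vulkan scalar sizes, which is what the fix assumes for scalar / vector pointees):

```python
# Hypothetical sketch of the ArrayStride decision; not the real get_pointer_type.
SCALAR_SIZE = {"i8": 1, "u8": 1, "i16": 2, "u16": 2, "f16": 2,
               "i32": 4, "u32": 4, "f32": 4, "i64": 8, "u64": 8, "f64": 8}

def psb_array_stride(storage_class, pointee_kind, scalar, components=1):
    """Return the ArrayStride to decorate with, or None to skip the decoration."""
    if storage_class != "PhysicalStorageBuffer":
        return None  # non-PSB classes never feed OpPtrAccessChain byte arithmetic
    if pointee_kind not in ("scalar", "vector"):
        return None  # struct/array pointees already carry full per-member layout
    return SCALAR_SIZE[scalar] * components  # natural stride = pointee byte size
```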

Fix 3: get_buffer_value decorates multi-view buffer variables with Aliased

quadrants/codegen/spirv/spirv_codegen.cpp::get_buffer_value. Track per-BufferInfo list of existing type views. When a second (or later) view is minted, retroactively decorate every peer with Aliased; dedupe via an id-set. Single-view buffers stay undecorated.
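The lazy, retroactive decoration with id-set dedupe can be sketched as follows (a Python model; the container names mirror the description, not the actual C++ members):

```python
from collections import defaultdict

# Illustrative model of the lazy Aliased decoration; names are hypothetical.
views_by_buffer = defaultdict(list)  # buffer -> ids of its minted type views
aliased_ids = set()                  # ids already decorated, to avoid duplicates
decorations = []                     # emitted (decoration, id) pairs

def get_buffer_view(buffer, elem_type):
    view_id = f"{buffer}:{elem_type}"
    if view_id not in views_by_buffer[buffer]:
        views_by_buffer[buffer].append(view_id)
    if len(views_by_buffer[buffer]) > 1:       # a second view exists:
        for peer in views_by_buffer[buffer]:   # retroactively cover every peer
            if peer not in aliased_ids:
                aliased_ids.add(peer)
                decorations.append(("Aliased", peer))
    return view_id
```

A single-view buffer emits nothing; minting the second view decorates both; revisiting a view emits nothing further.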

Fix 4: narrow-type storage caps gated on Vulkan-queried feature bits

Three-part change:

  1. quadrants/inc/rhi_constants.inc.h: add spirv_has_storage_buffer_{8,16}bit_access to the DeviceCapability enum.
  2. quadrants/rhi/vulkan/vulkan_device_creator.cpp: query VkPhysicalDevice{8,16}BitStorageFeatures::storageBuffer{8,16}BitAccess and set the device caps. Strictly gated on the feature bit (the Vulkan 1.2-core VK_KHR_{8,16}bit_storage promotion doesn't imply the feature is supported).
  3. quadrants/codegen/spirv/spirv_ir_builder.cpp: emit CapabilityStorageBuffer{8,16}BitAccess + the matching SPV_KHR_{8,16}bit_storage extension when the device caps are set.
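The gating in parts 2 and 3 reduces to the following sketch (the feature-bit names follow Vulkan's VkPhysicalDevice{8,16}BitStorageFeatures; the dict-based interface is hypothetical):

```python
# Hedged sketch of the cap/extension pairing, gated strictly on queried feature bits.
def narrow_storage_header(features):
    caps, exts = [], []
    if features.get("storageBuffer16BitAccess"):
        caps.append("StorageBuffer16BitAccess")
        exts.append("SPV_KHR_16bit_storage")   # capability and extension always paired
    if features.get("storageBuffer8BitAccess"):
        caps.append("StorageBuffer8BitAccess")
        exts.append("SPV_KHR_8bit_storage")
    return caps, exts                          # nothing emitted if the bit is absent
```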

Fix 5: store_buffer routes u1 -> i8 through IRBuilder::cast

quadrants/codegen/spirv/spirv_codegen.cpp::TaskCodegen::store_buffer. The widening becomes a three-way branch:

if (val.stype.dt == ti_buffer_type) {
  val_bits = val;
} else if (val.stype.dt->is_primitive(PrimitiveTypeID::u1)) {
  val_bits = ir_->cast(ir_->get_primitive_type(ti_buffer_type), val);   // -> OpSelect(1, 0)
} else {
  val_bits = ir_->make_value(spv::OpBitcast, ir_->get_primitive_type(ti_buffer_type), val);
}

IRBuilder::cast(int, bool) already emits OpSelect(cond, int_immediate(1), int_immediate(0)) at the target type -- matches the spec-compliant bool -> int lowering, matches what load_buffer does on the reverse path, and keeps the serialisation convention.
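To make the lowering concrete, here is a toy instruction emitter (purely illustrative; the `emit` callback and its id scheme are invented) showing that the compliant path produces an OpSelect over constants at the target type and never an OpBitcast of a bool:

```python
# Toy SPIR-V-style emitter: spec-compliant bool -> int lowering via OpSelect.
instructions = []

def emit(op, *args):
    instructions.append((op,) + args)
    return len(instructions)  # toy result id

def lower_u1_to_int(bool_id, int_type):
    one = emit("OpConstant", int_type, 1)    # constants minted at the target type
    zero = emit("OpConstant", int_type, 0)
    return emit("OpSelect", int_type, bool_id, one, zero)

result = lower_u1_to_int(bool_id=0, int_type="char")
```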

Fix 6: create_instance runs the layer-availability check before the cached-instance return

quadrants/rhi/vulkan/vulkan_device_creator.cpp::create_instance. Moved:

if (params_.enable_validation_layer && !check_validation_layer_support()) {
  RHI_LOG_ERROR("Validation layers requested but not available, turning off... ...");
  params_.enable_validation_layer = false;
}

from after the cached-VkInstance short-circuit to before it. Dropped the duplicate check that used to live lower in the function.
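A minimal Python model of the bug and the fix (all names hypothetical; the cached instance stands in for the process-lifetime VkInstance kept around the NVIDIA workaround):

```python
# Minimal model: the layer check now runs before the cached-instance short-circuit.
_cached_instance = None

def create_instance(params, layers_available):
    global _cached_instance
    # Fix 6: normalise the flag BEFORE returning the cached instance, so every
    # re-init sees the host's actual layer availability, not the caller's request.
    if params["enable_validation_layer"] and not layers_available:
        params["enable_validation_layer"] = False
    if _cached_instance is not None:
        return _cached_instance          # pre-fix, this return skipped the check
    _cached_instance = object()
    return _cached_instance
```

With the check above the return, a second init with fresh params on a layer-less host still flips the flag, where pre-fix it would have kept the stale True.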

Per-backend coverage matrix

| Backend | #1 (float view) | #2 (ArrayStride) | #3 (Aliased) | #4 (storage caps) | #5 (u1 cast) | #6 (layer recheck) |
| --- | --- | --- | --- | --- | --- | --- |
| CPU | N/A - no SPIR-V | N/A | N/A | N/A | N/A | N/A |
| CUDA | N/A | N/A | N/A | N/A | N/A | N/A |
| AMDGPU | N/A | N/A | N/A | N/A | N/A | N/A |
| Vulkan (native) | Fixes silent-zero gradient on devices with shaderBufferFloat32AtomicAdd | Fixes arr[i] -> arr[0] collapse on strict-PSB drivers | Fixes the CAS / non-add / narrow-float paths #1 doesn't cover | Fixes pipeline rejection on strict SPV_KHR_{8,16}bit_storage drivers | Fixes u1 store pipeline crash (RADV SIGSEGV; spec-invalid OpBitcast %bool) | Fixes stale spirv_has_non_semantic_info on re-init across a pytest session |
| Metal / MoltenVK | No-op | No-op | No-op | No-op | No-op - MSL's bool -> int cast was already correct | N/A - MoltenVK has its own validation-layer story |

Tests

No new tests added in this PR. Every fix is pinned by an existing regression whose assertion already covered the failure mode:

Measured on an AMD RX 7900 XTX (advertises shaderBufferFloat32AtomicAdd, strict PSB, storageBuffer{8,16}BitAccess), across tests/python/ with test_adstack.py and test_ndarray.py excluded for measurement independence: 506 failing pre-series, 22 failing post-series. On test_adstack.py specifically this takes Vulkan from 12 failing -> fully green; on test_ndarray.py, from 3 failing -> fully green. The remaining 22 are all in test_scan.py -- a pre-series parallel-scan correctness issue unrelated to any of the six gaps here, left for a follow-up PR.

Side-effect audit

| Concern | Verdict |
| --- | --- |
| Integer ndarray / field access (#1) | Unchanged - integer branch of pick_buffer_access_type returns get_quadrants_uint_type(dt) as before |
| PSB pointer path in at_buffer (#1) | Unchanged - ptr_val.stype.dt == u64 branch preserved |
| Future real-like primitive (bf16 / fp8) (#1) | Explicit {f16, f32, f64} whitelist replaces is_real(dt) -- unknown reals fall into uint view |
| Struct / array PSB pointer ArrayStride (#2) | Not decorated - struct / array pointees carry full layout elsewhere |
| Non-PSB pointer types (#2) | Skipped - Uniform / StorageBuffer / Input / Output / Workgroup don't use PSB arithmetic |
| Single-view buffer performance (#3) | Undecorated - compiler scheduling freedom preserved |
| Double-decoration on revisit of a buffer view (#3) | Prevented via aliased_decorated_buffer_ids_ id-set |
| Devices without storageBuffer{8,16}BitAccess (#4) | SPIR-V cap not emitted - shaders remain acceptable |
| SPIR-V header without matching extension (#4) | Capability and extension always emitted together |
| Non-u1 stores (#5) | Unchanged - they take the else-branch OpBitcast path as before |
| u1 loads (#5) | Already correctly routed through IRBuilder::cast pre-fix |
| First-cycle validation-layer flip (#6) | Unchanged - the first init sees the same flip as before, just at a different line |
| Duplicate re-check inside the post-return block (#6) | Removed - would have been dead code now that the check is above |
| MoltenVK / Metal output | No-op for all six fixes |
| Offline cache key | SPIR-V blob differs, cache invalidates once on upgrade |
| LLVM / CUDA / AMDGPU codegen | Files outside quadrants/codegen/spirv/, quadrants/rhi/vulkan/, quadrants/inc/ untouched |

@hughperkins
Collaborator

Semi-orthogonal, but related: I wonder if we should start considering a CI that runs on a better GPU. If we want CI for Genesis, we will certainly need this. AMD GPU cloud does provide such GPUs I think. (or there is packet.ai, that @v01dXYZ discovered)

@hughperkins
Collaborator

Opus summary:

Summary

Fixes a SPIR-V codegen bug where plain loads/stores of float buffer elements could alias incorrectly with atomic float operations (OpAtomicFAddEXT) on the same memory,
causing reverse-mode autodiff to read stale values on Vulkan.
The fix changes load_buffer / store_buffer to access primitive float types through their native float view of the storage buffer, rather than the uint-punned view.
The view-selection logic is extracted into a new helper pick_buffer_access_type(dt, ptr_val, ir) shared by both functions.

Root cause

In SPIR-V / Vulkan, each (descriptor_set, binding) is a distinct variable. at_buffer creates a new binding per (buffer, element_type) pair, so the u32 view and the
f32 view of the same buffer are different variables aliasing the same memory. Without an Aliased decoration, the driver / SPIRV-Tools is free to assume they do not alias,
so an OpLoad through the u32 view is not ordered against a preceding OpAtomicFAddEXT through the f32 view at the same address.
The reverse-mode pattern

m.grad[i][j, k] += loss.grad
tmp = m.grad[i][j, k]
m.grad[i][j, k] = 0
n.grad += tmp * factor

hits this exactly: the load reads the stale zero initial value, tmp == 0, and the adjoint never propagates.
test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan] asserts 0.0 == 1.0 as a result.

Behavior matrix

| dt | Before | After |
| --- | --- | --- |
| f16 / f32 / f64 | uint view (u32) | native float view |
| i* / u* (≥ 8-bit) | uint view | uint view (unchanged) |
| u1 | u8 load / i8 store | u8 load / i8 store (unchanged) |
| 64-bit pointer path | dt directly | dt directly (unchanged) |

Good points

  • Targeted fix for a real, reproducible miscompile (autodiff returning 0 instead of the gradient on Vulkan).
  • Removes the aliasing question entirely rather than papering over it with Aliased decorations or memory barriers — plain load/store and the atomic now share a single
    binding.
  • Small, contained diff (one file, +30 / −11) that touches only the view-selection logic; the surrounding load/store machinery is unchanged.
  • Refactor improves readability: the previously duplicated chain of ifs in load_buffer and store_buffer is now a single pick_buffer_access_type helper, making it
    obvious that the two paths agree.
  • Existing carve-outs preserved: u1 still maps to u8/i8, and the u64 pointer path still uses dt directly, so no regressions on those code paths.
  • Documented: a substantial comment explains the SPIR-V aliasing model, why the bug occurred, and which test reproduces it — useful for future readers and for anyone
    tempted to "simplify" the helper.

Bad points / risks

  • Asymmetry between load and store for u1: load uses u8, store uses i8. This matches the previous behavior, but the helper does not encode it — store_buffer still
    has to override the helper's result with a special-case if for u1. A cleaner design would push the load/store distinction into the helper (or at least into a named
    constant) so the two sites can't drift.
  • No new test added. The fix relies on the existing test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan] to catch regressions; a more direct unit test
    of the SPIR-V output (e.g. checking that the load and the atomic resolve to the same binding) would harden against future refactors.
  • Possible compatibility surface for native float storage views. Switching f16 / f64 loads/stores to native views requires the corresponding SPIR-V capabilities
    (StorageBuffer16BitAccess, Float64, etc.) to be requested wherever those types are used. If any code path emits f16/f64 load/store without already requesting these
    capabilities, this change could surface a validation error on devices that previously worked via the u32 punning path. Worth confirming the capability-request logic covers
    all is_real(dt) cases.
  • Increased binding count. Because at_buffer allocates a binding per (buffer, element_type) pair, kernels that previously only used the u32 view for floats will now
    also allocate the native float binding. This is almost certainly negligible, but on drivers with tight descriptor limits it's a (very small) extra cost.
  • is_real(dt) is the trigger — if any non-IEEE "real-like" type is added later (e.g. bf16 with no native SPIR-V storage type), the helper will route it down the
    native-view path and break. A short assertion or whitelist (f16/f32/f64) would be safer than the open-ended predicate.
  • Fix is Vulkan/SPIR-V only. If the same aliasing pattern exists in other backends that pun float buffers through integer views, those remain unfixed; the PR doesn't claim
    to address them, but it's worth noting for triage.

@hughperkins
Collaborator

Most of the bad points surfaced by Opus above look like AI-flation to me. But I'm wondering your thoughts on the following observation? :

"Asymmetry between load and store for u1: load uses u8, store uses i8. This matches the previous behavior, but the helper does not encode it — store_buffer still
has to override the helper's result with a special-case if for u1. A cleaner design would push the load/store distinction into the helper (or at least into a named
constant) so the two sites can't drift."

Comment thread quadrants/codegen/spirv/spirv_codegen.cpp
@duburcqa force-pushed the duburcqa/fix_spirv_float_atomic_aliasing branch from cc6d8ea to 09976fc on April 24, 2026 07:27
@duburcqa changed the title from "[SPIRV] Use native float view in load/store_buffer to avoid aliasing with atomics" to "[SPIRV] Two Vulkan SPIR-V correctness fixes: native-float view for atomics + PSB pointer ArrayStride" on Apr 24, 2026
@duburcqa force-pushed the branch from ab96033 to 547a822 on April 24, 2026 08:30
@duburcqa changed the title from "[SPIRV] Two Vulkan SPIR-V correctness fixes: native-float view for atomics + PSB pointer ArrayStride" to "[SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB pointer stride, narrow-type storage caps" on Apr 24, 2026
@duburcqa changed the title from "[SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB pointer stride, narrow-type storage caps" to "[SPIRV] Vulkan SPIR-V correctness: atomic-view aliasing, PSB stride, narrow storage caps, u1 cast, per-init layer recheck" on Apr 24, 2026
@duburcqa
Contributor Author

@claude review


@claude (Bot) left a comment


My prior concern about cross-view aliasing on the CAS and non-add float atomic paths is resolved by fix #3 (the Aliased decoration on multi-view buffers in get_buffer_value) — that is the right fix. Given this now bundles 6 independent SPIR-V correctness fixes across codegen and the Vulkan device creator, it warrants a human look.

Extended reasoning...

Overview

The PR has grown from a single fix (route plain float loads/stores through the native float buffer view) to six bundled SPIR-V correctness fixes: (1) native float view for load/store; (2) ArrayStride decoration on PSB scalar/vector pointers; (3) Aliased decoration on every buffer that gets a second type view; (4) StorageBuffer{8,16}BitAccess capabilities plus SPV_KHR_{8,16}bit_storage extensions, gated on queried Vulkan feature bits; (5) route u1→i8 store widening through IRBuilder::cast (OpSelect) instead of spec-invalid OpBitcast %char %bool; (6) move validation-layer availability check above the cached-VkInstance short-circuit in create_instance. Touches spirv_codegen.cpp, spirv_ir_builder.cpp, the TaskCodegen header, rhi_constants.inc.h, and vulkan_device_creator.cpp.

Security risks

None that I can identify. These are correctness fixes in the SPIR-V code generator and the Vulkan device creator; they do not touch auth, crypto, permissions, or user-input boundaries. The narrow-type storage capabilities are only emitted when the underlying Vulkan device feature is queried true, so they cannot be forged into a shader that the driver would reject only at pipeline creation.

Level of scrutiny

High. SPIR-V codegen is production-critical — miscompiles here manifest as silent gradient corruption (fix #1 / #3), wrong ndarray reads (fix #2), pipeline-compiler crashes (fix #5), or spec-invalid shaders that validate only on lenient drivers (fix #4). The aliasing argument is subtle, and the decoration strategy in get_buffer_value relies on the invariant that every at_buffer call funnels through this path. No new tests are added — every fix leans on an existing regression. Worth a human with SPIR-V / Vulkan context walking through the aliasing argument and confirming the decoration site sweeps in every pairing (native-atomic vs CAS-emulation atomic vs plain load, per buffer, per type).

Other factors

My prior flagged bug has been directly addressed by fix #3, which is a meaningful progression. The bug hunting system found no new issues on this revision. The PR is labeled lowpri but the changes are not — they are correctness fixes with concrete reproducers documented in the description, which is exactly the kind of PR that benefits from a human signing off on the mechanism rather than shadow-approval.

@hughperkins
Collaborator

checklist:

  • bug fixes only => no user doc changes needed
  • claude review was requested and claude appears happy

=> ok to merge

@duburcqa force-pushed the duburcqa/fix_spirv_float_atomic_aliasing branch from 4876859 to 731c547 on April 24, 2026 12:36
duburcqa and others added 6 commits April 24, 2026 15:52
…AccessChain scales correctly

SPV_KHR_physical_storage_buffer requires an explicit `ArrayStride`
decoration on `OpTypePointer PhysicalStorageBuffer` when the pointer
is used with `OpPtrAccessChain`: the `Element` index is multiplied by
that stride to produce the byte offset from the base address. Without
the decoration the stride is undefined, and strict drivers collapse
every indexed access back to the base - every `arr[i]` read returns
`arr[0]` across the whole kernel, and every indexed ndarray write lands
on slot 0.

Narrow the fix to scalar/vector pointees. Struct and array pointees
already carry explicit layout decorations (each member's `Offset`,
array `ArrayStride`), so adding a top-level `ArrayStride` on the
pointer to those is redundant; for scalars/vectors the natural stride
is just the pointee's byte size. Pointers in Uniform / StorageBuffer /
Input / Output / Workgroup storage classes don't use PSB arithmetic at
all, so the decoration is skipped there.

This stacks on the load/store aliasing fix: `pick_buffer_access_type`
unblocks reverse-mode atomics on devices with
`shaderBufferFloat32AtomicAdd`; the stride decoration unblocks
indexed reads/writes through PSB on every Vulkan device. The two
together drop our Vulkan test failure count from 506 to 84 across
`tests/python/` (excluding the adstack / ndarray suites, which go
from 12 failing to fully green on their own).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ss-view accesses are ordered

Review feedback on the native-float-view fix: that fix only closes the
aliasing hazard when the atomic happens to use the native float view,
which is true only for `OpAtomicFAddEXT` on devices that advertise
`shaderBufferFloat32AtomicAdd` / `shaderBufferFloat64AtomicAdd` /
`shaderBufferFloat16AtomicAdd`. Three other paths still emit the
atomic through the u32-punned view while plain load/store go through
the native float view:

- CAS emulation on devices without the EXT (NVIDIA T4, our CI runner
  among them): `visit(AtomicOpStmt)` routes f32-add through
  `at_buffer(dest, get_quadrants_uint_type(dt))` and
  `atomic_operation` then runs `OpAtomicLoad` / `OpAtomicCompareExchange`
  on the u32 view. The plain load now on the float view is unordered
  against that CAS loop.
- Non-add float atomics (min / max / mul) on every device: these never
  go through `OpAtomicFAddEXT`, always take the CAS path, always bind
  to the u32 view.
- f16 / f64 add when `spirv_has_atomic_float16_add` /
  `spirv_has_atomic_float64_add` is not set: same CAS-on-u32 path.

Close all of them structurally with an `Aliased` decoration on every
buffer `OpVariable` that gets a second type view. `Aliased` is the
SPIR-V signal that accesses through a variable may touch the same
memory as accesses through another variable in the same storage class
-- the driver must therefore preserve ordering across views, which is
exactly what we need for the load-and-clear reverse-mode pattern to
read back a freshly-atomic-added gradient.

Decorate lazily: single-view buffers stay un-decorated so the compiler
can still apply cross-variable scheduling on them. The decoration is
applied when (and only when) a second distinct type view is minted,
and it covers the newly-minted view plus every pre-existing peer in
one sweep (tracked via `buffer_views_by_buffer_` +
`aliased_decorated_buffer_ids_` to avoid emitting the decoration
twice on the same id).

`test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan]`
still passes; the wider Vulkan sweep on `tests/python/` (with
`test_adstack.py` / `test_ndarray.py` excluded for independence) stays
at 83 failing / 1778 passing, same delta as the native-float-view fix
alone on this device. The decoration doesn't change the count on an
AMD RX 7900 XTX because that device exposes
`shaderBufferFloat32AtomicAdd` and hits the native-float-view path for
every failing case in the suite; the decoration's job is to keep the
CAS / non-add / f16-f64-no-native-add paths correct on devices that
don't.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… routing to an explicit float whitelist

Review feedback on the native-float-view fix:

- Switching `f16` loads / stores to the native `f16` descriptor view
  requires the `CapabilityStorageBuffer16BitAccess` SPIR-V capability
  (paired with the `SPV_KHR_16bit_storage` extension). That cap was
  never emitted by `spirv_ir_builder.cpp`, so routing `f16` native
  produced shaders that strict Vulkan drivers reject at pipeline
  creation -- and even for the pre-existing uint-punning path, every
  `i16` / `u16` load through a StorageBuffer already required the
  same capability on spec-strict drivers, so the gap was latent
  regardless. The same story applies to `i8` / `u8` and
  `CapabilityStorageBuffer8BitAccess`.
- The `is_real(dt)` predicate in `pick_buffer_access_type` is too
  open-ended: a future real-like primitive (bfloat16, an fp8 variant,
  anything else `is_real` eventually admits) would silently fall into
  the native-view branch before its storage-capability story has been
  audited.

Fix both in one commit:

1. Add `spirv_has_storage_buffer_{8,16}bit_access` device caps to
   `rhi_constants.inc.h`. Query them from the existing
   `VkPhysicalDevice{8,16}BitStorageFeatures` structs in
   `vulkan_device_creator.cpp`, strictly gated on
   `storageBuffer{8,16}BitAccess` feature bits (the Vulkan 1.2-core
   `VK_KHR_{8,16}bit_storage` promotion does not imply the feature is
   supported, matching the pattern already used for
   `bufferDeviceAddress`).
2. Emit `CapabilityStorageBuffer{8,16}BitAccess` + the matching
   `SPV_KHR_{8,16}bit_storage` extension in the SPIR-V header
   whenever the device caps are set. Unconditional relative to the
   current kernel's type use -- the header cost is one extra
   `OpCapability` + `OpExtension` per bit width, negligible against
   the benefit of spec-compliant narrow StorageBuffer access for every
   kernel that declares an `i8` / `i16` / `f16` field or ndarray.
3. Replace the `is_real(dt)` branch in `pick_buffer_access_type` with
   an explicit `{f16, f32, f64}` whitelist. The three primitive
   floats that exist today are the ones we've audited the
   storage-capability story for; anything else must be added here
   deliberately.

All existing Vulkan tests that exercised the uint-punned narrow
storage access (`test_ndarray.py` dtype parametrizations on `i8` /
`i16` / `f16`, etc.) retain their current behavior because the
capability emission is additive and gated on a feature the device
already exposes. The native-view routing scope shrinks from "every
real type" to "every real type we currently support", which
eliminates the silent-regression surface for future types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… accepts the cast chain

`TaskCodegen::store_buffer` widened a `u1` (bool) value to its backing
`i8` slot by emitting `OpBitcast %char %bool_val`. That's ill-formed
SPIR-V: `OpBitcast` requires a numerical scalar / vector or pointer
operand and rejects booleans (`OpTypeBool` has no defined bit pattern,
so a bitcast can't be defined either). `spirv-val` flagged the exact
message:

    Expected input to be a pointer or int or float vector or scalar: Bitcast
      %46 = OpBitcast %char %tmp3_u1

Most drivers just crash in the pipeline compiler on the ill-formed
input rather than surface a validation error. On Mesa RADV (AMD RX
7900 XTX) the symptom was a hard `SIGSEGV` deep inside
`libvulkan_radeon.so::create_compute_pipeline` the moment any kernel
storing to a `u1` field / ndarray / struct member was registered.
Python-side, that looked like `timeout: the monitored command dumped
core` on every one of the `test_tensor_consistency`,
`test_pickle[vulkan-u1]`, `test_matrixfree_{cg,bicgstab}`,
`test_struct_field_with_bool`, `test_dual_return_spirv`, and
`test_offload_cross` tests.

Route the `u1 -> i8` widening through `IRBuilder::cast`, which lowers
`bool -> int` to the canonical `OpSelect(cond, 1, 0)` with the 1 / 0
constants produced at the target integer type. That's the same route
`load_buffer`'s reverse path already uses on the read side, matches
what `IRBuilder::cast` does in every other `u1` context in the
codegen, and preserves the "`u1` serialises as 0 / 1" behaviour every
`to_numpy()` / `from_numpy()` user depends on. Non-`u1` stores keep
the existing `OpBitcast` path unchanged, so no other type path
changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or init, not just the first

`VulkanDeviceCreator::create_instance` normalises
`params_.enable_validation_layer` against
`check_validation_layer_support()` - flipping it to `false` when the
host has no `VK_LAYER_KHRONOS_validation` - and then later gates the
`spirv_has_non_semantic_info` cap and the logical-device layer array
on that flag. The normalisation used to live below the
cached-`VkInstance` short-circuit that the same function takes to
work around an NVIDIA driver bug around repeated
`vkDestroyInstance`/`vkCreateInstance` cycles (the instance is kept
alive in the `VulkanLoader` singleton for process lifetime). Every
re-init after the first one then read `params_.enable_validation_layer`
as `true` (whatever the caller passed in, usually `config.debug`),
jumped straight to the cached-instance return, and skipped the flip
entirely - even on a host where the first `create_instance` had
flipped it to `false`.

The visible symptom in `test_overflow.py` / `test_print.py` was the
first parametrisation passing (correct cap = 0) and every subsequent
one in the same pytest session failing (stale cap = 1). Each test's
`@test_utils.test(... debug=True)` wrapper calls `qd.reset()` +
`qd.init(...)` which reconstructs the `VulkanDeviceCreator` with fresh
params, so the first cycle flipped to `false`, but the second / third
/ ... cycles preserved `true`. The
`spirv_has_non_semantic_info` cap then got set, the
`NonSemantic.DebugPrintf` extinst was emitted in the shader, the
validation layer wasn't actually loaded to intercept the output, and
the `capfd`-based assertion against the `Addition overflow detected`
string fails with an empty capture buffer.

Move the layer-availability check above the cached-instance return.
Re-running `vkEnumerateInstanceLayerProperties` on every cycle is
microseconds and keeps the flag consistent whether the instance was
freshly created or reused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@duburcqa force-pushed the duburcqa/fix_spirv_float_atomic_aliasing branch from 731c547 to 9fa89c1 on April 24, 2026 13:52
@duburcqa merged commit 41b5086 into main on Apr 24, 2026
48 checks passed
@duburcqa deleted the duburcqa/fix_spirv_float_atomic_aliasing branch on April 24, 2026 15:09