Conversation
Semi-orthogonal, but related: I wonder if we should start considering a CI that runs on a better GPU. If we want CI for Genesis, we will certainly need this. AMD GPU cloud does provide such GPUs, I think (or there is packet.ai, which @v01dXYZ discovered).
Opus summary:

Summary

Fixes a SPIR-V codegen bug where plain loads/stores of float buffer elements could alias incorrectly with atomic float operations.

Root cause

In SPIR-V / Vulkan, each (buffer, element_type) pair gets its own view of the same underlying buffer, and a plain load through one view is not ordered against an atomic through another. The reverse-mode load-and-clear pattern

m.grad[i][j, k] += loss.grad
tmp = m.grad[i][j, k]
m.grad[i][j, k] = 0
n.grad += tmp * factor

hits this exactly: the load reads the stale zero initial value.

Behavior matrix

Good points

Bad points / risks
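The punning at the heart of the bug can be modelled in plain Python, outside any GPU: the CAS-emulated atomic add operates on the u32 bit pattern of the same four bytes that the float view loads. This is an illustrative sketch with invented helper names, not project code.

```python
import struct

def f32_to_u32(f):
    # Reinterpret the bits of a 32-bit float as an unsigned 32-bit int.
    return struct.unpack("<I", struct.pack("<f", f))[0]

def u32_to_f32(u):
    # The reverse reinterpretation: same bytes, float view.
    return struct.unpack("<f", struct.pack("<I", u))[0]

def cas_emulated_fadd(buffer, index, value):
    """Model of the OpAtomicCompareExchange-based float add on the u32 view.

    `buffer` is a list of u32 words; the "float view" of slot `index` is
    just the f32 reinterpretation of the same word.
    """
    while True:
        old_u32 = buffer[index]                      # OpAtomicLoad (u32 view)
        new_u32 = f32_to_u32(u32_to_f32(old_u32) + value)
        if buffer[index] == old_u32:                 # OpAtomicCompareExchange
            buffer[index] = new_u32
            return u32_to_f32(old_u32)

buf = [f32_to_u32(0.0)] * 4       # one buffer, seen through two "views"
cas_emulated_fadd(buf, 0, 1.5)    # atomic add through the u32 view
print(u32_to_f32(buf[0]))          # plain float-view load of the same word: 1.5
```

In Python the two views trivially see the same list, so the read is coherent; the bug is that SPIR-V drivers may treat the two views as distinct, non-aliasing variables unless told otherwise.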
Most of the bad points surfaced by Opus above look like AI-flation to me. But I'd like your thoughts on the following observation: "Asymmetry between load and store for u1: load uses u8, store uses i8. This matches the previous behavior, but the helper does not encode it — store_buffer still
Force-pushed cc6d8ea to 09976fc
Force-pushed ab96033 to 547a822
@claude review
My prior concern about cross-view aliasing on the CAS and non-add float atomic paths is resolved by fix #3 (the Aliased decoration on multi-view buffers in get_buffer_value) — that is the right fix. Given this now bundles 6 independent SPIR-V correctness fixes across codegen and the Vulkan device creator, it warrants a human look.
Extended reasoning...
Overview
The PR has grown from a single fix (route plain float loads/stores through the native float buffer view) to six bundled SPIR-V correctness fixes: (1) native float view for load/store; (2) ArrayStride decoration on PSB scalar/vector pointers; (3) Aliased decoration on every buffer that gets a second type view; (4) StorageBuffer{8,16}BitAccess capabilities plus SPV_KHR_{8,16}bit_storage extensions, gated on queried Vulkan feature bits; (5) route u1→i8 store widening through IRBuilder::cast (OpSelect) instead of spec-invalid OpBitcast %char %bool; (6) move validation-layer availability check above the cached-VkInstance short-circuit in create_instance. Touches spirv_codegen.cpp, spirv_ir_builder.cpp, the TaskCodegen header, rhi_constants.inc.h, and vulkan_device_creator.cpp.
Security risks
None that I can identify. These are correctness fixes in the SPIR-V code generator and the Vulkan device creator; they do not touch auth, crypto, permissions, or user-input boundaries. The narrow-type storage capabilities are only emitted when the underlying Vulkan device feature is queried true, so they cannot be forged into a shader that the driver would reject only at pipeline creation.
Level of scrutiny
High. SPIR-V codegen is production-critical — miscompiles here manifest as silent gradient corruption (fix #1 / #3), wrong ndarray reads (fix #2), pipeline-compiler crashes (fix #5), or spec-invalid shaders that validate only on lenient drivers (fix #4). The aliasing argument is subtle, and the decoration strategy in get_buffer_value relies on the invariant that every at_buffer call funnels through this path. No new tests are added — every fix leans on an existing regression. Worth a human with SPIR-V / Vulkan context walking through the aliasing argument and confirming the decoration site sweeps in every pairing (native-atomic vs CAS-emulation atomic vs plain load, per buffer, per type).
Other factors
My prior flagged bug has been directly addressed by fix #3, which is a meaningful progression. The bug hunting system found no new issues on this revision. The PR is labeled lowpri but the changes are not — they are correctness fixes with concrete reproducers documented in the description, which is exactly the kind of PR that benefits from a human signing off on the mechanism rather than shadow-approval.
checklist:
=> ok to merge
Force-pushed 4876859 to 731c547
…AccessChain scales correctly

SPV_KHR_physical_storage_buffer requires an explicit `ArrayStride` decoration on `OpTypePointer PhysicalStorageBuffer` when the pointer is used with `OpPtrAccessChain`: the `Element` index is multiplied by that stride to produce the byte offset from the base address. Without the decoration the stride is undefined, and strict drivers collapse every indexed access back to the base - every `arr[i]` read returns `arr[0]` across the whole kernel, and every indexed ndarray write lands on slot 0.

Narrow the fix to scalar/vector pointees. Struct and array pointees already carry explicit layout decorations (each member's `Offset`, array `ArrayStride`), so adding a top-level `ArrayStride` on the pointer to those is redundant; for scalars/vectors the natural stride is just the pointee's byte size. Pointers in Uniform / StorageBuffer / Input / Output / Workgroup storage classes don't use PSB arithmetic at all, so the decoration is skipped there.

This stacks on the load/store aliasing fix: `pick_buffer_access_type` unblocks reverse-mode atomics on devices with `shaderBufferFloat32AtomicAdd`; the stride decoration unblocks indexed reads/writes through PSB on every Vulkan device. The two together drop our Vulkan test failure count from 506 to 84 across `tests/python/` (excluding the adstack / ndarray suites, which go from 12 failing to fully green on their own).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
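A minimal model of the stride arithmetic (illustrative Python, not the codegen): `OpPtrAccessChain` computes the byte address as base + element * ArrayStride, so a pointer whose stride is left undefined (modelled here as 0) collapses every index onto the base element.

```python
def ptr_access_chain(base_addr, element, array_stride):
    # SPV_KHR_physical_storage_buffer: the Element operand is scaled by
    # the pointer type's ArrayStride to produce the byte offset.
    return base_addr + element * array_stride

F32_STRIDE = 4  # sizeof(float): the stride this commit decorates scalar pointers with

# Decorated pointer: each index lands on its own element.
addrs = [ptr_access_chain(0x1000, i, F32_STRIDE) for i in range(4)]
print([hex(a) for a in addrs])    # ['0x1000', '0x1004', '0x1008', '0x100c']

# Undefined stride, modelled as 0: every arr[i] collapses to arr[0].
collapsed = [ptr_access_chain(0x1000, i, 0) for i in range(4)]
print(len(set(collapsed)))         # 1
```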
…ss-view accesses are ordered

Review feedback on the native-float-view fix: that fix only closes the aliasing hazard when the atomic happens to use the native float view, which is true only for `OpAtomicFAddEXT` on devices that advertise `shaderBufferFloat32AtomicAdd` / `shaderBufferFloat64AtomicAdd` / `shaderBufferFloat16AtomicAdd`. Three other paths still emit the atomic through the u32-punned view while plain load/store go through the native float view:

- CAS emulation on devices without the EXT (NVIDIA T4, our CI runner among them): `visit(AtomicOpStmt)` routes f32-add through `at_buffer(dest, get_quadrants_uint_type(dt))` and `atomic_operation` then runs `OpAtomicLoad` / `OpAtomicCompareExchange` on the u32 view. The plain load, now on the float view, is unordered against that CAS loop.
- Non-add float atomics (min / max / mul) on every device: these never go through `OpAtomicFAddEXT`, always take the CAS path, always bind to the u32 view.
- f16 / f64 add when `spirv_has_atomic_float16_add` / `spirv_has_atomic_float64_add` is not set: same CAS-on-u32 path.

Close all of them structurally with an `Aliased` decoration on every buffer `OpVariable` that gets a second type view. `Aliased` is the SPIR-V signal that accesses through a variable may touch the same memory as accesses through another variable in the same storage class -- the driver must therefore preserve ordering across views, which is exactly what we need for the load-and-clear reverse-mode pattern to read back a freshly-atomic-added gradient.

Decorate lazily: single-view buffers stay un-decorated so the compiler can still apply cross-variable scheduling on them. The decoration is applied when (and only when) a second distinct type view is minted, and it covers the newly-minted view plus every pre-existing peer in one sweep (tracked via `buffer_views_by_buffer_` + `aliased_decorated_buffer_ids_` to avoid emitting the decoration twice on the same id).
`test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan]` still passes; the wider Vulkan sweep on `tests/python/` (with `test_adstack.py` / `test_ndarray.py` excluded for independence) stays at 83 failing / 1778 passing, the same delta as the native-float-view fix alone on this device. The decoration doesn't change the count on an AMD RX 7900 XTX because that device exposes `shaderBufferFloat32AtomicAdd` and hits the native-float-view path for every failing case in the suite; the decoration's job is to keep the CAS / non-add / f16-f64-no-native-add paths correct on devices that don't.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
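The lazy decoration bookkeeping described above can be sketched like this (a hypothetical Python rendition of the C++ tracking; the attribute names mirror `buffer_views_by_buffer_` / `aliased_decorated_buffer_ids_` but everything else is invented for illustration):

```python
class AliasedTracker:
    """Decorate a buffer's OpVariables with Aliased only once a second
    distinct type view of the same buffer is minted."""

    def __init__(self):
        self.views_by_buffer = {}   # buffer -> list of (view_type, var_id)
        self.decorated_ids = set()  # var ids already carrying Aliased
        self.decorations = []       # emitted (var_id, "Aliased") pairs

    def _decorate(self, var_id):
        if var_id not in self.decorated_ids:   # dedupe per variable id
            self.decorated_ids.add(var_id)
            self.decorations.append((var_id, "Aliased"))

    def mint_view(self, buffer, view_type, var_id):
        views = self.views_by_buffer.setdefault(buffer, [])
        views.append((view_type, var_id))
        if len(views) >= 2:
            # Second (or later) view: sweep in the new view plus every
            # pre-existing peer. Single-view buffers stay undecorated.
            for _, peer_id in views:
                self._decorate(peer_id)

t = AliasedTracker()
t.mint_view("grad_buf", "f32", 7)    # first view: no decoration yet
t.mint_view("grad_buf", "u32", 9)    # second view: both 7 and 9 decorated
t.mint_view("grad_buf", "u32v", 11)  # third view: only 11 newly decorated
print(sorted(t.decorated_ids))        # [7, 9, 11]
print(len(t.decorations))             # 3 (no duplicates)
```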
… routing to an explicit float whitelist
Review feedback on the native-float-view fix:
- Switching `f16` loads / stores to the native `f16` descriptor view
requires the `CapabilityStorageBuffer16BitAccess` SPIR-V capability
(paired with the `SPV_KHR_16bit_storage` extension). That cap was
never emitted by `spirv_ir_builder.cpp`, so routing `f16` native
produced shaders that strict Vulkan drivers reject at pipeline
creation -- and even for the pre-existing uint-punning path, every
`i16` / `u16` load through a StorageBuffer already required the
same capability on spec-strict drivers, so the gap was latent
regardless. The same story applies to `i8` / `u8` and
`CapabilityStorageBuffer8BitAccess`.
- The `is_real(dt)` predicate in `pick_buffer_access_type` is too
open-ended: a future real-like primitive (bfloat16, an fp8 variant,
anything else `is_real` eventually admits) would silently fall into
the native-view branch before its storage-capability story has been
audited.
Fix both in one commit:
1. Add `spirv_has_storage_buffer_{8,16}bit_access` device caps to
`rhi_constants.inc.h`. Query them from the existing
`VkPhysicalDevice{8,16}BitStorageFeatures` structs in
`vulkan_device_creator.cpp`, strictly gated on
`storageBuffer{8,16}BitAccess` feature bits (the Vulkan 1.2-core
`VK_KHR_{8,16}bit_storage` promotion does not imply the feature is
supported, matching the pattern already used for
`bufferDeviceAddress`).
2. Emit `CapabilityStorageBuffer{8,16}BitAccess` + the matching
`SPV_KHR_{8,16}bit_storage` extension in the SPIR-V header
whenever the device caps are set. Unconditional relative to the
current kernel's type use -- the header cost is one extra
`OpCapability` + `OpExtension` per bit width, negligible against
the benefit of spec-compliant narrow StorageBuffer access for every
kernel that declares an `i8` / `i16` / `f16` field or ndarray.
3. Replace the `is_real(dt)` branch in `pick_buffer_access_type` with
an explicit `{f16, f32, f64}` whitelist. The three primitive
floats that exist today are the ones we've audited the
storage-capability story for; anything else must be added here
deliberately.
All existing Vulkan tests that exercised the uint-punned narrow
storage access (`test_ndarray.py` dtype parametrizations on `i8` /
`i16` / `f16`, etc.) retain their current behavior because the
capability emission is additive and gated on a feature the device
already exposes. The native-view routing scope shrinks from "every
real type" to "every real type we currently support", which
eliminates the silent-regression surface for future types.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
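A sketch of the routing decision after this commit (illustrative Python; the real `pick_buffer_access_type` is C++, and the type names and mapping here are simplified stand-ins):

```python
# The three floats whose StorageBuffer capability story has been audited.
NATIVE_FLOAT_WHITELIST = {"f16", "f32", "f64"}

def pick_buffer_access_type(dt, quadrants_uint_type_of):
    """Route whitelisted floats to their native view; everything else,
    including future reals like bf16 / fp8, keeps the uint-punned view
    until it is deliberately added to the whitelist."""
    if dt in NATIVE_FLOAT_WHITELIST:
        return dt
    return quadrants_uint_type_of(dt)

# Hypothetical same-width uint mapping for the demo.
uint_of = {"f32": "u32", "i8": "u8", "u1": "u8", "bf16": "u16"}.get

print(pick_buffer_access_type("f32", uint_of))   # f32 (native view)
print(pick_buffer_access_type("bf16", uint_of))  # u16 (not audited: uint view)
```

The open-ended `is_real(dt)` predicate would have sent `bf16` down the native-view branch silently; the whitelist makes that an explicit opt-in.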
… accepts the cast chain
`TaskCodegen::store_buffer` widened a `u1` (bool) value to its backing
`i8` slot by emitting `OpBitcast %char %bool_val`. That's ill-formed
SPIR-V: `OpBitcast` requires a numerical scalar / vector or pointer
operand and rejects booleans (`OpTypeBool` has no defined bit pattern,
so a bitcast can't be defined either). `spirv-val` flagged the exact
message:
Expected input to be a pointer or int or float vector or scalar: Bitcast
%46 = OpBitcast %char %tmp3_u1
Most drivers just crash in the pipeline compiler on the ill-formed
input rather than surface a validation error. On Mesa RADV (AMD RX
7900 XTX) the symptom was a hard `SIGSEGV` deep inside
`libvulkan_radeon.so::create_compute_pipeline` the moment any kernel
storing to a `u1` field / ndarray / struct member was registered.
Python-side, that looked like `timeout: the monitored command dumped
core` on every one of the `test_tensor_consistency`,
`test_pickle[vulkan-u1]`, `test_matrixfree_{cg,bicgstab}`,
`test_struct_field_with_bool`, `test_dual_return_spirv`, and
`test_offload_cross` tests.
Route the `u1 -> i8` widening through `IRBuilder::cast`, which lowers
`bool -> int` to the canonical `OpSelect(cond, 1, 0)` with the 1 / 0
constants produced at the target integer type. That's the same route
`load_buffer`'s reverse path already uses on the read side, matches
what `IRBuilder::cast` does in every other `u1` context in the
codegen, and preserves the "`u1` serialises as 0 / 1" behaviour every
`to_numpy()` / `from_numpy()` user depends on. Non-`u1` stores keep
the existing `OpBitcast` path unchanged, so no other type path
changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
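The difference between the two lowerings, modelled in Python (a sketch; the real code emits SPIR-V instructions, not Python):

```python
def widen_u1_via_select(flag: bool) -> int:
    """Spec-valid bool -> i8 widening: the moral equivalent of
    OpSelect(cond, 1, 0) with the constants produced at the target
    integer type. Preserves the 'u1 serialises as 0 / 1' convention."""
    return 1 if flag else 0

# The removed path was the equivalent of reinterpreting the bool's bits
# (OpBitcast %char %bool). OpTypeBool has no defined bit pattern, so there
# is nothing valid to reinterpret: spirv-val rejects it, and RADV's
# pipeline compiler crashed on the ill-formed input.

print(widen_u1_via_select(True), widen_u1_via_select(False))  # 1 0
```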
…or init, not just the first

`VulkanDeviceCreator::create_instance` normalises `params_.enable_validation_layer` against `check_validation_layer_support()` - flipping it to `false` when the host has no `VK_LAYER_KHRONOS_validation` - and then later gates the `spirv_has_non_semantic_info` cap and the logical-device layer array on that flag.

The normalisation used to live below the cached-`VkInstance` short-circuit that the same function takes to work around an NVIDIA driver bug around repeated `vkDestroyInstance`/`vkCreateInstance` cycles (the instance is kept alive in the `VulkanLoader` singleton for process lifetime). Every re-init after the first one then read `params_.enable_validation_layer` as `true` (whatever the caller passed in, usually `config.debug`), jumped straight to the cached-instance return, and skipped the flip entirely - even on a host where the first `create_instance` had flipped it to `false`.

The visible symptom in `test_overflow.py` / `test_print.py` was the first parametrisation passing (correct cap = 0) and every subsequent one in the same pytest session failing (stale cap = 1). Each test's `@test_utils.test(... debug=True)` wrapper calls `qd.reset()` + `qd.init(...)`, which reconstructs the `VulkanDeviceCreator` with fresh params, so the first cycle flipped to `false`, but the second / third / ... cycles preserved `true`. The `spirv_has_non_semantic_info` cap then got set, the `NonSemantic.DebugPrintf` extinst was emitted in the shader, the validation layer wasn't actually loaded to intercept the output, and the `capfd`-based assertion against the `Addition overflow detected` string failed with an empty capture buffer.

Move the layer-availability check above the cached-instance return. Re-running `vkEnumerateInstanceLayerProperties` on every cycle is microseconds and keeps the flag consistent whether the instance was freshly created or reused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
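The fixed control flow can be sketched in Python (hypothetical names and shapes; the real function is C++ in `vulkan_device_creator.cpp`):

```python
def create_instance(params, cached_instance, layer_available):
    """Fixed ordering: normalise the validation flag on *every* init,
    then take the cached-instance short-circuit (the NVIDIA
    repeated-create workaround)."""
    # Stand-in for re-running vkEnumerateInstanceLayerProperties:
    # microseconds per init, and the flag stays consistent whether the
    # instance is fresh or reused.
    if params["enable_validation_layer"] and not layer_available():
        params["enable_validation_layer"] = False
    if cached_instance is not None:
        return cached_instance       # reuse; flag already normalised
    return object()                  # stands in for a fresh VkInstance

# Second init in the same process, on a host without the layer: the
# caller passed debug=True, but the flag is still flipped before the
# cached-instance return. The bug was taking that return first.
params = {"enable_validation_layer": True}
create_instance(params, cached_instance="cached", layer_available=lambda: False)
print(params["enable_validation_layer"])  # False
```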
Force-pushed 731c547 to 9fa89c1
Vulkan SPIR-V correctness: atomic-view aliasing, PSB pointer stride, narrow-type storage caps, u1 storage cast, per-init validation-layer re-check

TL;DR

Six independent correctness gaps, all with observable silent-corruption failure modes on Vulkan:

1. Plain float loads / stores went through the u32-punned view of the buffer instead of the native f32 view

`OpAtomicFAddEXT` has always used the native `f32` view. Plain `OpLoad` / `OpStore` went through the `u32` view. Each `(buffer, element_type)` pair gets its own `OpVariable` with a fresh `DescriptorSet` / `Binding`, so those two views are different SPIR-V variables pointing at the same `VkBuffer`. Without an `Aliased` decoration, a driver is free to assume they don't alias, so the plain load is not memory-ordered against the preceding atomic. Reverse-mode AD's load-and-clear pattern (`x.grad[i] += delta; tmp = x.grad[i]; x.grad[i] = 0; y.grad += tmp * factor`) reads the stale zero: silent gradient drop.

Originally discovered via `tests/python/test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan]` asserting `0.0 == 1.0`.

2. `OpTypePointer` in `PhysicalStorageBuffer` was missing `ArrayStride`

`OpPtrAccessChain` on a PSB pointer multiplies `Element` by `ArrayStride` to produce the byte offset. We were emitting bare pointers; strict drivers collapsed indexed access back to the base. Symptom: every `arr[i]` ndarray read returns `arr[0]`.

3. Type-punned views of the same buffer weren't flagged as mutually `Aliased`

Fix 1 closes the aliasing hazard only for the native-atomic-add hot path. CAS-emulation atomic on devices without `shaderBufferFloat32AtomicAdd`, non-add float atomics (min/max/mul), and `f16`/`f64`-without-native-add each route the atomic through `u32` while plain load / store are now on the native float view -- same cross-view aliasing, different pairing. Close all of them with an `Aliased` decoration on every buffer `OpVariable` that gets a second type view. Lazy: single-view buffers stay undecorated.

4. Narrow-type `StorageBuffer` access requires `StorageBuffer{8,16}BitAccess` capabilities that weren't emitted

The uint-punning path has always relied on `OpLoad` / `OpStore` through `u16` / `u8` descriptor-bound pointers. Per `SPV_KHR_16bit_storage` / `SPV_KHR_8bit_storage`, that needs `CapabilityStorageBuffer16BitAccess` / `CapabilityStorageBuffer8BitAccess` in the header. Neither capability nor its extension was emitted, so strict drivers were within their rights to reject every Quadrants shader touching an `i8` / `i16` / `f16` / `u8` / `u16` field. Also narrow the `is_real(dt)` predicate in `pick_buffer_access_type` to an explicit `{f16, f32, f64}` whitelist so a future `bfloat16` / `fp8` doesn't silently fall through.

5. Widening a `u1` value to its `i8` storage slot used `OpBitcast`, which is ill-formed on booleans

`TaskCodegen::store_buffer` widened `u1` through `OpBitcast %char %bool_val`. `spirv-val`: `Expected input to be a pointer or int or float vector or scalar: Bitcast`. Mesa RADV (AMD RX 7900 XTX) crashed deep inside `libvulkan_radeon.so::create_compute_pipeline` with a raw `SIGSEGV` the moment a `u1` field / ndarray / struct member store was registered. Surfaced as `timeout: the monitored command dumped core` across `test_tensor_consistency`, `test_pickle[vulkan-u1]`, `test_matrixfree_{cg,bicgstab}`, `test_struct_field_with_bool`, `test_dual_return_spirv`, and `test_offload_cross`.

Route the `u1 -> i8` widening through `IRBuilder::cast`, which lowers `bool -> int` to the canonical `OpSelect(cond, 1, 0)` at the target integer type. That's what the load side already does, and it preserves "`u1` serialises as 0 / 1" for `to_numpy()` / `from_numpy()`.

6. `VulkanDeviceCreator::create_instance` skipped the validation-layer re-check on re-init

The instance-reuse short-circuit (kept around an NVIDIA driver bug with repeated `vkDestroyInstance` / `vkCreateInstance`) used to return before `check_validation_layer_support()` ran, so every re-init after the first one read `params_.enable_validation_layer` as whatever the caller passed (`True` for `debug=True`) instead of the flipped-`False` reflecting the host's actual layer availability. Downstream, the `spirv_has_non_semantic_info` cap followed the stale `True`, the shader emitted `NonSemantic.DebugPrintf` extinsts with no validation layer loaded to route them, and every `capfd`-based assertion after the first parametrisation in the same pytest session failed with an empty capture buffer. `test_overflow.py::test_shl_overflow[arch=vulkan-ty0-6]` passing + `test_shl_overflow[arch=vulkan-ty1-7]` failing in the same `-n 1` session was the fingerprint.

Move the check above the cached-instance return. Re-running `vkEnumerateInstanceLayerProperties` on every cycle costs microseconds and keeps the flag consistent.

Why
Every fix is a spec-compliance gap, not a driver workaround:

- #1: the aliasing hazard is visible only on devices advertising `shaderBufferFloat32AtomicAdd`. CI's T4 falls back to CAS emulation, which matched the old `u32` plain-load path, so the aliasing gap didn't exist there and the bug has been latent since native-atomic lowering landed on non-T4 drivers.
- #5: dozens of existing tests were observed to segfault; the only reason this wasn't caught earlier is that Quadrants' other `u1` code paths (loads, casts, atomics) already went through `IRBuilder::cast` and didn't exercise the broken store widening.
- #6: every `debug=True` test that expects `DebugPrintf` output in `capfd` is at risk.

None of the six regresses any other backend. Metal's MoltenVK path goes through SPIRV-Cross -> MSL, which ignores the aliasing question, the `ArrayStride`, the narrow-storage caps, and the device-creator-path ordering. AMDGPU / CUDA / CPU don't use the SPIR-V codegen. Fix 5 is a bit-identical refactor on every other type path (only `u1` changes); fix 6 runs only inside `VulkanDeviceCreator`.

Mechanism
Fix 1: `pick_buffer_access_type` routes whitelisted float types through the native view

`quadrants/codegen/spirv/spirv_codegen.cpp`.

Fix 2: `get_pointer_type` emits `ArrayStride` for PSB scalar / vector pointees

`quadrants/codegen/spirv/spirv_ir_builder.cpp::get_pointer_type`. When `storage_class == PhysicalStorageBuffer` and the pointee is a primitive scalar / vector, decorate with `ArrayStride = sizeof(pointee)`. Struct / array pointees already carry full per-member layout, so no double-decoration. Non-PSB storage classes skipped.

Fix 3: `get_buffer_value` decorates multi-view buffer variables with `Aliased`

`quadrants/codegen/spirv/spirv_codegen.cpp::get_buffer_value`. Track a per-`BufferInfo` list of existing type views. When a second (or later) view is minted, retroactively decorate every peer with `Aliased`; dedupe via an id-set. Single-view buffers stay undecorated.

Fix 4: narrow-type storage caps gated on Vulkan-queried feature bits

Three-part change:
- `quadrants/inc/rhi_constants.inc.h`: add `spirv_has_storage_buffer_{8,16}bit_access` to the `DeviceCapability` enum.
- `quadrants/rhi/vulkan/vulkan_device_creator.cpp`: query `VkPhysicalDevice{8,16}BitStorageFeatures::storageBuffer{8,16}BitAccess` and set the device caps. Strictly gated on the feature bit (the Vulkan 1.2-core `VK_KHR_{8,16}bit_storage` promotion doesn't imply the feature is supported).
- `quadrants/codegen/spirv/spirv_ir_builder.cpp`: emit `CapabilityStorageBuffer{8,16}BitAccess` + the matching `SPV_KHR_{8,16}bit_storage` extension when the device caps are set.

Fix 5: `store_buffer` routes `u1 -> i8` through `IRBuilder::cast`

`quadrants/codegen/spirv/spirv_codegen.cpp::TaskCodegen::store_buffer`. Widened the three-way logic: `IRBuilder::cast(int, bool)` already emits `OpSelect(cond, int_immediate(1), int_immediate(0))` at the target type -- matches the spec-compliant `bool -> int` lowering, matches what `load_buffer` does on the reverse path, and keeps the serialisation convention.

Fix 6: `create_instance` runs the layer-availability check before the cached-instance return

`quadrants/rhi/vulkan/vulkan_device_creator.cpp::create_instance`. Moved the check from after the cached-`VkInstance` short-circuit to before it. Dropped the duplicate check that used to live lower in the function.

Per-backend coverage matrix
Observable failure mode per fix (load-side `u1` cast unaffected):
- #1: plain float load unordered against a preceding atomic on devices advertising `shaderBufferFloat32AtomicAdd`.
- #2: `arr[i] -> arr[0]` collapse on strict-PSB drivers.
- #4: shader rejection on strict `SPV_KHR_{8,16}bit_storage` drivers.
- #5: `u1` store pipeline crash (RADV `SIGSEGV`; spec-invalid `OpBitcast %bool`).
- #6: stale `spirv_has_non_semantic_info` on re-init across a pytest session.
- Load side: the `bool -> int` cast was already correct, so `u1` reads are unchanged.

Tests
No new tests added in this PR. Every fix is pinned by an existing regression whose assertion already covered the failure mode:
- #1: `tests/python/test_ad_dynamic_index.py::test_matrix_non_constant_index[arch=vulkan]`.
- #2: `tests/python/test_ndarray.py::test_ndarray_1d[arch=vulkan]` + every cross-suite test that does `arr[i]` indexed access.
- #4: `test_ndarray.py`'s dtype-parametrised suite on narrow primitives.
- #5: the segfaults in `test_tensor_consistency`, `test_pickle[vulkan-u1]`, `test_matrixfree_{cg,bicgstab}`, `test_struct_field_with_bool`, `test_dual_return_spirv`, `test_offload_cross` all disappear post-fix.
- #6: the `test_overflow.py` / `test_print.py` first-passes-then-fails pattern across a single pytest session.
shaderBufferFloat32AtomicAdd, strict PSB,storageBuffer{8,16}BitAccess), acrosstests/python/withtest_adstack.pyandtest_ndarray.pyexcluded for measurement independence: 506 failing pre-series, 22 failing post-series. Ontest_adstack.pyspecifically this takes Vulkan from 12 failing -> fully green; ontest_ndarray.py, from 3 failing -> fully green. The remaining 22 are all intest_scan.py-- a pre-series parallel-scan correctness issue unrelated to any of the six gaps here, left for a follow-up PR.Side-effect audit
pick_buffer_access_typereturnsget_quadrants_uint_type(dt)as beforeat_buffer(#1)ptr_val.stype.dt == u64branch preserved{f16, f32, f64}whitelist replacesis_real(dt)-- unknown reals fall into uint viewaliased_decorated_buffer_ids_id-setstorageBuffer{8,16}BitAccess(#4)u1stores (#5)OpBitcastpath as beforeu1loads (#5)IRBuilder::castpre-fixquadrants/codegen/spirv/,quadrants/rhi/vulkan/,quadrants/inc/untouched