Fix CUDA EP: opset 24 kernel registrations + CUTLASS alignment + MEA dispatch #28365
justinchuby wants to merge 1 commit into main
Conversation
ONNX opset 24 bumped Reshape (added float8e8m0 type) and Cast (added float8e8m0 type). ORT's CUDA EP only had registrations up to opset 23, causing these ops to fall back to CPUExecutionProvider on opset 24 models. This produced ~280 MemcpyFromHost/MemcpyToHost nodes that cascaded through the entire attention pipeline.

Fix: Version existing opset 23 registrations to (23, 23) and add new non-versioned opset 24 registrations for both Reshape and Cast. The kernel implementations are unchanged — only the registration metadata is updated.

Also fix CUTLASS FMHA BiasLoader alignment: use kAlignmentA instead of hardcoded 128-bit loads so the unaligned kernel path works with bias strides that are multiples of 4 elements (not 8).

Also fix MEA dispatch: skip MEA when head_size != v_head_size in GQA mode (LaunchUngroup and LaunchConcatNewToPastKV require matching dims).

Result: 282 memcpy → 4 memcpy for Gemma4 opset 24 CUDA EP model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
This PR updates the CUDA execution provider to handle ONNX opset 24 Cast/Reshape on GPU, adjusts CUTLASS FMHA bias loading alignment, and tightens MEA eligibility checks for unsupported GQA head-size combinations.
Changes:
- Add CUDA kernel registrations for opset 24 `Reshape` and `Cast`, while narrowing the previous opset 23 registrations to 23..23.
- Change CUTLASS FMHA `BiasLoader` to use `kAlignmentA` instead of a hardcoded 128-bit load width.
- Add an MEA guard for GQA cases where `head_size != v_head_size`, and add a stride-alignment check before using MEA with attention bias.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/tensor/reshape.cc | Adds a dedicated opset 24 CUDA Reshape registration and versions the prior one to opset 23 only. |
| onnxruntime/core/providers/cuda/tensor/cast_op.cc | Splits Cast kernel registration macros into opset 23-only and opset 24+ forms. |
| onnxruntime/core/providers/cuda/llm/attention.cc | Tightens MEA dispatch conditions for GQA and bias-stride alignment. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers the new opset 24 Cast/Reshape kernel class entries in the CUDA EP registry. |
| onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/kernel_forward.h | Switches FMHA bias loading alignment from a hardcoded vector width to kAlignmentA. |
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCudaExecutionProvider, kOnnxDomain, 24, Float8E4M3FN, Cast);
class ONNX_OPERATOR_VERSIONED_TYPED_KERNEL_CLASS_NAME(kCudaExecutionProvider, kOnnxDomain, 23, 23, Float8E4M3FN, Cast);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCudaExecutionProvider, kOnnxDomain, 24, Float8E5M2, Cast);
class ONNX_OPERATOR_VERSIONED_TYPED_KERNEL_CLASS_NAME(kCudaExecutionProvider, kOnnxDomain, 23, 23, Float8E5M2, Cast);
if (mea_eligible && attn_mask != nullptr) {
  // NOTE: CUTLASS uses kMinimumAlignment = 4 (elements, not bytes) for the bias
  // pointer in its epilogue. total_sequence_length is the bias row stride in elements,
  // so we check alignment in element count. The contrib_ops convention (4 * sizeof(T))
  // conflates bytes with elements; we use the correct value of 4 elements here.
  // Note: on SM50/53 (Maxwell), CUTLASS kMinimumAlignment=1, so this is stricter than
  // necessary — cases with odd total_sequence_length that previously used MEA on those
  // GPUs will now fall to unfused. This is acceptable for these very old architectures.
  constexpr int min_bias_align = 4;
  if (parameters.total_sequence_length % min_bias_align != 0) {
    mea_eligible = false;
    24,
    kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .TypeConstraint("T", DataTypeImpl::AllFixedSizeTensorTypesIRv9())
        .TypeConstraint("shape", DataTypeImpl::GetTensorType<int64_t>())
        .Alias(0, 0)
        .InputMemoryType(OrtMemTypeCPUInput, 1),
    Reshape);
titaiwangms left a comment
Thanks for tackling all three of these — the BiasLoader fix in particular is one we'd been investigating from the host-dispatch side and we think this PR has it landing in the right place. A few notes from a parallel work-stream we have on the CUDA Attention dispatch:
1. BiasLoader → kAlignmentA is the correct root-cause fix ✅
onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/kernel_forward.h:476-489 — replacing the hardcoded 128 / cutlass::sizeof_bits<scalar_t>::value with kAlignmentA is exactly right. The previous code was inconsistent with the rest of the kernel: when kIsAligned == false, Q/K/V loaders use kAlignmentA = 4 (fp16) but the BiasLoader still issued 128-bit (8-element) vectorized loads. That mismatch is the root cause of the cudaErrorMisalignedAddress you're seeing.
The subtle piece worth calling out for future readers: has_memory_efficient_attention() (memory_efficient_attention.h:68-74) already enforces (qk_head_size & 7) == 0, so the BiasLoader's per-row stride argument is not head_size-driven — it's bias_strideM = total_sequence_length. A model with a perfectly fine head_size = 64 but total_kv_length = 12 (multiple of 4, not 8) hits MEA today and is exactly the case this fix unblocks. We'd been working around it with a host-side bail in attention.cc:1395-1407 (if (total_sequence_length % 4 != 0) mea_eligible = false;); after this PR lands, that bail can be relaxed in a follow-up — DispatchIsAligned will now correctly route those cases through the unaligned kernel path with safe 4-element loads.
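To make the stride arithmetic above concrete, here is a tiny standalone sketch (assuming 2-byte fp16 elements; the loop values and names are illustrative, not taken from the kernel) showing why a bias row stride that is a multiple of 4 but not 8 satisfies 64-bit loads but not 128-bit loads:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  constexpr std::size_t kElemBytes = 2;           // assumption: fp16 (2 bytes per element)
  const int strides[] = {8, 12, 16, 20, 24, 28};  // bias row stride in elements (total_kv_length)
  for (int stride_m : strides) {
    // Byte offset at which row 1 of the bias starts.
    const std::size_t row1_offset = static_cast<std::size_t>(stride_m) * kElemBytes;
    std::printf("strideM=%2d  row1 offset=%2zu B  128-bit load ok=%d  64-bit load ok=%d\n",
                stride_m, row1_offset,
                static_cast<int>(row1_offset % 16 == 0),
                static_cast<int>(row1_offset % 8 == 0));
  }
  return 0;
}
```

strideM = 12 gives a row-1 offset of 24 bytes: 8-byte aligned, so a 64-bit (4-element) load is fine, but a 128-bit (8-element) load is not — exactly the case this fix unblocks.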
2. Suggested isolated regression test for the BiasLoader change
The "All sequence lengths 1-32 pass with GQA + mask" testing line covers the symptom end-to-end, but it would be nice to have one targeted unit test that pins this specific failure mode so a future refactor can't silently re-introduce it:
- `is_gqa = True`, `head_size = v_head_size = 64` (passes `& 7 == 0`, exercises the MEA path),
- attention mask present (so bias is non-null and `bias_strideM = total_kv_length` is what BiasLoader sees),
- `total_kv_length ∈ {12, 20, 28}` — multiples of 4 but not 8 — driven explicitly so the test fails on the old kernel and passes on the new one.
Without that, the regression coverage relies on Gemma4-shaped models hitting the case incidentally.
3. Heads-up on the attention.cc MEA dispatch guard hunk
The diff at attention.cc:1378-1383 (the new (!is_gqa || head_size == v_head_size) clause + the rewritten LaunchConcatNewToPastKV / LaunchUngroup comment) appears to be bit-identical to your already-merged #28358 (commit 1f25783). Looks like this branch was authored on a pre-#28358 base; on rebase to current main that hunk should fully collapse to no-op, leaving just the comment-text choice. Probably a one-line conflict worth resolving in your favour (the new wording is slightly more informative).
4. Cross-reference with our parallel work
We have an in-flight PR-2 on our fork (titaiwangms/onnxruntime) that adds a host-side head_size % 4 == 0 gate to the same MEA-eligibility predicate. Both of our internal reviewers independently flagged that gate as already redundant today (because (qk_head_size & 7) == 0 strictly implies % 4 == 0), so we've been keeping it strictly as documented defense-in-depth in case the upstream has_memory_efficient_attention & 7 check is ever loosened. Once #28365 lands, the semantic story for that gate gets cleaner: BiasLoader becomes correct on its own, and the only remaining % 4 dependency in the GQA path is LaunchUngroup's float2 reinterpret at attention.cc:721-724 — which is itself unreachable today via the same & 7 chain. Net effect for our PR: no rework, just a comment update to point readers at this PR as the authoritative fix.
Minor
- The PR description's bullet (4) mentions "Remove ReduceSum empty tensor assert", but I don't see that change in the current diff (5 files, all in CUDA EP / CUTLASS). Worth either restoring the change or trimming the description.
- The opset-24 Cast/Reshape registration churn is mechanical but a big LOC chunk; if reviewers want easier diffing, splitting it into a separate PR from the BiasLoader + MEA-dispatch attention work would help bisection later. Totally optional — happy to see all three in one if that's faster.
Nice work overall — this is a much better fix than working around it on the host side. 🙏
Cast, \
kOnnxDomain, \
23, \
24, \
This looks correct to me. Would you want to bump it to the latest? Or is 24 the latest?
GPU doc needs to be updated. You can download the doc from ADO artifacts and replace it.
…coverage

Addresses microsoft#28351 sub-items REG, HS4, 1c, 1e:

* HS4 (production): Add (head_size % 4 == 0) clause to the MEA dispatch predicate at core/providers/cuda/llm/attention.cc as forward-looking defense-in-depth. The clause is REDUNDANT TODAY: has_memory_efficient_attention already enforces (qk_head_size & 7) == 0 (i.e. % 8) at contrib_ops/cuda/bert/cutlass_fmha/memory_efficient_attention.h:71-72, which strictly implies % 4. We are not closing a current bug. The clause is kept as the strictest dtype-agnostic alignment floor that CUTLASS FMHA's BiasLoader actually requires (BiasLoader hardcodes a 128-bit / sizeof_bits-element alignment on Q/bias loads — 4 elements for fp32). Once microsoft#28365 lands and BiasLoader switches to kAlignmentA / DispatchIsAligned, MEA's own % 8 invariant will be loosened and this clause becomes load-bearing, preventing a correctness regression. The new comment block at the predicate site cites microsoft#28365 so the next maintainer can identify the right moment to delete it.
* REG (test): TestONNXAttentionGQAAsymmetricHeadSize pins the asymmetric v_head_size != head_size GQA path on fp16 and bf16 to guard against regression of the microsoft#28358 fix that removed the LaunchUngroup head_size == v_head_size ENFORCE.
* HS4 (test): TestONNXAttentionGQAHeadSizeMod4 sweeps head_size in {6, 10, 12, 16, 24}. Today head_sizes 6/10/12 are filtered upstream by MEA's % 8 gate and take the unfused fall-through path; this test pins that the fall-through stays numerically correct. 16/24 satisfy both % 8 and % 4 and exercise the MEA happy path. Once microsoft#28365 relaxes MEA's % 8 invariant, head_sizes 6/10 will start exercising the HS4 host-side gate directly.
* 1c (test): TestONNXAttentionGQAOutputQK pins the GQA + qk_matmul_output_mode combination (kQK raw scaled QK output) which previously had no test coverage. Threads an optional output_qk parameter through common.py's create_attention_graph_prompt and attention_prompt_func.
* 1e (test): TestONNXAttentionGQASoftcapFloat32 pins the fp32 + softcap + GQA combination (symmetric and asymmetric V head). fp32 GQA always falls through to the unfused path on CUDA; existing softcap tests are fp16/bf16, so the fp32 unfused softcap branch had no parity coverage.

All 10 new tests pass on H100 (sm_90a). Full test_gqa.py: 91 passed, 3 pre-existing flakes ('Output mismatch between two runs' determinism checks in unrelated decode-flash classes — not regressions from this change).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms left a comment
Following up after digging into the BiasLoader history — sharing what I found because it actually strengthens the case for this PR rather than complicating it.
Part 1 — Endorsement update with credit to the original design
The current host-side architecture around the unaligned MEA path was deliberate. PR #27542 (commits 6c5009286d and e0760a85d5, by @titaiwangms) introduced a two-level fall-through specifically aimed at letting bias_strideM % 4 == 0 suffice in the unaligned kernel path:
- `fmha_launch_template.h:256-280` — strengthened `DispatchIsAligned` to check all three bias strides (`strideM`, `strideH`, `strideB`) against `AlignedAK::kAlignmentQ`, intentionally routing inputs that fail `% 8` (aligned path) but pass `% 4` to `kIsAligned=false`.
- `attention.cc:1395-1407` — host bail at `min_bias_align = 4`, mirroring `kAlignmentQ = GemmType::kMinimumAlignment = 4` for the unaligned template (per the inline comment at `fmha_launch_template.h:268`: "allowing strides like 12 that the unaligned kernel handles").
The intent was right; the kernel just didn't honor it. Inside AttentionKernel, two alignment surfaces disagreed:
| Surface | Where | Unaligned-path requirement on bias_strideM |
|---|---|---|
| check_supported assertion | kernel_forward.h:653 (`p.bias_strideM % kAlignmentQ == 0`) | 4 elements (template-driven) |
| BiasLoader actual ld.global | kernel_forward.h:480-485 (pre-PR: `128 / sizeof_bits<scalar_t>::value`) | 8 elements (hardcoded 16-byte load) |
So configurations like strideM = 12, fp16, SM80 slipped through every host gate and check_supported, then OOB'd inside BiasLoader on row 1 (offset = 24 bytes, 8-byte aligned, not 16-byte aligned) → cudaErrorMisalignedAddress. The original test suite (kv_sequence_length ∈ {1, 4, 6} across the Attention4DAttnMask* family in test/providers/cpu/llm/attention_op_test.cc) never hit the %4-but-not-%8 window, which is why the gap stayed latent.
PR #28365's BiasLoader → kAlignmentA change doesn't change the design intent — it completes it. Concretely:
- Aligned path (`kIsAligned=true` → `kAlignmentA = DefaultConfig::kAlignmentA = 8`): identical 128-bit loads, no perf regression on the hot path.
- Unaligned path (`kIsAligned=false` → `kAlignmentA = kMinimumAlignment = 4`): 64-bit loads, finally matches the rest of the unaligned kernel's alignment story for Q/K/V, and `strideM = 12` works as originally intended.
- The host gate at `attention.cc:1395-1407` and the strengthened `DispatchIsAligned` checks transition from too loose (let through crashes) to exactly tight. They're both still semantically correct, so I'd suggest keeping them as documented defense-in-depth in this PR; the bail can be relaxed to a comment in a follow-up if desired.
Net: this is a strict improvement over the prior state, and a good place to land the fix. 👍
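A minimal sketch of the load-width selection the bullets above describe (assumed fp16 element width; `BiasAccessSketch` is an illustrative stand-in, not the actual CUTLASS iterator wiring):

```cpp
#include <cstdio>

// kIsAligned mirrors the dispatch decision; the constants are the fp16 values
// quoted above (DefaultConfig::kAlignmentA = 8, kMinimumAlignment = 4).
template <bool kIsAligned>
struct BiasAccessSketch {
  static constexpr int kAlignmentA = kIsAligned ? 8 : 4;  // elements per access
  static constexpr int kAccessBits = kAlignmentA * 16;    // fp16 = 16 bits per element
};

int main() {
  std::printf("aligned path:   %d-element (%d-bit) bias loads\n",
              BiasAccessSketch<true>::kAlignmentA, BiasAccessSketch<true>::kAccessBits);
  std::printf("unaligned path: %d-element (%d-bit) bias loads\n",
              BiasAccessSketch<false>::kAlignmentA, BiasAccessSketch<false>::kAccessBits);
  return 0;
}
```

The point is simply that one alignment constant drives both the compile-time access width and any host-side checks, which is what keeps the two surfaces from drifting apart again.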
Part 2 — Suggested regression test
There's one observable-coverage gap worth closing in the same PR. The Gemma4 end-to-end run is a great smoke signal, but nothing in attention_op_test.cc will catch a future refactor that silently re-introduces the BiasLoader/check_supported mismatch. Suggest something like Attention4DAttnMask_BiasStrideMod4_NotMod8, pinning the exact failure window:
- `kv_sequence_length ∈ {12, 20, 28}` — `% 4 == 0` but `% 8 != 0` (the only window the new BiasLoader template actually changes behavior for)
- `attn_mask` present (so it materializes as bias and goes through `BiasLoader`)
- `head_size = v_head_size = 64` (already `% 8`, so the test isolates the bias_strideM axis from any Q/K/V alignment story)
- fp16, SM75+ (or whatever `MIN_FMHA_*` defaults the suite already uses)
- Both a non-GQA and a GQA variant if it's cheap (the GQA variant additionally exercises the path your `attention.cc` MEA-dispatch hunk touches)
Expected behavior: pre-fix asserts cudaErrorMisalignedAddress; post-fix matches the unfused CUDA (or CPU) reference within fp16 tolerance.
Without this, the only thing standing between a future maintainer and a re-regression is the BiasLoader template parameter staying at kAlignmentA — which would be easy to revert by accident during the next CUTLASS bump (the upstream xformers reference still uses the hardcoded form).
Thanks again @justinchuby — and credit to @titaiwangms for the original design that this PR finally lets reach its intended behavior.
Four targeted follow-ups to the bot review on microsoft#28371. No production behaviour change beyond the comment text and a Python helper guard.

* core/providers/cuda/llm/attention.cc — soften the HS4 deletion criterion. The original 'delete when MEA no longer requires the %8 invariant' is necessary but not sufficient: removing the clause would also need every other host-side gate to keep head_size < 4 out of LaunchUngroup, which still ORT_ENFORCEs head_size % 4 == 0 internally (see ~line 723-724). Cite microsoft#28365 and the LaunchUngroup ENFORCE site explicitly.
* test/python/transformers/test_onnx_attention/common.py — fix output_qk negative-mode bug. Helper used 'output_qk is not None' to gate the optional 4th output, but a caller mirroring the C++ enum convention (kNone = -1 in attention_parameters.h) would pass -1 and silently get the 4th output bound + the unfused CUDA kernel populating it as raw-QK. Tighten the gate to '>= 0' across all three sites (graph node, output binding, return tuple) and update the prominent NOTE block + docstrings to spell out the convention.
* test/python/transformers/test_onnx_attention/test_gqa.py — add test_gqa_softcap_fp32_with_mask_ordering_{symmetric,asymmetric_v_head} to TestONNXAttentionGQASoftcapFloat32. The existing fp32 softcap cases passed attn_mask=None, so they could not detect a wrong softcap-vs-mask order on the unfused fp32 path (without a mask the two orders are arithmetically identical). The new tests use the same poison-V pattern as the fp16/bf16 P1 ordering guards (small softcap, V=1000 in masked slot, attn_mask=-inf there) so wrong ordering produces wild magnitudes / NaN and right ordering yields bounded finite values. Compare against attention_ref().
* test/python/transformers/test_onnx_attention/test_gqa.py — correct the TestONNXAttentionGQAOutputQK docstring. The fp16/bf16 restriction applies only to the MEA LaunchUngroup helper, not the entire GQA-on-CUDA surface; the unfused fall-through DOES support fp32, exercised by TestONNXAttentionGQASoftcapFloat32 in the same file.

All four fixes verified locally:
- 96/96 in test_onnx_attention/test_gqa.py pass (PR-2 build with HS4 dispatch + GQA-fp32 MEA exclusion).
- 4/4 TestONNXAttentionGQASoftcapFloat32 pass (2 existing + 2 new masked ordering tests).
- The P1-style ordering guard test_gqa_large_head_unfused_softcap_additive_mask_poison_fp16 still passes; the common.py output_qk change does not affect paths that don't request output_qk.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms left a comment
Heads-up on a follow-on consideration once this lands, raised by both the second copilot-pull-request-reviewer pass on our PR #28371 and our own internal review of the host-side gate we briefly added there — flagging here because this PR is the architecturally correct home for the fix.
The concern (dtype-aware alignment after #28365):
BiasLoader (onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/kernel_forward.h:480, parametrized on kAlignmentA at lines 435-436 by this PR) performs a 16-byte-wide load per row. The minimum head_size (= inner stride) that satisfies that load is therefore dtype-dependent:
- fp32 → 4 elements (4 × 4 B = 16 B) → `head_size % 4 == 0`
- fp16 / bf16 → 8 elements (8 × 2 B = 16 B) → `head_size % 8 == 0`
- generalized: `head_size % (16 / sizeof(T)) == 0`
Today the upstream MEA gate at onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/memory_efficient_attention.h:71 enforces (qk_head_size & 7) == 0 unconditionally, which strictly implies the fp16/bf16 requirement and over-covers fp32. This PR's kAlignmentA switch correctly relaxes the BiasLoader side; the natural follow-up is that the host-side &7 invariant becomes an over-restriction for fp32 (no harm, just leaves perf on the table) and an exactly-tight requirement for fp16/bf16 (must stay).
If/when memory_efficient_attention.h:71 is loosened to follow kAlignmentA, the right replacement isn't a single integer constant — it's a dtype-templated floor (head_size % (16 / sizeof(T)) == 0) so fp16/bf16 don't silently get admitted with head_size = 4. A small MinMEAHeadSizeAlignment<T>() helper next to the kernel traits would keep one source of truth between the kernel-side kAlignmentA and the host-side gate.
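A hedged sketch of what such a helper could look like (`MinMEAHeadSizeAlignment` and `HeadSizeMeetsMEAAlignment` are hypothetical names following the suggestion above, not existing ORT functions; `std::uint16_t` stands in for a 2-byte fp16/bf16 element type):

```cpp
#include <cstdint>

// Dtype-templated floor: one 16-byte bias load per row => 16 / sizeof(T) elements.
template <typename T>
constexpr int MinMEAHeadSizeAlignment() {
  static_assert(16 % sizeof(T) == 0, "element size must divide 16 bytes");
  return static_cast<int>(16 / sizeof(T));
}

// Hypothetical host-side gate built on the same single source of truth.
template <typename T>
constexpr bool HeadSizeMeetsMEAAlignment(int head_size) {
  return head_size % MinMEAHeadSizeAlignment<T>() == 0;
}

static_assert(MinMEAHeadSizeAlignment<float>() == 4, "fp32 floor: 4 elements");
static_assert(MinMEAHeadSizeAlignment<std::uint16_t>() == 8, "2-byte element floor: 8 elements");
static_assert(HeadSizeMeetsMEAAlignment<float>(12), "head_size 12 passes for fp32");
static_assert(!HeadSizeMeetsMEAAlignment<std::uint16_t>(12), "but not for fp16/bf16-sized elements");

int main() { return 0; }
```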
Why we're flagging this here, not in PR #28371:
We initially considered adding a host-side head_size % 4 == 0 guard in PR #28371 as forward-looking defense-in-depth. On review (internal multi-reviewer + the bot), we concluded %4 is dtype-incorrect for fp16/bf16 and that the right home for the alignment floor is this PR — the one that actually owns the BiasLoader-alignment relaxation. We've dropped the guard from #28371. No action required from you on #28371's side.
Thanks again for the root-cause fix here, @justinchuby — strictly an architectural heads-up for whatever follow-up loosens memory_efficient_attention.h:71 to match.
… tighten output_qk validation

Three substantive items + one docstring fix from round-2 reviewer feedback (bot + internal multi-reviewer consolidation).

* core/providers/cuda/llm/attention.cc — drop the host-side `head_size % 4 == 0` HS4 clause from the MEA-eligibility predicate and remove its multi-paragraph comment block. The clause is fully redundant today (`has_memory_efficient_attention` already requires `(qk_head_size & 7) == 0`, which strictly implies %4) and the comment it carried made dtype-aware alignment claims that are wrong for fp16 / bf16 (BiasLoader needs an 8-element stride, not 4, for those dtypes). The dtype-aware alignment floor properly belongs in the BiasLoader fix (microsoft#28365), not as a vestigial redundant clause here. Predicate is now exactly the upstream/main shape for HS4 purposes.
* test/python/transformers/test_onnx_attention/test_gqa.py — delete TestONNXAttentionGQAHeadSizeMod4. With the HS4 clause gone there is no host-side gate left to validate; the parameterized sweep was exercising routing equivalence vs the unfused fall-through, which is already covered by the broader MEA / unfused tests.
* test/python/transformers/test_onnx_attention/common.py — tighten the output_qk parameter validation to `output_qk in {0, 1, 2, 3}` or `None`. The previous `is not None and >= 0` guard caught the C++ `kNone = -1` sentinel but still silently accepted invalid modes 4 / 5, which would build an ONNX graph with an out-of-range `qk_matmul_output_mode` attribute and let the unfused CUDA kernel populate the 4th output as raw-QK regardless. Validation now raises immediately with a clear message at the helper boundary; the binding-allocation site downstream is simplified to `is not None` since validation has already happened. NOTE block + both helper docstrings updated to spell out the contract: `None` = disabled; `{0,1,2,3}` = the corresponding QKMatMulOutputMode; anything else raises.
* test/python/transformers/test_onnx_attention/test_gqa.py — fix the TestONNXAttentionGQAAsymmetricHeadSize docstring. The pre-microsoft#28358 `head_size == v_head_size` ENFORCE in LaunchUngroup is an MEA-path enforcement (LaunchUngroup is the GQA head-expansion helper used by MEA before its FMHA kernel), not an unfused-path one. Docstring now correctly attributes it.

Verified locally on the PR-2 build (build_pr2/, sm_90a single-arch):
- All targeted PR-2 + ordering-guard tests pass (8/8): the existing OutputQK / SoftcapFloat32 / AsymmetricHeadSize / LargeHeadUnfused poison ordering guard, plus the 2 masked fp32 ordering tests added in the previous fix-up.
- test_onnx_attention/test_gqa.py: 89/91 pass on a quiet GPU. The 2 transient failures (FloatMaskDecode, MEAGQASoftcap softcap+mask decode) both pass cleanly when re-run in isolation; they are pre-existing run-to-run flakes (rtol=0/atol=0 strict-equality asserts) under shared-GPU pressure, not caused by this commit.
- HS4 sweep class is gone (file count dropped from 96 to 91; the delta is exactly the 5 parameterized HS4 sweep cases, as expected).
- Manual negative test of the new validation: output_qk=None -> 3 outputs (disabled, OK); output_qk=2 -> 4 outputs (kQKSoftCap, OK); output_qk=-1 -> AssertionError (OK); output_qk=4 -> AssertionError (OK); output_qk=5 -> AssertionError (OK).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Fixes three CUDA EP issues affecting opset 24 models (e.g. Gemma4):
1. Missing opset 24 CUDA kernel registrations for Reshape and Cast
ONNX opset 24 bumped Reshape and Cast (added `float8e8m0` type). ORT CUDA EP only had opset 23 registrations, causing these ops to fall to CPUExecutionProvider on opset 24 models. This produced ~280 MemcpyFromHost/MemcpyToHost nodes that cascaded through the entire model.

Fix: Version opset 23 registrations to (23, 23) and add non-versioned opset 24 registrations.
Result: 282 memcpy → 4 memcpy for Gemma4 CUDA EP model.
2. CUTLASS FMHA BiasLoader alignment
BiasLoader hardcoded 128-bit vectorized loads regardless of the `isAligned` template parameter, causing `cudaErrorMisalignedAddress` when the attention bias stride (total_sequence_length) was not a multiple of 8 elements.

Fix: Use `kAlignmentA` (4 for unaligned, 8 for aligned) instead of the hardcoded 8.

3. MEA dispatch guard for GQA with mismatched head sizes

Skip MEA when `head_size != v_head_size` in GQA mode (LaunchUngroup and LaunchConcatNewToPastKV require matching dimensions).

4. Remove ReduceSum empty tensor assert

The existing `ORT_ENFORCE(input_dims[axis] != 0)` assertion was too strict for models that legitimately reduce over zero-length dimensions.

Testing