Fix CUDA EP: opset 24 kernel registrations + CUTLASS alignment + MEA dispatch #28365
justinchuby wants to merge 1 commit into main
Conversation
ONNX opset 24 bumped Reshape (added float8e8m0 type) and Cast (added float8e8m0 type). ORT's CUDA EP only had registrations up to opset 23, causing these ops to fall back to CPUExecutionProvider on opset 24 models. This produced ~280 MemcpyFromHost/MemcpyToHost nodes that cascaded through the entire attention pipeline.

Fix: Version existing opset 23 registrations to (23, 23) and add new non-versioned opset 24 registrations for both Reshape and Cast. The kernel implementations are unchanged — only the registration metadata is updated.

Also fix CUTLASS FMHA BiasLoader alignment: use kAlignmentA instead of hardcoded 128-bit loads so the unaligned kernel path works with bias strides that are multiples of 4 elements (not 8).

Also fix MEA dispatch: skip MEA when head_size != v_head_size in GQA mode (LaunchUngroup and LaunchConcatNewToPastKV require matching dims).

Result: 282 memcpy → 4 memcpy for Gemma4 opset 24 CUDA EP model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
This PR updates the CUDA execution provider to handle ONNX opset 24 Cast/Reshape on GPU, adjusts CUTLASS FMHA bias loading alignment, and tightens MEA eligibility checks for unsupported GQA head-size combinations.
Changes:
- Add CUDA kernel registrations for opset 24 `Reshape` and `Cast`, while narrowing the previous opset 23 registrations to 23..23.
- Change CUTLASS FMHA `BiasLoader` to use `kAlignmentA` instead of a hardcoded 128-bit load width.
- Add an MEA guard for GQA cases where `head_size != v_head_size`, and add a stride-alignment check before using MEA with attention bias.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cuda/tensor/reshape.cc | Adds a dedicated opset 24 CUDA Reshape registration and versions the prior one to opset 23 only. |
| onnxruntime/core/providers/cuda/tensor/cast_op.cc | Splits Cast kernel registration macros into opset 23-only and opset 24+ forms. |
| onnxruntime/core/providers/cuda/llm/attention.cc | Tightens MEA dispatch conditions for GQA and bias-stride alignment. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Registers the new opset 24 Cast/Reshape kernel class entries in the CUDA EP registry. |
| onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/kernel_forward.h | Switches FMHA bias loading alignment from a hardcoded vector width to kAlignmentA. |
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCudaExecutionProvider, kOnnxDomain, 24, Float8E4M3FN, Cast);
class ONNX_OPERATOR_VERSIONED_TYPED_KERNEL_CLASS_NAME(kCudaExecutionProvider, kOnnxDomain, 23, 23, Float8E4M3FN, Cast);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCudaExecutionProvider, kOnnxDomain, 24, Float8E5M2, Cast);
class ONNX_OPERATOR_VERSIONED_TYPED_KERNEL_CLASS_NAME(kCudaExecutionProvider, kOnnxDomain, 23, 23, Float8E5M2, Cast);
if (mea_eligible && attn_mask != nullptr) {
  // NOTE: CUTLASS uses kMinimumAlignment = 4 (elements, not bytes) for the bias
  // pointer in its epilogue. total_sequence_length is the bias row stride in elements,
  // so we check alignment in element count. The contrib_ops convention (4 * sizeof(T))
  // conflates bytes with elements; we use the correct value of 4 elements here.
  // Note: on SM50/53 (Maxwell), CUTLASS kMinimumAlignment=1, so this is stricter than
  // necessary — cases with odd total_sequence_length that previously used MEA on those
  // GPUs will now fall to unfused. This is acceptable for these very old architectures.
  constexpr int min_bias_align = 4;
  if (parameters.total_sequence_length % min_bias_align != 0) {
    mea_eligible = false;
    24,
    kCudaExecutionProvider,
    (*KernelDefBuilder::Create())
        .TypeConstraint("T", DataTypeImpl::AllFixedSizeTensorTypesIRv9())
        .TypeConstraint("shape", DataTypeImpl::GetTensorType<int64_t>())
        .Alias(0, 0)
        .InputMemoryType(OrtMemTypeCPUInput, 1),
    Reshape);
titaiwangms left a comment
Thanks for tackling all three of these — the BiasLoader fix in particular is one we'd been investigating from the host-dispatch side and we think this PR has it landing in the right place. A few notes from a parallel work-stream we have on the CUDA Attention dispatch:
1. BiasLoader → kAlignmentA is the correct root-cause fix ✅
onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/kernel_forward.h:476-489 — replacing the hardcoded 128 / cutlass::sizeof_bits<scalar_t>::value with kAlignmentA is exactly right. The previous code was inconsistent with the rest of the kernel: when kIsAligned == false, Q/K/V loaders use kAlignmentA = 4 (fp16) but the BiasLoader still issued 128-bit (8-element) vectorized loads. That mismatch is the root cause of the cudaErrorMisalignedAddress you're seeing.
The subtle piece worth calling out for future readers: has_memory_efficient_attention() (memory_efficient_attention.h:68-74) already enforces (qk_head_size & 7) == 0, so the BiasLoader's per-row stride argument is not head_size-driven — it's bias_strideM = total_sequence_length. A model with a perfectly fine head_size = 64 but total_kv_length = 12 (multiple of 4, not 8) hits MEA today and is exactly the case this fix unblocks. We'd been working around it with a host-side bail in attention.cc:1395-1407 (if (total_sequence_length % 4 != 0) mea_eligible = false;); after this PR lands, that bail can be relaxed in a follow-up — DispatchIsAligned will now correctly route those cases through the unaligned kernel path with safe 4-element loads.
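To make the stride arithmetic above concrete, here is a tiny standalone sketch (assuming 2-byte fp16 elements; the loop values and names are illustrative, not taken from the kernel) showing why a bias row stride that is a multiple of 4 but not 8 satisfies 64-bit loads but not 128-bit loads:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  constexpr std::size_t kElemBytes = 2;           // assumption: fp16 (2 bytes per element)
  const int strides[] = {8, 12, 16, 20, 24, 28};  // bias row stride in elements (total_kv_length)
  for (int stride_m : strides) {
    // Byte offset at which row 1 of the bias starts.
    const std::size_t row1_offset = static_cast<std::size_t>(stride_m) * kElemBytes;
    std::printf("strideM=%2d  row1 offset=%2zu B  128-bit load ok=%d  64-bit load ok=%d\n",
                stride_m, row1_offset,
                static_cast<int>(row1_offset % 16 == 0),
                static_cast<int>(row1_offset % 8 == 0));
  }
  return 0;
}
```

strideM = 12 gives a row-1 offset of 24 bytes: 8-byte aligned, so a 64-bit (4-element) load is fine, but a 128-bit (8-element) load is not — exactly the case this fix unblocks.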
2. Suggested isolated regression test for the BiasLoader change
The "All sequence lengths 1-32 pass with GQA + mask" testing line covers the symptom end-to-end, but it would be nice to have one targeted unit test that pins this specific failure mode so a future refactor can't silently re-introduce it:
- `is_gqa = True`, `head_size = v_head_size = 64` (passes `& 7 == 0`, exercises the MEA path),
- attention mask present (so bias is non-null and `bias_strideM = total_kv_length` is what BiasLoader sees),
- `total_kv_length ∈ {12, 20, 28}` — multiples of 4 but not 8 — driven explicitly so the test fails on the old kernel and passes on the new one.
Without that, the regression coverage relies on Gemma4-shaped models hitting the case incidentally.
3. Heads-up on the attention.cc MEA dispatch guard hunk
The diff at attention.cc:1378-1383 (the new (!is_gqa || head_size == v_head_size) clause + the rewritten LaunchConcatNewToPastKV / LaunchUngroup comment) appears to be bit-identical to your already-merged #28358 (commit 1f25783). Looks like this branch was authored on a pre-#28358 base; on rebase to current main that hunk should fully collapse to no-op, leaving just the comment-text choice. Probably a one-line conflict worth resolving in your favour (the new wording is slightly more informative).
4. Cross-reference with our parallel work
We have an in-flight PR-2 on our fork (titaiwangms/onnxruntime) that adds a host-side head_size % 4 == 0 gate to the same MEA-eligibility predicate. Both of our internal reviewers independently flagged that gate as already redundant today (because (qk_head_size & 7) == 0 strictly implies % 4 == 0), so we've been keeping it strictly as documented defense-in-depth in case the upstream has_memory_efficient_attention & 7 check is ever loosened. Once #28365 lands, the semantic story for that gate gets cleaner: BiasLoader becomes correct on its own, and the only remaining % 4 dependency in the GQA path is LaunchUngroup's float2 reinterpret at attention.cc:721-724 — which is itself unreachable today via the same & 7 chain. Net effect for our PR: no rework, just a comment update to point readers at this PR as the authoritative fix.
Minor
- The PR description's bullet (4) mentions "Remove ReduceSum empty tensor assert", but I don't see that change in the current diff (5 files, all in CUDA EP / CUTLASS). Worth either restoring the change or trimming the description.
- The opset-24 Cast/Reshape registration churn is mechanical but a big LOC chunk; if reviewers want easier diffing, splitting it into a separate PR from the BiasLoader + MEA-dispatch attention work would help bisection later. Totally optional — happy to see all three in one if that's faster.
Nice work overall — this is a much better fix than working around it on the host side. 🙏
Cast, \
kOnnxDomain, \
23, \
24, \
This looks correct to me. Would you want to bump it to the latest? Or is 24 the latest?
GPU doc needs to be updated. You can download the doc from ADO artifacts and replace it.
…coverage

Addresses microsoft#28351 sub-items REG, HS4, 1c, 1e:

* HS4 (production): Add (head_size % 4 == 0) clause to the MEA dispatch predicate at core/providers/cuda/llm/attention.cc as forward-looking defense-in-depth. The clause is REDUNDANT TODAY: has_memory_efficient_attention already enforces (qk_head_size & 7) == 0 (i.e. % 8) at contrib_ops/cuda/bert/cutlass_fmha/memory_efficient_attention.h:71-72, which strictly implies % 4. We are not closing a current bug. The clause is kept as the strictest dtype-agnostic alignment floor that CUTLASS FMHA's BiasLoader actually requires (BiasLoader hardcodes a 128-bit / sizeof_bits-element alignment on Q/bias loads — 4 elements for fp32). Once microsoft#28365 lands and BiasLoader switches to kAlignmentA / DispatchIsAligned, MEA's own % 8 invariant will be loosened and this clause becomes load-bearing, preventing a correctness regression. The new comment block at the predicate site cites microsoft#28365 so the next maintainer can identify the right moment to delete it.
* REG (test): TestONNXAttentionGQAAsymmetricHeadSize pins the asymmetric v_head_size != head_size GQA path on fp16 and bf16 to guard against regression of the microsoft#28358 fix that removed the LaunchUngroup head_size == v_head_size ENFORCE.
* HS4 (test): TestONNXAttentionGQAHeadSizeMod4 sweeps head_size in {6, 10, 12, 16, 24}. Today head_sizes 6/10/12 are filtered upstream by MEA's % 8 gate and take the unfused fall-through path; this test pins that the fall-through stays numerically correct. 16/24 satisfy both % 8 and % 4 and exercise the MEA happy path. Once microsoft#28365 relaxes MEA's % 8 invariant, head_sizes 6/10 will start exercising the HS4 host-side gate directly.
* 1c (test): TestONNXAttentionGQAOutputQK pins the GQA + qk_matmul_output_mode combination (kQK raw scaled QK output) which previously had no test coverage. Threads an optional output_qk parameter through common.py's create_attention_graph_prompt and attention_prompt_func.
* 1e (test): TestONNXAttentionGQASoftcapFloat32 pins the fp32 + softcap + GQA combination (symmetric and asymmetric V head). fp32 GQA always falls through to the unfused path on CUDA; existing softcap tests are fp16/bf16, so the fp32 unfused softcap branch had no parity coverage.

All 10 new tests pass on H100 (sm_90a). Full test_gqa.py: 91 passed, 3 pre-existing flakes ('Output mismatch between two runs' determinism checks in unrelated decode-flash classes — not regressions from this change).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms left a comment
Following up after digging into the BiasLoader history — sharing what I found because it actually strengthens the case for this PR rather than complicating it.
Part 1 — Endorsement update with credit to the original design
The current host-side architecture around the unaligned MEA path was deliberate. PR #27542 (commits 6c5009286d and e0760a85d5, by @titaiwangms) introduced a two-level fall-through specifically aimed at letting bias_strideM % 4 == 0 suffice in the unaligned kernel path:
- `fmha_launch_template.h:256-280` — strengthened `DispatchIsAligned` to check all three bias strides (`strideM`, `strideH`, `strideB`) against `AlignedAK::kAlignmentQ`, intentionally routing inputs that fail `% 8` (aligned path) but pass `% 4` to `kIsAligned=false`.
- `attention.cc:1395-1407` — host bail at `min_bias_align = 4`, mirroring `kAlignmentQ = GemmType::kMinimumAlignment = 4` for the unaligned template (per the inline comment at `fmha_launch_template.h:268`: "allowing strides like 12 that the unaligned kernel handles").
The intent was right; the kernel just didn't honor it. Inside AttentionKernel, two alignment surfaces disagreed:
| Surface | Where | Unaligned-path requirement on bias_strideM |
|---|---|---|
| check_supported assertion | kernel_forward.h:653 (`p.bias_strideM % kAlignmentQ == 0`) | 4 elements (template-driven) |
| BiasLoader actual ld.global | kernel_forward.h:480-485 (pre-PR: `128 / sizeof_bits<scalar_t>::value`) | 8 elements (hardcoded 16-byte load) |
So configurations like strideM = 12, fp16, SM80 slipped through every host gate and check_supported, then OOB'd inside BiasLoader on row 1 (offset = 24 bytes, 8-byte aligned, not 16-byte aligned) → cudaErrorMisalignedAddress. The original test suite (kv_sequence_length ∈ {1, 4, 6} across the Attention4DAttnMask* family in test/providers/cpu/llm/attention_op_test.cc) never hit the %4-but-not-%8 window, which is why the gap stayed latent.
PR #28365's BiasLoader → kAlignmentA change doesn't change the design intent — it completes it. Concretely:
- Aligned path (`kIsAligned=true` → `kAlignmentA = DefaultConfig::kAlignmentA = 8`): identical 128-bit loads, no perf regression on the hot path.
- Unaligned path (`kIsAligned=false` → `kAlignmentA = kMinimumAlignment = 4`): 64-bit loads, finally matches the rest of the unaligned kernel's alignment story for Q/K/V, and `strideM = 12` works as originally intended.
- The host gate at `attention.cc:1395-1407` and the strengthened `DispatchIsAligned` checks transition from too loose (let through crashes) to exactly tight. They're both still semantically correct, so I'd suggest keeping them as documented defense-in-depth in this PR; the bail can be relaxed to a comment in a follow-up if desired.
Net: this is a strict improvement over the prior state, and a good place to land the fix. 👍
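A minimal sketch of the load-width selection the bullets above describe (assumed fp16 element width; `BiasAccessSketch` is an illustrative stand-in, not the actual CUTLASS iterator wiring):

```cpp
#include <cstdio>

// kIsAligned mirrors the dispatch decision; the constants are the fp16 values
// quoted above (DefaultConfig::kAlignmentA = 8, kMinimumAlignment = 4).
template <bool kIsAligned>
struct BiasAccessSketch {
  static constexpr int kAlignmentA = kIsAligned ? 8 : 4;  // elements per access
  static constexpr int kAccessBits = kAlignmentA * 16;    // fp16 = 16 bits per element
};

int main() {
  std::printf("aligned path:   %d-element (%d-bit) bias loads\n",
              BiasAccessSketch<true>::kAlignmentA, BiasAccessSketch<true>::kAccessBits);
  std::printf("unaligned path: %d-element (%d-bit) bias loads\n",
              BiasAccessSketch<false>::kAlignmentA, BiasAccessSketch<false>::kAccessBits);
  return 0;
}
```

The point is simply that one alignment constant drives both the compile-time access width and any host-side checks, which is what keeps the two surfaces from drifting apart again.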
Part 2 — Suggested regression test
There's one observable-coverage gap worth closing in the same PR. The Gemma4 end-to-end run is a great smoke signal, but nothing in attention_op_test.cc will catch a future refactor that silently re-introduces the BiasLoader/check_supported mismatch. Suggest something like Attention4DAttnMask_BiasStrideMod4_NotMod8, pinning the exact failure window:
- `kv_sequence_length ∈ {12, 20, 28}` — `% 4 == 0` but `% 8 != 0` (the only window the new BiasLoader template actually changes behavior for)
- `attn_mask` present (so it materializes as bias and goes through `BiasLoader`)
- `head_size = v_head_size = 64` (already `% 8`, so the test isolates the bias_strideM axis from any Q/K/V alignment story)
- fp16, SM75+ (or whatever `MIN_FMHA_*` defaults the suite already uses)
- Both a non-GQA and a GQA variant if it's cheap (the GQA variant additionally exercises the path your `attention.cc` MEA-dispatch hunk touches)
Expected behavior: pre-fix asserts cudaErrorMisalignedAddress; post-fix matches the unfused CUDA (or CPU) reference within fp16 tolerance.
Without this, the only thing standing between a future maintainer and a re-regression is the BiasLoader template parameter staying at kAlignmentA — which would be easy to revert by accident during the next CUTLASS bump (the upstream xformers reference still uses the hardcoded form).
Thanks again @justinchuby — and credit to @titaiwangms for the original design that this PR finally lets reach its intended behavior.
Four targeted follow-ups to the bot review on microsoft#28371. No production behaviour change beyond the comment text and a Python helper guard.

* core/providers/cuda/llm/attention.cc — soften the HS4 deletion criterion. The original 'delete when MEA no longer requires the %8 invariant' is necessary but not sufficient: removing the clause would also need every other host-side gate to keep head_size < 4 out of LaunchUngroup, which still ORT_ENFORCEs head_size % 4 == 0 internally (see ~line 723-724). Cite microsoft#28365 and the LaunchUngroup ENFORCE site explicitly.
* test/python/transformers/test_onnx_attention/common.py — fix output_qk negative-mode bug. Helper used 'output_qk is not None' to gate the optional 4th output, but a caller mirroring the C++ enum convention (kNone = -1 in attention_parameters.h) would pass -1 and silently get the 4th output bound + the unfused CUDA kernel populating it as raw-QK. Tighten the gate to '>= 0' across all three sites (graph node, output binding, return tuple) and update the prominent NOTE block + docstrings to spell out the convention.
* test/python/transformers/test_onnx_attention/test_gqa.py — add test_gqa_softcap_fp32_with_mask_ordering_{symmetric,asymmetric_v_head} to TestONNXAttentionGQASoftcapFloat32. The existing fp32 softcap cases passed attn_mask=None, so they could not detect a wrong softcap-vs-mask order on the unfused fp32 path (without a mask the two orders are arithmetically identical). The new tests use the same poison-V pattern as the fp16/bf16 P1 ordering guards (small softcap, V=1000 in masked slot, attn_mask=-inf there) so wrong ordering produces wild magnitudes / NaN and right ordering yields bounded finite values. Compare against attention_ref().
* test/python/transformers/test_onnx_attention/test_gqa.py — correct the TestONNXAttentionGQAOutputQK docstring. The fp16/bf16 restriction applies only to the MEA LaunchUngroup helper, not the entire GQA-on-CUDA surface; the unfused fall-through DOES support fp32, exercised by TestONNXAttentionGQASoftcapFloat32 in the same file.

All four fixes verified locally:
- 96/96 in test_onnx_attention/test_gqa.py pass (PR-2 build with HS4 dispatch + GQA-fp32 MEA exclusion).
- 4/4 TestONNXAttentionGQASoftcapFloat32 pass (2 existing + 2 new masked ordering tests).
- The P1-style ordering guard test_gqa_large_head_unfused_softcap_additive_mask_poison_fp16 still passes; the common.py output_qk change does not affect paths that don't request output_qk.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
titaiwangms left a comment
Heads-up on a follow-on consideration once this lands, raised by both the second copilot-pull-request-reviewer pass on our PR #28371 and our own internal review of the host-side gate we briefly added there — flagging here because this PR is the architecturally correct home for the fix.
The concern (dtype-aware alignment after #28365):
BiasLoader (onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/kernel_forward.h:480, parametrized on kAlignmentA at lines 435-436 by this PR) performs a 16-byte-wide load per row. The minimum head_size (= inner stride) that satisfies that load is therefore dtype-dependent:
- fp32 → 4 elements (4 × 4 B = 16 B) → `head_size % 4 == 0`
- fp16 / bf16 → 8 elements (8 × 2 B = 16 B) → `head_size % 8 == 0`
- generalized: `head_size % (16 / sizeof(T)) == 0`
Today the upstream MEA gate at onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/memory_efficient_attention.h:71 enforces (qk_head_size & 7) == 0 unconditionally, which strictly implies the fp16/bf16 requirement and over-covers fp32. This PR's kAlignmentA switch correctly relaxes the BiasLoader side; the natural follow-up is that the host-side &7 invariant becomes an over-restriction for fp32 (no harm, just leaves perf on the table) and an exactly-tight requirement for fp16/bf16 (must stay).
If/when memory_efficient_attention.h:71 is loosened to follow kAlignmentA, the right replacement isn't a single integer constant — it's a dtype-templated floor (head_size % (16 / sizeof(T)) == 0) so fp16/bf16 don't silently get admitted with head_size = 4. A small MinMEAHeadSizeAlignment<T>() helper next to the kernel traits would keep one source of truth between the kernel-side kAlignmentA and the host-side gate.
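A hedged sketch of what such a helper could look like (`MinMEAHeadSizeAlignment` and `HeadSizeMeetsMEAAlignment` are hypothetical names following the suggestion above, not existing ORT functions; `std::uint16_t` stands in for a 2-byte fp16/bf16 element type):

```cpp
#include <cstdint>

// Dtype-templated floor: one 16-byte bias load per row => 16 / sizeof(T) elements.
template <typename T>
constexpr int MinMEAHeadSizeAlignment() {
  static_assert(16 % sizeof(T) == 0, "element size must divide 16 bytes");
  return static_cast<int>(16 / sizeof(T));
}

// Hypothetical host-side gate built on the same single source of truth.
template <typename T>
constexpr bool HeadSizeMeetsMEAAlignment(int head_size) {
  return head_size % MinMEAHeadSizeAlignment<T>() == 0;
}

static_assert(MinMEAHeadSizeAlignment<float>() == 4, "fp32 floor: 4 elements");
static_assert(MinMEAHeadSizeAlignment<std::uint16_t>() == 8, "2-byte element floor: 8 elements");
static_assert(HeadSizeMeetsMEAAlignment<float>(12), "head_size 12 passes for fp32");
static_assert(!HeadSizeMeetsMEAAlignment<std::uint16_t>(12), "but not for fp16/bf16-sized elements");

int main() { return 0; }
```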
Why we're flagging this here, not in PR #28371:
We initially considered adding a host-side head_size % 4 == 0 guard in PR #28371 as forward-looking defense-in-depth. On review (internal multi-reviewer + the bot), we concluded %4 is dtype-incorrect for fp16/bf16 and that the right home for the alignment floor is this PR — the one that actually owns the BiasLoader-alignment relaxation. We've dropped the guard from #28371. No action required from you on #28371's side.
Thanks again for the root-cause fix here, @justinchuby — strictly an architectural heads-up for whatever follow-up loosens memory_efficient_attention.h:71 to match.
… tighten output_qk validation

Three substantive items + one docstring fix from round-2 reviewer feedback (bot + internal multi-reviewer consolidation).

* core/providers/cuda/llm/attention.cc — drop the host-side `head_size % 4 == 0` HS4 clause from the MEA-eligibility predicate and remove its multi-paragraph comment block. The clause is fully redundant today (`has_memory_efficient_attention` already requires `(qk_head_size & 7) == 0`, which strictly implies %4) and the comment it carried made dtype-aware alignment claims that are wrong for fp16 / bf16 (BiasLoader needs an 8-element stride, not 4, for those dtypes). The dtype-aware alignment floor properly belongs in the BiasLoader fix (microsoft#28365), not as a vestigial redundant clause here. Predicate is now exactly the upstream/main shape for HS4 purposes.
* test/python/transformers/test_onnx_attention/test_gqa.py — delete TestONNXAttentionGQAHeadSizeMod4. With the HS4 clause gone there is no host-side gate left to validate; the parameterized sweep was exercising routing equivalence vs the unfused fall-through, which is already covered by the broader MEA / unfused tests.
* test/python/transformers/test_onnx_attention/common.py — tighten the output_qk parameter validation to `output_qk in {0, 1, 2, 3}` or `None`. The previous `is not None and >= 0` guard caught the C++ `kNone = -1` sentinel but still silently accepted invalid modes 4 / 5, which would build an ONNX graph with an out-of-range `qk_matmul_output_mode` attribute and let the unfused CUDA kernel populate the 4th output as raw-QK regardless. Validation now raises immediately with a clear message at the helper boundary; the binding-allocation site downstream is simplified to `is not None` since validation has already happened. NOTE block + both helper docstrings updated to spell out the contract: `None` = disabled; `{0,1,2,3}` = the corresponding QKMatMulOutputMode; anything else raises.
* test/python/transformers/test_onnx_attention/test_gqa.py — fix the TestONNXAttentionGQAAsymmetricHeadSize docstring. The pre-microsoft#28358 `head_size == v_head_size` ENFORCE in LaunchUngroup is an MEA-path enforcement (LaunchUngroup is the GQA head-expansion helper used by MEA before its FMHA kernel), not an unfused-path one. Docstring now correctly attributes it.

Verified locally on the PR-2 build (build_pr2/, sm_90a single-arch):
- All targeted PR-2 + ordering-guard tests pass (8/8): the existing OutputQK / SoftcapFloat32 / AsymmetricHeadSize / LargeHeadUnfused poison ordering guard, plus the 2 masked fp32 ordering tests added in the previous fix-up.
- test_onnx_attention/test_gqa.py: 89/91 pass on a quiet GPU. The 2 transient failures (FloatMaskDecode, MEAGQASoftcap softcap+mask decode) both pass cleanly when re-run in isolation; they are pre-existing run-to-run flakes (rtol=0/atol=0 strict-equality asserts) under shared-GPU pressure, not caused by this commit.
- HS4 sweep class is gone (file count dropped from 96 to 91; the delta is exactly the 5 parameterized HS4 sweep cases, as expected).
- Manual negative test of the new validation: output_qk=None -> 3 outputs (disabled, OK); output_qk=2 -> 4 outputs (kQKSoftCap, OK); output_qk=-1 -> AssertionError (OK); output_qk=4 -> AssertionError (OK); output_qk=5 -> AssertionError (OK).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Fixes three CUDA EP issues affecting opset 24 models (e.g. Gemma4):
1. Missing opset 24 CUDA kernel registrations for Reshape and Cast
ONNX opset 24 bumped Reshape and Cast (added `float8e8m0` type). ORT CUDA EP only had opset 23 registrations, causing these ops to fall to CPUExecutionProvider on opset 24 models. This produced ~280 MemcpyFromHost/MemcpyToHost nodes that cascaded through the entire model.

Fix: Version opset 23 registrations to (23, 23) and add non-versioned opset 24 registrations.
Result: 282 memcpy → 4 memcpy for Gemma4 CUDA EP model.
2. CUTLASS FMHA BiasLoader alignment
BiasLoader hardcoded 128-bit vectorized loads regardless of the `isAligned` template parameter, causing `cudaErrorMisalignedAddress` when the attention bias stride (total_sequence_length) was not a multiple of 8 elements.

Fix: Use `kAlignmentA` (4 for unaligned, 8 for aligned) instead of the hardcoded 8.

3. MEA dispatch guard for GQA with mismatched head sizes

Skip MEA when `head_size != v_head_size` in GQA mode (LaunchUngroup and LaunchConcatNewToPastKV require matching dimensions).

4. Remove ReduceSum empty tensor assert

The existing `ORT_ENFORCE(input_dims[axis] != 0)` assertion was too strict for models that legitimately reduce over zero-length dimensions.

Testing