Fix CUTLASS FMHA BiasLoader alignment for unaligned kernel path#28369
Open
justinchuby wants to merge 2 commits into main from
Conversation
BiasLoader hardcoded 128-bit (8 fp16-element) vectorized loads via `ElementsPerAccess = 128 / sizeof_bits<scalar_t>`, regardless of the `isAligned` template parameter. When the attention bias stride (`total_sequence_length`) was not a multiple of 8, the unaligned kernel was selected, but BiasLoader still issued 128-bit loads on the bias, causing `cudaErrorMisalignedAddress`.

Fix: use the kernel's alignment (`kMinimumAlignment` = 4 for the unaligned path, `kAlignmentA` = 8 for the aligned path) as BiasLoader's `ElementsPerAccess`. This allows the unaligned kernel to use 64-bit loads for the bias.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
This was referenced May 5, 2026
Test the BiasLoader alignment fix with `total_kv_length` values that are not divisible by 8 (the fp16 vectorized load width). Before the fix, these would cause wrong results or crashes in the MEA kernel path.

16 test cases across 3 categories:
- MHA decode: 8 lengths (5, 7, 9, 13, 27 unaligned; 8, 16, 32 aligned)
- MHA prompt: 5 lengths (5, 7, 13 unaligned; 8, 16 aligned)
- GQA decode: 3 lengths (5, 9 unaligned; 16 aligned)

All use float16 with a 4D additive attention mask on the CUDA EP, comparing against a PyTorch reference.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
BiasLoader hardcoded 128-bit vectorized loads (`ElementsPerAccess = 128 / sizeof_bits` = 8 for fp16) regardless of the `isAligned` template flag. When the bias stride was not a multiple of 8, the unaligned kernel was selected but BiasLoader still used 128-bit loads → `cudaErrorMisalignedAddress`.

Fix: use `kAlignmentA` (4 for unaligned, 8 for aligned) instead of the hardcoded 8.

Tested with Gemma4 Attention + mask at all seq lengths 1–32.