
Fix CUTLASS FMHA BiasLoader alignment for unaligned kernel path#28369

Open
justinchuby wants to merge 2 commits into main from fix-cutlass-biasloader-alignment

Conversation

@justinchuby
Contributor

BiasLoader hardcoded 128-bit vectorized loads (ElementsPerAccess = 128 / sizeof_bits = 8 for fp16) regardless of the isAligned template flag. When the bias stride was not a multiple of 8, the unaligned kernel was selected, but BiasLoader still issued 128-bit loads, causing cudaErrorMisalignedAddress.

Fix: Use kAlignmentA (4 for the unaligned path, 8 for the aligned path) instead of the hardcoded 8.

Tested with Gemma4 Attention + mask at all sequence lengths 1–32.

BiasLoader hardcoded 128-bit (8 fp16 element) vectorized loads via
ElementsPerAccess = 128 / sizeof_bits<scalar_t> regardless of the
isAligned template parameter. When attention bias stride
(total_sequence_length) was not a multiple of 8, the unaligned kernel
was selected but BiasLoader still used 128-bit loads on the bias,
causing cudaErrorMisalignedAddress.

Fix: Use kAlignmentA (which is kMinimumAlignment = 4 on the unaligned
path and 8 on the aligned path) as BiasLoader's ElementsPerAccess.
This lets the unaligned kernel fall back to 64-bit loads for the bias.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
Test the BiasLoader alignment fix with total_kv_length values that are
not divisible by 8 (the fp16 vectorized load width). Before the fix,
these would cause wrong results or crashes in the MEA kernel path.

16 test cases across 3 categories:
- MHA decode: 8 lengths (5, 7, 9, 13, 27 unaligned + 8, 16, 32 aligned)
- MHA prompt: 5 lengths (5, 7, 13 unaligned + 8, 16 aligned)
- GQA decode: 3 lengths (5, 9 unaligned + 16 aligned)

All use float16 with 4D additive attention mask on CUDA EP, comparing
against PyTorch reference.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>


Development

Successfully merging this pull request may close these issues.

ONNX Attention MEA crash for KV-shared layers with borrowed K/V
