Fix CUTLASS FMHA BiasLoader alignment for unaligned kernel path#28369
Open
justinchuby wants to merge 2 commits into main from
Conversation
BiasLoader hardcoded 128-bit (8 fp16-element) vectorized loads via `ElementsPerAccess = 128 / sizeof_bits<scalar_t>`, regardless of the `isAligned` template parameter. When the attention bias stride (`total_sequence_length`) was not a multiple of 8, the unaligned kernel was selected, but BiasLoader still issued 128-bit loads on the bias, causing `cudaErrorMisalignedAddress`.

Fix: use the kernel's alignment (`kMinimumAlignment` = 4 for the unaligned path, `kAlignmentA` = 8 for the aligned path) as BiasLoader's `ElementsPerAccess`. This allows the unaligned kernel to use 64-bit loads for the bias.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
This was referenced May 5, 2026
Test the BiasLoader alignment fix with `total_kv_length` values that are not divisible by 8 (the fp16 vectorized load width). Before the fix, these would cause wrong results or crashes in the MEA kernel path.

16 test cases across 3 categories:
- MHA decode: 8 lengths (5, 7, 9, 13, 27 unaligned; 8, 16, 32 aligned)
- MHA prompt: 5 lengths (5, 7, 13 unaligned; 8, 16 aligned)
- GQA decode: 3 lengths (5, 9 unaligned; 16 aligned)

All use float16 with a 4D additive attention mask on the CUDA EP, comparing against a PyTorch reference.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Justin Chu <justinchu@microsoft.com>
BiasLoader hardcoded 128-bit vectorized loads (`ElementsPerAccess = 128 / sizeof_bits` = 8 for fp16) regardless of the `isAligned` template flag. When the bias stride was not a multiple of 8, the unaligned kernel was selected but BiasLoader still used 128-bit loads → `cudaErrorMisalignedAddress`.

Fix: use `kAlignmentA` (4 for unaligned, 8 for aligned) instead of the hardcoded 8.

Tested with Gemma4 Attention + mask at all seq lengths 1–32.