Fix FlashMLA combine grid dimensions for long query batches#182

Merged
interestingLSY merged 1 commit into deepseek-ai:main from PerkzZheng:fix-combine-grid-dims
Apr 29, 2026

Conversation

@PerkzZheng
Contributor

Summary

  • Swap the FlashMLA combine kernel launch grid from [batch_size, s_q, h_q / BLOCK_SIZE_M] to [s_q, batch_size, h_q / BLOCK_SIZE_M].
  • Update the kernel's blockIdx interpretation to preserve the same logical (batch, query, head-block) mapping.

Motivation

This avoids putting large query-token counts on gridDim.y, which can exceed CUDA's grid Y dimension limit for long prefill/decode-style query batches and trigger an invalid launch configuration.

Related vLLM issue: vllm-project/vllm#27043

Tests

  • git diff --check

@interestingLSY
Collaborator

This change, which puts params.b on gridDim.y, may cause really large batches (and I wouldn't rule out someone actually running them, since people have all kinds of weird use cases) to fail to launch. I'd prefer to use b * s_q as gridDim.x instead of putting b into gridDim.y as it is now. Would it be convenient for you to make that change?

@PerkzZheng
Contributor Author

> This change, which puts params.b on gridDim.y, may cause really large batches (and I wouldn't rule out someone actually running them, since people have all kinds of weird use cases) to fail to launch. I'd prefer to use b * s_q as gridDim.x instead of putting b into gridDim.y as it is now. Would it be convenient for you to make that change?

sure, that is a simple change. we can also use a static persistent kernel (loop inside), but it doesn't make too much difference currently. I will use what you have suggested for now.

@PerkzZheng PerkzZheng force-pushed the fix-combine-grid-dims branch from 0b5ee59 to 695e612 on April 29, 2026 08:58
@PerkzZheng
Contributor Author

@interestingLSY it is done. thanks.

@interestingLSY interestingLSY merged commit 9241ae3 into deepseek-ai:main Apr 29, 2026