Fix FlashMLA combine grid dimensions for long query batches by PerkzZheng · Pull Request #182 · deepseek-ai/FlashMLA

PerkzZheng · 2026-04-29T02:02:34Z

Summary

Swap the FlashMLA combine kernel launch grid from [batch_size, s_q, h_q / BLOCK_SIZE_M] to [s_q, batch_size, h_q / BLOCK_SIZE_M].
Update the kernel's blockIdx interpretation to preserve the same logical (batch, query, head-block) mapping.
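
As an illustration of the two points above, here is a minimal host-side sketch in plain C++ of how the swapped grid can still decode to the same logical (batch, query, head-block) tuple. `Dim3`, `WorkItem`, and all function names are hypothetical stand-ins, not FlashMLA's actual code, and `num_head_blocks` stands for h_q / BLOCK_SIZE_M under the assumption that h_q divides evenly.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for CUDA's dim3 / blockIdx (illustrative only).
struct Dim3 {
    uint32_t x, y, z;
};

// The logical work item a combine block is responsible for.
struct WorkItem {
    uint32_t batch_idx;
    uint32_t query_idx;
    uint32_t head_block_idx;
};

// Original launch grid: [batch_size, s_q, h_q / BLOCK_SIZE_M].
Dim3 make_grid_original(uint32_t batch_size, uint32_t s_q, uint32_t num_head_blocks) {
    return Dim3{batch_size, s_q, num_head_blocks};
}

// Swapped launch grid: [s_q, batch_size, h_q / BLOCK_SIZE_M].
Dim3 make_grid_swapped(uint32_t batch_size, uint32_t s_q, uint32_t num_head_blocks) {
    return Dim3{s_q, batch_size, num_head_blocks};
}

// Original interpretation: x is the batch, y is the query token.
WorkItem decode_original(Dim3 block_idx) {
    return WorkItem{block_idx.x, block_idx.y, block_idx.z};
}

// Swapped interpretation: x is the query token, y is the batch.
WorkItem decode_swapped(Dim3 block_idx) {
    return WorkItem{block_idx.y, block_idx.x, block_idx.z};
}
```

Only the meaning of x and y changes; the set of work items covered by the launch is the same in both layouts.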

Motivation

This avoids putting large query-token counts on gridDim.y, which CUDA caps at 65,535 blocks (gridDim.x allows up to 2^31 - 1). Long prefill/decode-style query batches can exceed that cap and trigger an invalid launch configuration.
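
To make the limit concrete, the following host-side C++ sketch checks a requested grid against CUDA's documented maximums (2^31 - 1 for x, 65,535 for y and z on current compute capabilities). The helper name `fits_cuda_grid_limits`, the `Dim3` struct, and the sample sizes mentioned afterwards are assumptions chosen for illustration, not values from the PR.

```cpp
#include <cassert>
#include <cstdint>

// 64-bit fields so that oversized requests can be represented and rejected.
struct Dim3 {
    uint64_t x, y, z;
};

constexpr uint64_t kMaxGridX = 2147483647ULL;  // 2^31 - 1
constexpr uint64_t kMaxGridY = 65535ULL;
constexpr uint64_t kMaxGridZ = 65535ULL;

// Returns true when every grid dimension is non-zero and within CUDA's limits.
bool fits_cuda_grid_limits(Dim3 grid) {
    return grid.x >= 1 && grid.x <= kMaxGridX &&
           grid.y >= 1 && grid.y <= kMaxGridY &&
           grid.z >= 1 && grid.z <= kMaxGridZ;
}
```

With a hypothetical batch_size of 4, s_q of 100,000, and 16 head blocks, the original layout `{4, 100000, 16}` fails this check because y exceeds 65,535, while the swapped layout `{100000, 4, 16}` passes.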

Related vLLM issue: vllm-project/vllm#27043

Tests

git diff --check

interestingLSY · 2026-04-29T08:46:01Z

This change, which puts params.b on gridDim.y, may cause really large batches to fail to launch (and I wouldn't rule out someone actually running them, since people have all kinds of weird use cases). I'd prefer to use b * s_q as gridDim.x instead of putting b on gridDim.y as it is now. Would it be convenient for you to make that change?
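
A host-side C++ sketch of the suggested b * s_q flattening might look like the following. The batch-major ordering (batch index = x / s_q, query index = x % s_q) is an assumption; the thread does not say which order the merged kernel uses, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct BatchQuery {
    uint32_t batch_idx;
    uint32_t query_idx;
};

constexpr uint64_t kMaxGridX = 2147483647ULL;  // CUDA's gridDim.x limit, 2^31 - 1

// Computes gridDim.x = b * s_q in 64 bits; returns false if the product is
// zero or does not fit in gridDim.x.
bool flattened_grid_x(uint64_t b, uint64_t s_q, uint32_t* grid_x) {
    if (b == 0 || s_q == 0) return false;
    if (b > kMaxGridX / s_q) return false;  // b * s_q would exceed the limit
    *grid_x = static_cast<uint32_t>(b * s_q);
    return true;
}

// Recovers (batch, query) from a flattened blockIdx.x, assuming batch-major order.
BatchQuery decode_flat(uint32_t block_idx_x, uint32_t s_q) {
    return BatchQuery{block_idx_x / s_q, block_idx_x % s_q};
}
```

Under this layout neither b nor s_q is individually bound by the 65,535 limit on y; only their product has to fit in gridDim.x.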

PerkzZheng · 2026-04-29T08:51:51Z

Sure, that is a simple change. We could also use a static persistent kernel (with a loop inside), but it wouldn't make much difference currently. I will use what you have suggested for now.

PerkzZheng · 2026-04-29T09:02:03Z

@interestingLSY It is done. Thanks.

Swap FlashMLA combine grid dimensions

695e612

PerkzZheng force-pushed the fix-combine-grid-dims branch from 0b5ee59 to 695e612 on April 29, 2026 08:58

interestingLSY merged commit 9241ae3 into deepseek-ai:main Apr 29, 2026
