fix calc space bug #91

Merged
beginlner merged 2 commits into deepseek-ai:main from
uchihatmtkinu:feat/fix_calc_large_space_bwd
Aug 27, 2025
@uchihatmtkinu
Collaborator

workspace_bytes overflows with large batch_size and max_seq_len: B, H, and Q are int32_t, so their product can easily exceed the int32 upper bound. Reordering the expression so that sizeof(xx) comes first fixes this, because sizeof yields a 64-bit unsigned value and the rest of the product is then evaluated in 64 bits. The bug comes from example 77 in the CUTLASS codebase and will be fixed in CUTLASS as well.
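A minimal sketch of the ordering issue, for readers who want to see it in isolation. The function names and the B * H * Q * sizeof(float) formula are illustrative assumptions, not the actual expression in the kernel; the sketch also assumes non-negative sizes and a 64-bit size_t.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical check: does B * H * Q still fit in int32_t?
// The product is widened to int64_t here so the check itself cannot overflow
// for realistic sizes. Evaluating B * H * Q directly in int32_t would be
// undefined behavior once the result exceeds the int32 range.
inline bool element_count_fits_int32(int32_t B, int32_t H, int32_t Q) {
    const int64_t count = static_cast<int64_t>(B) * H * Q;
    return count <= std::numeric_limits<int32_t>::max();
}

// Hypothetical workspace formula with the sizeof term first.
// sizeof yields size_t, so each int32_t operand is converted to size_t
// before it is multiplied, and the whole product is evaluated in 64 bits
// (on a platform where size_t is 64 bits wide).
inline std::size_t workspace_bytes_wide(int32_t B, int32_t H, int32_t Q) {
    return sizeof(float) * B * H * Q;
}
```

With B = 128, H = 128, and Q = 262144, the element count is 2^32, which no longer fits in int32_t, while the sizeof-first form evaluates to 2^34 bytes (16 GiB) without wrapping.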

The workspace is still wasteful when the variance of the seqlens is large, and that could be optimized. But at least it no longer causes an error with this PR. I will improve it in the future.
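To illustrate why high seqlen variance wastes workspace, here is a rough sketch that compares a buffer sized by batch size times the longest sequence with one sized by the sum of the actual lengths. Both helpers are hypothetical, and the assumption that the workspace scales with B * H * max_seq_len is inferred from the description above rather than taken from the kernel code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: elements reserved when every sequence is treated as if it
// had the length of the longest one (B * H * max_seq_len).
inline std::size_t padded_elems(const std::vector<int32_t>& seqlens, int32_t H) {
    if (seqlens.empty()) return 0;
    const int32_t max_len = *std::max_element(seqlens.begin(), seqlens.end());
    return seqlens.size() * static_cast<std::size_t>(H) *
           static_cast<std::size_t>(max_len);
}

// Hypothetical: elements actually needed if the buffer followed the true
// lengths (H * sum of seqlens).
inline std::size_t packed_elems(const std::vector<int32_t>& seqlens, int32_t H) {
    std::size_t total = 0;
    for (int32_t len : seqlens) total += static_cast<std::size_t>(len);
    return total * static_cast<std::size_t>(H);
}
```

For seqlens of {100000, 10, 10, 10} and H = 1, the padded size is 400000 elements while the packed size is 100030, so roughly three quarters of the padded buffer is never used.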

I also noticed that the backward kernel's performance drops in this scenario (large variance in seqlens). I will try to optimize that as well.

@beginlner
Collaborator

LGTM, Thanks!

@beginlner merged commit 261330b into deepseek-ai:main on Aug 27, 2025

@SeanLi-OI
Contributor

Hi @uchihatmtkinu,
Is there any chance of optimizing workspace buffer usage in the backward pass? In some extreme cases, when I'm training for long-context extension, _flash_attn_varlen_backward tries to allocate hundreds of GB of GPU memory, which obviously cannot be satisfied.

@uchihatmtkinu
Collaborator Author

Hi @SeanLi-OI, I'm working on optimizing buffer usage in the backward pass. Hopefully, it will be done before next weekend.
