Unify strides of `o` and `do` in attention backward by Bznkxs · Pull Request #1 · Bznkxs/dkernel

Bznkxs · 2025-01-07T20:15:25Z

During the backward pass, `o` and `do` sometimes have different strides and fail the stride check:

https://github.com/linxihui/dkernel/blob/main/dkernel/ops/sparse_attn_bwd.py#L675

This happens when training a model with GQA, where the key and value need to be repeated before being passed to the kernel. When `q`, `k` and `v` are passed in with size `[B, S, H, D]` and stride `[S*H*D, H*D, D, 1]`, and `B == 1`, the output `o` has the same size and stride, but `do` has stride `[H*D, H*D, D, 1]`. I have not tested training a model without GQA.

According to this attention implementation of Megatron-LM, the stride of a dimension whose size is 1 has no meaning, so the two strides describe the same memory layout. We adapt their solution to pass the stride check here.
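To illustrate why the two strides are interchangeable when `B == 1`, here is a minimal pure-Python sketch. It is not the code from this PR: the helper names `element_offset`, `strides_equivalent` and `unify_strides` are hypothetical, and the concrete sizes are chosen only for the example. It checks that both strides map every index to the same memory offset, then rewrites the stride of `do` to match `o` on size-1 dimensions.

```python
from itertools import product


def element_offset(index, stride):
    """Memory offset (in elements) of a multi-dimensional index under a stride."""
    return sum(i * s for i, s in zip(index, stride))


def strides_equivalent(shape, stride_a, stride_b):
    """True if both strides give the same offset for every valid index of `shape`."""
    return all(
        element_offset(idx, stride_a) == element_offset(idx, stride_b)
        for idx in product(*(range(n) for n in shape))
    )


def unify_strides(shape, stride, reference):
    """Take the reference stride on size-1 dimensions and keep the rest unchanged."""
    return tuple(r if n == 1 else s for n, s, r in zip(shape, stride, reference))


# Example sizes (assumed for illustration only).
B, S, H, D = 1, 4, 2, 8
shape = (B, S, H, D)
o_stride = (S * H * D, H * D, D, 1)   # stride of o as described above
do_stride = (H * D, H * D, D, 1)      # stride of do as described above

assert o_stride != do_stride                            # a naive equality check fails
assert strides_equivalent(shape, o_stride, do_stride)   # yet the layouts are identical
assert unify_strides(shape, do_stride, o_stride) == o_stride
```

On actual tensors, one way to apply the same idea in PyTorch would be `do.as_strided(do.shape, o.stride())`, guarded by a check that the shapes match and the differing dimension has size 1. With `B > 1` the two strides above are not equivalent, so the guard matters.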

Unify strides of o and do in attention backward

4bce36c

Bznkxs merged commit ce34e4d into main Jan 7, 2025

Bznkxs deleted the Bznkxs-bwd-stride-unification branch January 7, 2025 20:15

Bznkxs mentioned this pull request Jan 7, 2025

Unify strides of o and do in attention backward linxihui/dkernel#1

Open

