
Update flash_attention to version 2.0.1#4323

Closed
kkk55596 wants to merge 1 commit into hpcaitech:main from kkk55596:update_flashattn

Conversation


@kkk55596 commented Jul 25, 2023

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

fixed #4322

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

Following upstream Flash Attention, I modified flash_attention.py and requirements-test.txt to support flash-attn v2.0.1.
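
For context, a minimal sketch of the kind of compatibility shim such an update might involve, assuming the v1 -> v2 rename of the unpadded entry points (illustrative only, not the PR's actual diff; requirements-test.txt would presumably pin flash-attn>=2.0.1 accordingly):

```python
# Hedged sketch: flash-attn 2.0 renamed the flash_attn_unpadded_* entry
# points to flash_attn_varlen_*; an aliased import keeps old call sites
# working across both major versions. Not the PR's actual diff.
try:
    # flash-attn >= 2.0
    from flash_attn.flash_attn_interface import (
        flash_attn_varlen_func as flash_attn_unpadded_func,
    )
except ImportError:
    # flash-attn 1.x keeps the original name
    from flash_attn.flash_attn_interface import flash_attn_unpadded_func
```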

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.


@clyang commented Jul 25, 2023

Wow. According to Flash Attention's website:

FlashAttention-2 is about 2x faster than its previous version, reaching up to 230 TFLOPs/s on A100 GPUs (FP16/BF16).

It would be great if this PR could pass review and be accepted.

@kurisusnowdeng (Contributor)

@kkk55596 Thank you so much for your contribution. We will provide the general ColoAttention interface and deprecate the other ones very soon. Thus, could you please try to replace the xformers part that we currently use with Flash Attention 2?

Here are a few tips for you (sketches of tips 1-3 appear below):

  1. We could use flash_attn_func for attention with no padding.
  2. Attention with padding will need get_seq_info_from_mask, unpad, and repad in order to work with flash_attn_varlen_func.
  3. Flash Attention 2 only supports fp16/bf16 on Ampere or newer GPUs. For other precisions or hardware, we still need xformers to accelerate attention.
  4. (Optional) Flash Attention's CUDA version does not support attention bias, while its Triton version does. We would really appreciate it if you are able to help ColoAttention support attention bias.
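
A minimal sketch combining tips 1 and 3, assuming flash-attn >= 2.0 and xformers are installed; attention_forward and use_flash_attn_2 are illustrative names, not the actual ColoAttention interface:

```python
import torch
import xformers.ops as xops
from flash_attn import flash_attn_func  # flash-attn >= 2.0


def use_flash_attn_2(q: torch.Tensor) -> bool:
    # Tip 3: flash-attn 2 only supports fp16/bf16 on Ampere (SM80) or newer.
    return (
        q.is_cuda
        and q.dtype in (torch.float16, torch.bfloat16)
        and torch.cuda.get_device_capability(q.device)[0] >= 8
    )


def attention_forward(q, k, v, dropout_p=0.0, causal=False):
    # q, k, v: (batch, seqlen, num_heads, head_dim), no padding (tip 1).
    if use_flash_attn_2(q):
        return flash_attn_func(q, k, v, dropout_p=dropout_p, causal=causal)
    # Tip 3 fallback: xformers covers other precisions and older hardware.
    bias = xops.LowerTriangularMask() if causal else None
    return xops.memory_efficient_attention(q, k, v, attn_bias=bias, p=dropout_p)
```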

flash_attn_unpadded_func,
flash_attn_unpadded_kvpacked_func,
flash_attn_unpadded_qkvpacked_func,
flash_attn_varlen_func,

Contributor

As commented in conversations
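
For reference, a hedged sketch of tip 2's unpad -> attend -> repad flow around flash_attn_varlen_func from the import list above; the inline sequence-info computation is a stand-in for ColoAttention's get_seq_info_from_mask/unpad/repad helpers, whose real signatures may differ:

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_varlen_func  # flash-attn >= 2.0


def padded_attention(q, k, v, mask, dropout_p=0.0, causal=False):
    # q, k, v: (batch, seqlen, num_heads, head_dim); mask: (batch, seqlen) bool
    batch, seqlen, num_heads, head_dim = q.shape

    # Sequence info from the mask: per-sample lengths and cumulative offsets
    # (stand-in for get_seq_info_from_mask). flash-attn expects int32 offsets.
    seqlens = mask.sum(dim=1, dtype=torch.int32)
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
    max_seqlen = int(seqlens.max())
    indices = torch.nonzero(mask.flatten(), as_tuple=False).flatten()

    def unpad(t):
        # Drop padded positions: (total_tokens, num_heads, head_dim).
        return t.reshape(batch * seqlen, num_heads, head_dim)[indices]

    out_unpad = flash_attn_varlen_func(
        unpad(q), unpad(k), unpad(v),
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        dropout_p=dropout_p, causal=causal,
    )

    # Repad: scatter outputs back into a zero-filled padded tensor.
    out = torch.zeros(batch * seqlen, num_heads, head_dim,
                      dtype=out_unpad.dtype, device=out_unpad.device)
    out[indices] = out_unpad
    return out.reshape(batch, seqlen, num_heads, head_dim)
```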

@github-actions

The code coverage for the changed files is 11%.

Name                                               Stmts   Miss  Cover
----------------------------------------------------------------------
colossalai/kernel/cuda_native/flash_attention.py     292    261    11%
----------------------------------------------------------------------
TOTAL                                                292    261    11%


@kurisusnowdeng (Contributor) commented Aug 4, 2023

Closed since the same feature has been completed by #4347.



Development

Successfully merging this pull request may close these issues.

[flashattn] Support Flash Attention to v2.0.1
