Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking#20149

Merged
kunal-vaishnavi merged 5 commits into microsoft:main from kunal-vaishnavi:kvaishnavi/llama-add-flash-attn
Mar 30, 2024
Conversation

@kunal-vaishnavi
Contributor

Description

This PR adds flash attention v2 and support for INT4 CUDA benchmarking in PyTorch.

Motivation and Context

The [flash attention v2](https://github.com/Dao-AILab/flash-attention) algorithm helps improve model performance in PyTorch. Support for INT4 CUDA in PyTorch is provided through the [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) package.
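As a rough illustration of the flash attention path (a hedged sketch, not the PR's benchmark code), PyTorch's built-in `scaled_dot_product_attention` is the standard entry point that can dispatch to a flash attention v2 kernel on supported CUDA hardware, falling back to a math implementation on CPU; the tensor shapes below are hypothetical:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: batch=1, heads=8, seq_len=128, head_dim=64.
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# On a supported CUDA GPU this call can dispatch to a flash attention
# kernel; on CPU it uses the reference math implementation instead.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(tuple(out.shape))  # same shape as the query tensor
```

For the INT4 side, `bitsandbytes` integrates with Hugging Face `transformers` via `BitsAndBytesConfig(load_in_4bit=True)` passed as `quantization_config` to `from_pretrained`; the exact flags the benchmark uses are defined in the PR itself and are not reproduced here.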

@kunal-vaishnavi kunal-vaishnavi merged commit a0ebd5f into microsoft:main Mar 30, 2024
YUNQIUGUO pushed a commit that referenced this pull request Apr 2, 2024
TedThemistokleous pushed a commit to TedThemistokleous/onnxruntime that referenced this pull request May 7, 2024
