Add flash attention v2 and INT4 CUDA for LLaMA E2E benchmarking by kunal-vaishnavi · Pull Request #20149 · microsoft/onnxruntime

kunal-vaishnavi · 2024-03-29T21:42:30Z

Description

This PR adds flash attention v2 and support for INT4 CUDA benchmarking in PyTorch.

Motivation and Context

The [flash attention v2](https://github.com/Dao-AILab/flash-attention) algorithm helps improve model performance in PyTorch. INT4 CUDA support in PyTorch is provided through the [`bitsandbytes`](https://github.com/TimDettmers/bitsandbytes) package.
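As a rough illustration of the INT4 path, the sketch below shows how a PyTorch model might be loaded in 4-bit precision through Hugging Face `transformers` and `bitsandbytes`. It is not the code from this PR: the helper names `int4_quantization_settings` and `load_int4_model` are hypothetical, and it assumes `transformers`, `accelerate`, and `bitsandbytes` are installed on a CUDA machine.

```python
def int4_quantization_settings(device: str) -> dict:
    """Return BitsAndBytesConfig keyword arguments for INT4 (hypothetical helper).

    Assumption: INT4 through bitsandbytes is only requested on CUDA devices.
    """
    if not device.startswith("cuda"):
        raise ValueError("INT4 with bitsandbytes is assumed to require a CUDA device")
    return {"load_in_4bit": True}


def load_int4_model(model_name: str, device: str = "cuda"):
    """Load a causal LM with 4-bit weights (hypothetical helper, needs a GPU)."""
    # Imports are deferred so the settings helper above can be used
    # without torch or transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(**int4_quantization_settings(device))
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        torch_dtype=torch.float16,  # dtype for the layers that stay unquantized
        device_map="auto",          # let accelerate place the weights on the GPU
    )
```

Calling `load_int4_model` with a LLaMA checkpoint name would return a model whose linear layers are quantized by `bitsandbytes`; the settings helper is kept separate so the configuration can be inspected without a GPU.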

onnxruntime/python/tools/transformers/models/llama/benchmark_e2e.py

kunal-vaishnavi added 4 commits March 28, 2024 17:27

Enable flash attention v2 for PyTorch models when benchmarking

0fce15e

Add instructions for installing flash attention v2

701d5f3

Add INT4 CUDA benchmarking for PyTorch eager

15f0ab6

Add instructions for installing PyTorch quantization

3232e42

kunal-vaishnavi added the release:1.17.3 label Mar 29, 2024

hanbitmyths reviewed Mar 29, 2024

onnxruntime/python/tools/transformers/models/llama/benchmark_e2e.py (outdated, resolved)

Use flash attention v2 for CUDA and SDPA for CPU

3e7b79e
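The device-based choice named in this commit can be sketched as follows. This is an illustration under assumptions, not the code in `benchmark_e2e.py`: the helper names `select_attn_implementation` and `load_pytorch_model` are hypothetical, and the sketch relies on the `attn_implementation` argument of `from_pretrained` in Hugging Face `transformers`, with the `flash-attn` package installed for the CUDA case.

```python
def select_attn_implementation(device: str) -> str:
    """Pick an attention backend by device (hypothetical helper).

    Flash attention v2 needs a CUDA GPU, so any other device uses
    PyTorch's scaled dot product attention (SDPA) instead.
    """
    return "flash_attention_2" if device.startswith("cuda") else "sdpa"


def load_pytorch_model(model_name: str, device: str):
    """Load a causal LM with the selected attention backend (hypothetical helper)."""
    # Imports are deferred so the selection helper above can be used
    # without torch or transformers installed.
    import torch
    from transformers import AutoModelForCausalLM

    # Flash attention v2 only supports half precision, so FP16 is assumed on CUDA.
    dtype = torch.float16 if device.startswith("cuda") else torch.float32
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=dtype,
        attn_implementation=select_attn_implementation(device),
    )
    return model.to(device)
```

Keeping the selection in a small pure function makes the CUDA versus CPU behavior easy to check without loading a model.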

hanbitmyths reviewed Mar 29, 2024

onnxruntime/python/tools/transformers/models/llama/benchmark_e2e.py (resolved)

hanbitmyths approved these changes Mar 29, 2024

kunal-vaishnavi merged commit a0ebd5f into microsoft:main Mar 30, 2024

kunal-vaishnavi merged 5 commits into microsoft:main from kunal-vaishnavi:kvaishnavi/llama-add-flash-attn
