Performance Update (2025.04.22) by interestingLSY · Pull Request #71 · deepseek-ai/FlashMLA

interestingLSY · 2025-04-22T09:20:28Z

This release of FlashMLA delivers a 5% to 15% performance improvement on compute-bound workloads, achieving up to 660 TFLOPS on NVIDIA H800 SXM5 GPUs.
The interface of the new version is fully compatible with the old one.
A deep-dive blog post is provided.
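As a rough illustration of what a figure like 660 TFLOPS means for a compute-bound decoding call, the sketch below converts an attention problem shape and a measured kernel time into achieved TFLOPS. The FLOP count uses the conventional 2 FLOPs per multiply-add over the QK^T and PV matmuls and ignores masking. The function names and the example shape are hypothetical. Only `s_k = 16384` comes from this PR's new test case, and the head dimensions 576 and 512 are assumed MLA values rather than anything stated here.

```python
def attention_flops(batch: int, s_q: int, s_k: int, h_q: int, d_qk: int, d_v: int) -> int:
    """Approximate FLOPs of one attention forward pass.

    Counts the QK^T matmul (d_qk per query/key pair) and the PV matmul
    (d_v per pair), at 2 FLOPs per multiply-add. Masking is ignored.
    """
    return 2 * batch * s_q * s_k * h_q * (d_qk + d_v)


def achieved_tflops(flops: float, seconds: float) -> float:
    """Convert a FLOP count and a measured wall time into TFLOPS."""
    return flops / seconds / 1e12


def time_at_tflops(flops: float, tflops: float) -> float:
    """Time in seconds the same work would take at a given throughput."""
    return flops / (tflops * 1e12)


if __name__ == "__main__":
    # Hypothetical shape: only s_k = 16384 appears in this PR.
    flops = attention_flops(batch=128, s_q=1, s_k=16384, h_q=128, d_qk=576, d_v=512)
    t_new = time_at_tflops(flops, 660.0)  # time at the quoted peak throughput
    print(f"work: {flops / 1e12:.3f} TFLOP, time at 660 TFLOPS: {t_new * 1e3:.3f} ms")
    # A throughput gain of g (e.g. 0.05 to 0.15) shortens the time by 1 / (1 + g).
    for gain in (0.05, 0.15):
        print(f"+{gain:.0%} throughput -> {1 / (1 + gain):.3f}x the original time")
```

In practice, `seconds` would come from a GPU timer around the kernel call. Passing that measured time to `achieved_tflops` gives a number comparable to the one quoted above.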

* Fix benchmark script
* Performance optimization for compute-bound cases
* Add new testcase (s_k = 16384)
* Update README.md
* Update comment
* Update README.md
* Add the deep-dive blog
* Add background color for MLA Kernel Sched.drawio.svg
* Use relative path for the schedule image
* Move flash_mla.h to kernels/params.h

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

interestingLSY added 10 commits April 21, 2025 15:42

* Fix benchmark script (063ffa8)
* Performance optimization for compute-bound cases (287061e)
* Add new testcase (s_k = 16384) (69b6482)
* Update README.md (9352b7a)
* Update comment (15f3897)
* Update README.md (c7996e9)
* Add the deep-dive blog (984059a)
* Add background color for MLA Kernel Sched.drawio.svg (65451d6)
* Use relative path for the schedule image (c7123cb)
* Move flash_mla.h to kernels/params.h (828a19c)

beginlner merged commit c2067be into deepseek-ai:main Apr 22, 2025

beginlner merged 10 commits into deepseek-ai:main from interestingLSY:main
