
Add multi-latent attention, profiling instrumentation, other perf fixes #8

Merged

andrewkchan merged 16 commits into main from mla on May 2, 2025

Conversation

@andrewkchan (Owner) commented Apr 30, 2025

Adds a version of multi-latent attention and profiling instrumentation. Also adds a CLI option -L to "pin" (lock) model weights in memory, which fixes an issue on my test machine (AWS r6a.12xlarge) where pages of the exported tensors kept getting evicted by the OS, causing severe performance degradation due to thrashing.
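The pinning step behind -L can be sketched as follows. This is a minimal sketch of the assumed behavior, not the PR's actual code: it calls mlock(2) via ctypes on a mapped region so the OS cannot evict those pages, with an anonymous mapping standing in for the real mmap'd weight file.

```python
import ctypes
import mmap
import os

# dlopen(NULL): the main program re-exports libc symbols on Linux/macOS.
libc = ctypes.CDLL(None, use_errno=True)

def pin(addr: int, length: int) -> None:
    """Lock [addr, addr + length) into RAM via mlock(2)."""
    if libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(length)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

weights = mmap.mmap(-1, 4096)  # stand-in for the mmap'd exported tensors
addr = ctypes.addressof(ctypes.c_char.from_buffer(weights))
try:
    pin(addr, len(weights))
    print("pinned", len(weights), "bytes")
except OSError as exc:
    # Most common failure: RLIMIT_MEMLOCK too low (raise it with ulimit -l).
    print("mlock failed:", exc)
```

An alternative is mlockall(MCL_CURRENT | MCL_FUTURE), which pins the whole process image; per-range mlock keeps the lock scoped to the weight mapping.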

MLA requires the model to be re-exported with python convert.py --mla .... The engine will automatically use MLA when running a model exported with the option.

Currently, MLA is slower than MHA on short-context generations (~2.6 tok/s vs. ~4 tok/s for 128-token generations with negligible prompt length on DeepSeek-V3 quantized to Q2K). Model active bytes, ignoring the KV cache, are slightly higher at 16.29 GB for MLA vs. 14.99 GB for MHA, so some regression is not unexpected, but one of this size is surprising and indicates the effective bandwidth is lower.

Model active bytes including the KV cache (at context size 4096) are much better at 16.58 GB for MLA vs. 39.55 GB for MHA. I haven't yet tested the token throughput difference.
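The cache saving follows from MLA storing only a compressed KV latent plus the shared decoupled RoPE key per token, instead of full per-head K/V. A back-of-envelope sketch with DeepSeek-V3-like shapes (61 layers, 128 heads, kv_lora_rank 512, RoPE head dim 64; these shapes and the fp16 cache are assumptions, not values read from the export):

```python
# Back-of-envelope KV cache sizes for a 4096-token context.
n_layers, seq_len, bytes_per = 61, 4096, 2  # fp16 cache assumed

# MHA-style cache: every layer stores full K and V for every head.
n_heads = 128
k_dim = 128 + 64  # qk_nope_head_dim + qk_rope_head_dim
v_dim = 128       # v_head_dim
mha_bytes = n_layers * seq_len * n_heads * (k_dim + v_dim) * bytes_per

# MLA cache: every layer stores only the compressed KV latent plus the
# shared decoupled RoPE key per token.
kv_lora_rank, rope_dim = 512, 64
mla_bytes = n_layers * seq_len * (kv_lora_rank + rope_dim) * bytes_per

print(f"MHA-style cache: {mha_bytes / 2**30:.2f} GiB")
print(f"MLA cache:       {mla_bytes / 2**30:.2f} GiB")
```

Under these assumptions the MLA cache is roughly 70x smaller per token, which is consistent with the ~23 GB gap measured above.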

@andrewkchan andrewkchan changed the title Add multi-latent attention and profiling instrumentation Add multi-latent attention, profiling instrumentation, other perf fixes May 2, 2025
@andrewkchan andrewkchan merged commit 2c99d65 into main May 2, 2025
