
[feat][plugin] make ATOM mla attention work for vllm #265

Merged
XiaobingSuper merged 15 commits into ROCm:main from XiaobingSuper:xiaobing/oot_kimi on Mar 10, 2026

Conversation

@XiaobingSuper (Contributor) commented Mar 4, 2026

Motivation

Following #126, this PR makes ATOM MLA attention work in vLLM plugin mode. Note: sparse MLA is not supported yet and will be implemented in a follow-up.

Technical Details

The design details can be found in #126.

Test Plan

This PR tests the Kimi-K2-Thinking-MXFP4 model with TP4 on MI355:

export SAFETENSORS_FAST_GPU=1
export VLLM_ROCM_USE_AITER=1
export VLLM_RPC_TIMEOUT=1800000

export VLLM_CACHE_ROOT=/root/.cache/vllm
export TORCHINDUCTOR_CACHE_DIR=/root/.cache/inductor
export HIP_VISIBLE_DEVICES=0,1,2,3
# quick allreduce
export AITER_QUICK_REDUCE_QUANTIZATION=INT4
export ATOM_PROFILER_MORE=1

export VLLM_TORCH_PROFILER_RECORD_SHAPES=1

model_path=Kimi-K2-Thinking-MXFP4
vllm serve $model_path \
    --host localhost \
    --port 8001 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --trust-remote-code \
    --disable-log-requests \
    --gpu_memory_utilization 0.9 \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    --kv-cache-dtype fp8 \
    --max-num-batched-tokens 18432 \
    --max-model-len 16384 \
    --no-enable-prefix-caching
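
Once the server is up, a quick sanity check can be made against the OpenAI-compatible endpoint. This is only a sketch: the port matches the serve command above, and the model name and prompt are illustrative.

# assumed sanity check against the server started above
curl http://localhost:8001/v1/models
curl http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Kimi-K2-Thinking-MXFP4", "prompt": "1+1=", "max_tokens": 8}'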

Test Result

gsm8k result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.9371|±  |0.0067|
|     |       |strict-match    |     3|exact_match|↑  |0.9363|±  |0.0067|
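
A score like the one above would typically be collected with lm-evaluation-harness pointed at the served endpoint. The exact harness invocation is not recorded in this PR, so the command below is only an assumed sketch (flag names and model_args may differ slightly across lm-eval versions):

lm_eval --model local-completions \
    --tasks gsm8k \
    --num_fewshot 3 \
    --model_args model=Kimi-K2-Thinking-MXFP4,base_url=http://localhost:8001/v1/completions,num_concurrent=16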
  

Submission Checklist

Copilot AI review requested due to automatic review settings March 4, 2026 11:49
Copilot AI (Contributor) left a comment

Pull request overview

Adds vLLM plugin-mode support for ATOM’s MLA attention path (non-sparse), including backend selection, metadata plumbing, and DeepSeek V3 model registration/loading so MLA can run end-to-end under vLLM.

Changes:

  • Route vLLM’s use_mla attention selection to an ATOM MLA backend and add MLA-specific plugin-mode metadata builders.
  • Implement plugin-mode MLA forward/prefill/decode logic (including positions capture for graph mode).
  • Register DeepSeek V3 as a supported vLLM plugin model and add a plugin-mode load_weights implementation.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 9 comments.

|File|Description|
|----|-----------|
|atom/utils/backends.py|Extends compilation-cache hashing to ignore <frozen os> traced “files”.|
|atom/plugin/vllm/register.py|Patches vLLM process_weights_after_loading for Attention/MLAAttention.|
|atom/plugin/vllm/platform.py|Selects ATOM MLA backend when attn_selector_config.use_mla is true.|
|atom/plugin/vllm/model_wrapper.py|Copies positions into a static buffer for graph-mode MLA correctness.|
|atom/plugin/attention_mla.py|New: plugin-mode MLAAttention implementation helpers (prefill/decode/DCP).|
|atom/plugin/attention.py|Adds MLA plugin-mode metadata builders + backend wiring; renames plugin metadata class.|
|atom/models/deepseek_v2.py|Adds DeepSeek V3 support + plugin-mode load_weights.|
|atom/model_ops/utils.py|Removes duplicate per_tensor_dequantize implementation (keeps the canonical one).|
|atom/model_ops/paged_attention.py|Integrates vLLM MLAAttention usage and allocates a shared positions buffer.|
|atom/model_ops/linear.py|Ensures the activation tensor is contiguous before quantizer .view() calls.|
|atom/model_ops/base_attention.py|Adjusts the MLA unified-attn path to apply o_proj outside the MLA impl.|
|atom/model_ops/attentions/aiter_mla.py|Decorates the MLA backend/builder for plugin mode; builder init adjustments.|
|atom/model_ops/attentions/aiter_attention.py|Removes an unused import.|
|atom/model_ops/attention_mla.py|Adds plugin-mode hooks/decorator and splits v_up and o_proj responsibilities.|


Inline comment threads: atom/plugin/attention.py, atom/model_ops/paged_attention.py, atom/plugin/attention_mla.py, atom/plugin/vllm/register.py, atom/model_ops/linear.py (several marked outdated).
@XiaobingSuper (Contributor, Author) commented:

DeepSeek-R1-0528 with TP=8 has also been tested:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.9424|±  |0.0064|
|     |       |strict-match    |     3|exact_match|↑  |0.9363|±  |0.0067|
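
The launch command reuses the Test Plan setup above with the tensor-parallel size raised to 8; the exact invocation is not recorded here, so the following is only an assumed variant:

model_path=DeepSeek-R1-0528   # assumed local checkpoint path
export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
vllm serve $model_path \
    --host localhost \
    --port 8001 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --trust-remote-code \
    --gpu_memory_utilization 0.9 \
    --kv-cache-dtype fp8 \
    --max-model-len 16384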

Inline comment thread: atom/model_ops/attention_mla.py (outdated).
Copilot AI review requested due to automatic review settings March 4, 2026 13:00
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.



Inline comment threads: atom/plugin/vllm/register.py, atom/plugin/attention_mla.py, atom/plugin/attention.py, atom/model_ops/paged_attention.py, atom/model_ops/attention_mla.py (several marked outdated).
@ChuanLi1101 (Collaborator) left a comment

Left my comment FYI.

Inline comment threads: atom/model_ops/attention_mla.py, atom/plugin/attention_mla.py, atom/model_ops/paged_attention.py (several marked outdated).
Copilot AI review requested due to automatic review settings March 5, 2026 05:49
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.



Inline comment threads: atom/plugin/attention_mla.py (outdated), atom/plugin/vllm/register.py, atom/plugin/attention.py.
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.



Inline comment threads: atom/model_ops/paged_attention.py (outdated), atom/plugin/attention_mla.py, atom/plugin/attention.py.
ChuanLi1101 previously approved these changes Mar 5, 2026
@ChuanLi1101 (Collaborator) left a comment

LGTM, thanks for the quick turnaround.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.



zejunchen-zejun previously approved these changes Mar 9, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.



Inline comment threads: atom/model_ops/base_attention.py, atom/plugin/attention.py.
Copilot AI review requested due to automatic review settings March 10, 2026 03:20
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.



Inline comment threads: atom/plugin/attention.py, atom/plugin/attention_mla.py.
@XiaobingSuper XiaobingSuper merged commit 78b1a4d into ROCm:main Mar 10, 2026
14 of 16 checks passed
Jasen2201 pushed a commit to Jasen2201/ATOM that referenced this pull request Apr 10, 2026
* [feat][plugin] make ATOM mla attention work for vllm

Signed-off-by: XiaobingSuper <xiaobingzhangupc@gmail.com>

* recover unrelated code

* simplify attention.py code

Signed-off-by: XiaobingSuper <xiaobingzhangupc@gmail.com>

* update positions init

* clear code v1

* update scale use

* fix typo

* fix ruff issue

* update base_attention

* clear mla init

* clear code

* avoid copy for quant_func

* simple code

* reduce atom change

* wrap mla attention head_dim arg

---------

Signed-off-by: XiaobingSuper <xiaobingzhangupc@gmail.com>


7 participants