
[1/N][feat] Make ATOM work with vLLM and SGLang#126

Merged
valarLip merged 1 commit into ROCm:main from zejunchen-zejun:zejun/plugin_for_atom_1223 on Mar 3, 2026

Conversation

zejunchen-zejun (Collaborator) commented Jan 12, 2026

This PR makes ATOM work with vLLM and SGLang, which preserves the out-of-the-box (OOB) experience of these popular frameworks while adding the optimizations from ATOM.

For vLLM, this PR uses vLLM's official out-of-tree mechanism and has ATOM provide the platform, model, and attention backend to vLLM. Here are the design diagram and a performance snapshot. Compared to plain vLLM, vLLM+ATOM shows a 6-20% performance uplift.
[design diagram and performance snapshot images]
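For context on the out-of-tree mechanism: vLLM discovers platform plugins through the vllm.platform_plugins entry-point group, and the registered hook returns the dotted path of the platform class, or None to stay disabled. A minimal sketch of what ATOM's hook could look like (illustrative only; the real register_platform() lives in atom/plugin/vllm/register.py and may differ):

    # hypothetical sketch of the vLLM out-of-tree platform hook
    def register_platform():
        from atom.plugin import is_plugin_mode  # helper name taken from later discussion in this thread

        if not is_plugin_mode():
            return None  # plugin disabled: vLLM keeps its default platform
        # dotted path of the platform class (it subclasses vLLM's RocmPlatform per the review notes)
        return "atom.plugin.vllm.platform.ATOMPlatform"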

Here is the RFC:

For SGLang, this PR uses the official model impl backend mechanism. Here is the design diagram.
[design diagram image]

For attention, this PR introduces a BaseAttention class and makes paged attention and radix attention inherit from it. The implementation details of ATOM server mode and plugin mode have been moved into PagedAttentionImpl.
[attention class hierarchy diagram image]
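A rough sketch of the hierarchy described above (method names and bodies are illustrative, not copied from the PR):

    import torch

    class BaseAttention(torch.nn.Module):
        # common interface shared by both attention front-ends
        def forward(self, query, key, value, **kwargs):
            raise NotImplementedError

    class PagedAttention(BaseAttention):
        # vLLM-facing path; server-mode vs. plugin-mode details live in PagedAttentionImpl
        pass

    class RadixAttention(BaseAttention):
        # SGLang-facing path
        pass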

zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from e6e0128 to 45ec455 on January 13, 2026 14:58
zejunchen-zejun changed the title from "[WP][feat] Make ATOM can be model impl backend for vLLM and SGLang" to "[WIP][feat] Make ATOM can be model impl backend for vLLM and SGLang" on Jan 13, 2026
zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch 3 times, most recently from cabd144 to c2657a9 on January 14, 2026 07:11
zejunchen-zejun changed the title from "[WIP][feat] Make ATOM can be model impl backend for vLLM and SGLang" to "[WIP][feat] Make ATOM work as model impl backend for vLLM and SGLang" on Jan 15, 2026
zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch 4 times, most recently from ae1f5e9 to 02e39be on January 16, 2026 12:28
zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch 2 times, most recently from d0f4d79 to 2b10d8f on January 26, 2026 07:59
Comment thread atom/model_ops/paged_attention.py
Comment thread atom/plugin/attention.py
Comment thread atom/plugin/attention_mha.py
Comment thread atom/plugin/attention_mha.py
Comment thread atom/plugin/attention_mha.py Outdated
zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch 3 times, most recently from bdf7a06 to 09cc7ed on January 29, 2026 14:09
zejunchen-zejun changed the title from "[WIP][feat] Make ATOM work as model impl backend for vLLM and SGLang" to "[feat] Make ATOM work as model impl backend for vLLM and SGLang" on Feb 2, 2026
zejunchen-zejun marked this pull request as ready for review on February 2, 2026 04:01
Copilot AI review requested due to automatic review settings on February 2, 2026 04:01
Copilot AI (Contributor) left a comment

Pull request overview

This pull request enables ATOM to work as a model implementation backend for vLLM and SGLang, allowing users to specify --model-impl atom when launching these frameworks. The implementation follows an official registry mechanism and combines framework-level features from vLLM/SGLang with model-level fusion kernels from ATOM/AITER.

Changes:

  • Adds plugin infrastructure to register ATOM models and attention backends with vLLM and SGLang
  • Implements attention metadata builders and handlers for plugin mode
  • Refactors model implementations (Qwen3, Qwen3MoE, etc.) to support both server and plugin modes
  • Adds documentation recipe with setup instructions and known limitations

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 43 comments.

| File | Description |
| --- | --- |
| recipes/Model-Impl-Backend.md | Documentation and setup guide for using ATOM with vLLM and SGLang |
| atom/plugin/*.py | Core plugin infrastructure including registration, config generation, and attention handling |
| atom/models/*.py | Model implementations updated to support plugin mode with consistent APIs |
| atom/model_ops/*.py | Attention operations refactored with base classes and plugin-specific implementations |
| atom/model_loader/loader.py | Weight loading updated to support plugin mode |
| atom/config.py | Configuration extended with plugin-specific settings |
| atom/utils/*.py | Utilities updated for plugin mode support |


Comment thread atom/config.py Outdated
Comment thread atom/model_engine/model_runner.py
Comment thread atom/model_ops/base_attention.py
Comment thread atom/model_ops/attention_mha.py
Comment thread atom/model_ops/radix_attention.py Outdated
Comment thread atom/plugin/register.py Outdated
Comment thread recipes/Model-Impl-Backend.md Outdated
Comment thread recipes/Model-Impl-Backend.md Outdated
Comment thread recipes/Model-Impl-Backend.md Outdated
Comment thread recipes/Model-Impl-Backend.md Outdated
zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from 1440b34 to dd0e196 on February 2, 2026 04:34
Copilot AI review requested due to automatic review settings on February 2, 2026 04:37
zejunchen-zejun force-pushed the zejun/plugin_for_atom_1223 branch from dd0e196 to f6e3e47 on February 2, 2026 04:37
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 33 comments.



Comment thread atom/plugin/attention.py
Comment thread atom/model_ops/__init__.py Outdated
Comment thread atom/model_ops/base_attention.py Outdated
Comment thread atom/plugin/prepare.py Outdated
Comment thread atom/plugin/attention_mha.py
Comment thread atom/models/qwen3.py
Comment thread atom/models/qwen3.py
Comment thread atom/models/qwen3.py
Comment thread atom/model_ops/attentions/aiter_attention.py
Comment thread atom/plugin/attention.py Outdated
Copilot AI review requested due to automatic review settings February 2, 2026 08:38
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 29 out of 29 changed files in this pull request and generated 20 comments.



Comment thread atom/models/qwen3_moe.py
Comment thread atom/plugin/attention_mha.py
Comment thread atom/plugin/config.py
Comment thread atom/plugin/config.py Outdated
Comment thread atom/plugin/register.py
Comment thread atom/plugin/attention_mha.py
Comment thread atom/plugin/register.py
Comment thread atom/models/qwen3_moe.py
Comment thread atom/plugin/attention.py Outdated
Comment thread atom/plugin/register.py Outdated
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 3 comments.



Comment thread atom/model_ops/base_attention.py
Comment thread atom/plugin/vllm/register.py
Comment thread atom/models/qwen3_moe.py
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

atom/models/qwen3_moe.py:218

  • Inconsistent naming of parameter 'alibi_slopes' in attention class instantiations. In qwen3_moe.py at line 214, the parameter is passed positionally (fourth parameter), but in qwen3.py line 113 and other models it's passed as a named argument 'alibi_slopes=None'. This inconsistency could lead to maintenance issues. Consider using named arguments consistently across all instantiations for clarity.
        self.attn = ops.ATTN_CLS(
            self.num_heads,
            self.head_dim,
            self.scaling,
            self.num_kv_heads,


Comment thread atom/plugin/attention_mha.py
Comment thread atom/models/qwen3_moe.py
Comment thread atom/utils/backends.py
# as unified_attention
from atom.plugin import is_vllm

if is_vllm() and "unified_attention" in node.name:
ZhangLirong-amd (Contributor) left a comment

I remember attention is always a splitting op via the @mark_spliting_op decorator; why do we need this additional condition?

zejunchen-zejun (Collaborator, Author) commented Feb 28, 2026

Hi, @ZhangLirong-amd
I moved the original attention forward into a class named PagedAttention. Here is its forward code in this PR:

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        positions: torch.Tensor = None,
        q_scale: Optional[torch.Tensor] = None,
        qkv: torch.Tensor = None,
        **kwargs,
    ):
        if is_vllm():
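            # plugin mode: ATOM runs as a vLLM out-of-tree platform, so route through the official vLLM attention path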
            output = unified_attention_with_output_base_for_plugin_mode(
                query,
                q_scale,
                key,
                value,
                positions,
                layer_name=self.layer_name,
                use_mla=self.use_mla,
                qkv=qkv,
            )
            return output

        # for atom server mode
        output = torch.ops.aiter.unified_attention_with_output_base(
            query, q_scale, key, value, positions, self.layer_name, self.use_mla, qkv
        )
        return output

In ATOM plugin mode (ATOM working as an out-of-tree platform for vLLM), unified_attention_with_output_base_for_plugin_mode is called on this path, and it is not decorated with mark_spliting_op. It is not a registered torch.compile op because it calls the official vLLM Attention interface. The official vLLM Attention forward in turn calls the registered op torch.ops.vllm.unified_attention, so the compiled graph already contains a standalone node named unified_attention. We just need to pick out this node and split it from the root graph on the ATOM side.

In ATOM server mode (ATOM working as a standalone engine), the logic is unchanged: unified_attention_with_output_base is called, and it is a registered op because it is decorated with mark_spliting_op.

Based on the above, I added a heuristic split-judging function, _split_judge_func, to handle the different scenarios.
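A sketch of what such a judging function could look like, based only on the explanation above (the signature and the server-mode check are assumptions; the actual _split_judge_func may differ):

    from atom.plugin import is_vllm

    def _split_judge_func(node, marked_splitting_ops) -> bool:
        # plugin mode: vLLM's registered attention op shows up in the traced graph
        # as a node whose name contains "unified_attention", so split on the node name
        if is_vllm() and "unified_attention" in node.name:
            return True
        # server mode: ops decorated with @mark_spliting_op are registered splitting ops,
        # so match against that set (hypothetical representation of the check)
        return node.name in marked_splitting_ops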

ZhangLirong-amd (Contributor) replied

Thanks, zejun. Got it. I also notice vLLM marks these ops as splitting_ops (https://github.com/vllm-project/vllm/blob/main/vllm/config/compilation.py#L665-L679); in this version, we only split unified_attention, right?

zejunchen-zejun (Collaborator, Author) replied

> vllm mark these ops as splitting_ops

Yes, vLLM marks these attention-related ops as splitting ops, but that does not affect the ATOM side: when ATOM works as the out-of-tree platform, compilation is owned by ATOM. ATOM calls torch.compile to compile the model and decides how to split the graph. The only thing vLLM does is call torch.ops.vllm.unified_attention.

> we only splitting unified_attention right?

Yes, for now we only split unified_attention. In the future, we can extend the split heuristic here.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 1 comment.



Comment thread recipes/SGLang-ATOM-Model-Impl-Backend.md
Comment thread atom/model_ops/attention_mha.py Outdated
# this method will just be called by vLLM and there is no logic in this method
# as ATOM handles the process after loading weights for all ops by itself
def process_weights_after_loading(self, act_dtype: torch.dtype = torch.bfloat16):
    pass
Collaborator left a comment

maybe raise NotImplementedError here?

zejunchen-zejun (Collaborator, Author) commented Feb 28, 2026

The process_weights_after_loading method of PagedAttentionImpl has no logic because we indeed have no post-processing to do after the weights finish loading. However, even though it is empty, the method is still called by vLLM here: https://github.com/vllm-project/vllm/blob/f5d1281c9d1b96cb4f046f1ec2c53a525f319098/vllm/model_executor/layers/attention/attention.py#L503, so we add it to PagedAttentionImpl to satisfy that call.

It is only used on the OOT path, so we removed this method from here and added it to the class through the decorator instead.
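Conceptually, the decorator approach looks like the following sketch (it reuses the decorator name mentioned later in this thread, but the details are assumptions rather than the actual code):

    import torch

    def PagedAttentionImplDecoratorForPluginMode(cls):
        # Only the vLLM out-of-tree path ever calls process_weights_after_loading,
        # so the no-op is attached here instead of living in the core class.
        def process_weights_after_loading(self, act_dtype: torch.dtype = torch.bfloat16):
            pass  # ATOM already post-processes weights for all ops itself

        cls.process_weights_after_loading = process_weights_after_loading
        return cls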

try:
    from vllm.attention.layer import Attention, AttentionType
except ImportError:
    from vllm.model_executor.layers.attention import Attention
Collaborator left a comment

What's the reason for using the vLLM/SGLang attn class?

zejunchen-zejun (Collaborator, Author) replied

The attention op is a little different from other ops (Linear/FusedMoE). We need to register the attention backend with vLLM and call the official vLLM attention op inside ATOM, because vLLM owns two things when ATOM runs in plugin mode (a rough sketch follows the list):

  1. It triggers attention metadata building in the vLLM model runner. If we do not register the attention backend with vLLM, the metadata cannot be built before the model forward runs.
  2. It triggers get_impl_cls to find the attention implementation class. This method is called in the official attention op: https://github.com/vllm-project/vllm/blob/f5d1281c9d1b96cb4f046f1ec2c53a525f319098/vllm/model_executor/layers/attention/attention.py#L315. If we do not call the official attention op, the impl class will never be invoked by vLLM.
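A simplified sketch of the backend hook involved (the class name, get_name() value, and the module defining PagedAttentionImpl are assumptions; only get_impl_cls comes from the discussion above):

    class ATOMAttentionBackend:
        # registered with vLLM so the model runner builds attention metadata
        # for these layers before the forward pass
        @staticmethod
        def get_name() -> str:
            return "ATOM_ATTENTION"  # assumed backend name

        @staticmethod
        def get_impl_cls():
            # vLLM's official Attention layer calls this hook to locate the implementation
            from atom.plugin.attention_mha import PagedAttentionImpl  # assumed module path
            return PagedAttentionImpl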

)
from atom.model_ops.activation import SiluAndMul
from atom.model_ops.attention_mla import MLAModules, is_rocm_aiter_fp4bmm_enabled
from atom.model_ops.base_attention import Attention
Collaborator left a comment

Can we find a better way... instead of modifying all the model files?

zejunchen-zejun (Collaborator, Author) commented Feb 28, 2026

Yes, that makes sense. Fixed. I changed the code here and hid the attention selection logic in the base_attention file; as a result, there are no code changes in the model files. Developers can construct the model and call self.attn = Attention(...) as before.
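In other words, model code keeps a single entry point and the concrete class is chosen inside base_attention. A minimal sketch of the idea (module paths and the selection rule are assumptions, not the actual implementation):

    from atom.plugin import is_sglang  # assumed import location for the mode helper
    from atom.model_ops.paged_attention import PagedAttention  # assumed module path
    from atom.model_ops.radix_attention import RadixAttention  # assumed module path

    def _select_attn_cls():
        # hypothetical selector: pick the concrete attention class per framework backbone
        if is_sglang():
            return RadixAttention
        return PagedAttention  # vLLM plugin mode and ATOM server mode

    class Attention:
        def __new__(cls, *args, **kwargs):
            # models simply call Attention(...); the mode-specific class is hidden here
            return _select_attn_cls()(*args, **kwargs)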

attn_type=AttentionType.DECODER,
kv_sharing_target_layer_name=None,
**extra_impl_args,
)
Collaborator left a comment

maybe we can override process_weights_after_loading here instead of the if..else in loader.py for the following logic:

    if isinstance(module, Attention):
        module.process_weights_after_loading(act_dtype=act_dtype)

zejunchen-zejun (Collaborator, Author) replied

It makes perfect sense. We have overridden the process_weights_after_loading method; as a result, the if-else logic in loader.py is removed in this PR.

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 36 out of 36 changed files in this pull request and generated 3 comments.



Comment thread atom/plugin/vllm/register.py
Comment thread atom/plugin/config.py
Comment thread atom/plugin/config.py
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated 6 comments.



Comment thread recipes/SGLang-ATOM-Model-Impl-Backend.md
Comment thread atom/models/qwen3_moe.py
Comment thread atom/model_ops/base_attention.py Outdated
Comment thread atom/plugin/vllm/register.py
Comment thread recipes/SGLang-ATOM-Model-Impl-Backend.md
Comment thread recipes/SGLang-ATOM-Model-Impl-Backend.md
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated no new comments.



zejunchen-zejun (Collaborator, Author) commented

Hi, @valarLip @ChuanLi1101 @sunway513 @wuhuikx

I have tried to resolve the significant comments. Could you help review the PR again?

Thank you!

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.



Comment thread atom/plugin/attention_mha.py Outdated
Comment thread recipes/vLLM-ATOM-OOT-Plugin-Backend.md
ChuanLi1101 (Collaborator) left a comment

Thanks for the hard work. It took me a while to review the PR. I’ve left some comments on a few more serious issues that may cause bugs, for your reference.

Comment thread atom/plugin/attention.py Outdated
Comment thread atom/plugin/prepare.py
Comment thread atom/plugin/vllm/platform.py Outdated
Comment thread atom/models/qwen3.py
Comment thread atom/plugin/attention_mha.py
Comment thread atom/plugin/attention_mha.py Outdated
Comment thread atom/plugin/attention_mha.py Outdated
Comment thread atom/plugin/vllm/model_wrapper.py
Comment thread atom/plugin/attention.py
Comment thread atom/plugin/attention.py
zejunchen-zejun (Collaborator, Author) replied

> Thanks for the hard work. It took me a while to review the PR. I’ve left some comments on a few more serious issues that may cause bugs, for your reference.

Thank you for the valuable suggestions. I will resolve them soon!

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.



Comment thread atom/plugin/attention.py Outdated
Comment thread atom/model_loader/loader.py Outdated
Comment thread recipes/vLLM-ATOM-OOT-Plugin-Backend.md
sunway513 (Collaborator) commented Mar 3, 2026

PR #126 Review: [1/N][feat] Make ATOM work with vLLM and SGLang

+3,243 / -179 | 32 files | 21 commits


Overall Assessment

This is an ambitious PR that enables ATOM to function as an out-of-tree plugin for vLLM and SGLang. The architectural approach is sound — leveraging vLLM's official OOT mechanism makes the design transparent to vLLM with zero upstream code changes. The reported 6-20% performance uplift is compelling.

I performed an initial review and then double-checked each finding against the latest commit on this branch. Several issues flagged earlier (by myself and others on older commits) have already been addressed. Below are the remaining items.

Note: An earlier version of this comment listed 7 critical issues. After verification against the latest code, most were either already fixed or did not hold up on closer inspection. This updated version reflects the corrected assessment.


Remaining Issues

1. extend_for_sliding_window type annotation mismatch — atom/plugin/attention_mha.py

The function signature declares k_scale: float and v_scale: float, but in practice self.k_scale / self.v_scale (which are torch.Tensor or None) are passed in, and the downstream cp_mha_gather_cache expects torch.Tensor. The code works at runtime, but the type annotation is misleading and will cause issues with type checkers and IDE tooling.

2. extend_workspace buffer bypasses framework memory management — atom/plugin/attention.py

The workspace buffer [2, 32*1024, num_kv_heads, head_dim] is allocated outside of vLLM's cache manager. For a model like Qwen3-235B (8 kv heads, 128 head_dim, bf16), this is ~128 MB of untracked GPU memory. At high gpu_mem_utilization, this creates OOM risk. Also flagged by @wuhuikx — recommend at minimum adding a prominent comment/warning, and ideally integrating with vLLM's memory accounting.
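As a quick check on the ~128 MB figure (bf16 = 2 bytes per element):

    # [2, 32*1024, num_kv_heads=8, head_dim=128] in bf16
    elems = 2 * 32 * 1024 * 8 * 128
    print(elems * 2 / 2**20)  # -> 128.0 MiB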


Design Suggestions (Non-blocking)

3. Heavy use of decorators increases debugging difficulty

Four decorators dynamically modify class inheritance chains and methods at import time (PagedAttentionImplDecoratorForPluginMode, AiterAttentionMetadataBuilderDecoratorForPluginMode, AiterBackendDecoratorForPluginMode, FusedMoEDecoratorForPluginMode). This makes stack traces harder to follow and debugging more difficult. Worth considering more explicit inheritance patterns in future iterations.

4. is_plugin_mode() / is_vllm() / is_sglang() scattered across core code

Core files (attention_mha.py, base_attention.py, embed_head.py, loader.py, moe.py) now contain plugin mode conditionals throughout. Consider using strategy pattern or abstract interfaces to better isolate plugin-specific behavior in future PRs.
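For illustration only, one possible shape for such an abstraction (purely a sketch, not part of this PR):

    from abc import ABC, abstractmethod

    class FrameworkAdapter(ABC):
        # hypothetical strategy interface; each adapter isolates framework-specific behavior
        @abstractmethod
        def attn_backend_name(self) -> str: ...

    class VllmPluginAdapter(FrameworkAdapter):
        def attn_backend_name(self) -> str:
            return "paged_attention"

    class SglangPluginAdapter(FrameworkAdapter):
        def attn_backend_name(self) -> str:
            return "radix_attention"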

5. SGLang support is incomplete

Multiple locations raise NotImplementedError for SGLang paths. Agree with @wuhuikx's suggestion to focus this PR on vLLM only and split SGLang into a follow-up PR.

6. Commit history cleanup

21 commits with several messages like "add" or "make lint happy". Recommend squashing before merge.


Previously Flagged Issues — Verified as Resolved or Not Applicable

For transparency, the following were flagged earlier but do not hold up against the latest code:

  • elif 0: dead code → Fixed in latest commit (now elif use_triton_attn and self.rotary_emb is not None)
  • positions None crash → positions is passed from model forward, not from Context.positions; dummy runs exit early
  • sliding_window None TypeError → Proper None guard exists (if sliding_window is None or sliding_window == -1)
  • paged_attention_triton_plugin_mode arg name mismatch → Fixed in latest commit, call sites now use k_cache= / v_cache=
  • ATOMPlatform fallback missing method → When disabled, ATOMPlatform = None and register_platform() returns None; no class instantiation occurs
  • model_runner=None crash in aiter_attention.py → In plugin mode, the decorator replaces __init__ entirely, so the original code path is not reached

CI/CD Coverage Requirement

See issue #255 for the full CI enhancement plan. Summary below:

CPU Unit Tests — Validated and Ready to Cherry-Pick

23 tests across 4 files have been developed, validated, and pass on CPU without GPU/vllm/aiter/triton dependencies (runs in ~1 second). The code is available on branch ci/plugin-mode-test-coverage in the ATOM repo for cherry-picking.

| Test File | Tests | Coverage |
| --- | --- | --- |
| tests/test_plugin_prepare.py | 7 | is_vllm(), is_sglang(), is_plugin_mode(), _set_framework_backbone(), invalid framework, case insensitivity |
| tests/test_plugin_config.py | 6 | PluginConfig defaults, vllm/sglang mode fields, field completeness |
| tests/test_plugin_vllm_register.py | 6 | register_platform() enable/disable, register_model() skip, model registry overrides, set_attn_cls() → PagedAttention/RadixAttention |
| tests/test_plugin_vllm_platform.py | 4 | ATOMPlatform None when disabled, inherits RocmPlatform when enabled, returns ATOM backend, fallback when attention disabled |

Also included on the branch:

  • .github/workflows/atom-plugin-test.yaml — New workflow (CPU unit tests on every PR + GPU smoke test)
  • .github/scripts/atom_plugin_test.sh — vLLM/SGLang plugin launch + inference + accuracy script

GPU Tests — Concept Only (For Follow-Up Development)

The following GPU test levels are proposed but not yet implemented — they require actual GPU hardware and full vLLM + AITER stack:

| Level | Description | GPU | Est. Time | Priority |
| --- | --- | --- | --- | --- |
| L1: Plugin wiring | Decorator application, method injection, plugin discovery | 1× MI355 | ~5 min | P0 |
| L2: Kernel dispatch | Verify correct attention kernel selected per config (fusion/triton/asm paths, sliding window, FP8 vs BF16) | 1× MI355 | ~15 min | P1 |
| L3: E2E correctness | Plugin mode vs server mode output consistency, accuracy (gsm8k), multi-turn | 8× MI355 | ~30 min | P1 |
| L4: Perf regression | Throughput comparison plugin vs server mode (>= 95% baseline) | 8× MI355 | ~60 min | P2 (nightly) |

Positive Aspects

  • Leverages vLLM's official OOT mechanism — zero upstream code changes needed
  • Sound attention abstraction hierarchy (BaseAttention → PagedAttention / RadixAttention)
  • 6-20% performance uplift backed by benchmark data
  • CI passing
  • Recipe documentation and RFC provided
  • Good responsiveness to review feedback — multiple issues from earlier reviews already addressed

Updated after double-checking all findings against the latest commit on this branch. CPU tests validated locally — 23/23 passing.

Comment thread atom/plugin/attention.py Outdated
zejunchen-zejun (Collaborator, Author) commented Mar 3, 2026

  1. extend_for_sliding_window type annotation mismatch — atom/plugin/attention_mha.py
  2. extend_workspace buffer bypasses framework memory management — atom/plugin/attention.py
  3. Heavy use of decorators increases debugging difficulty
  4. is_plugin_mode() / is_vllm() / is_sglang() scattered across core code
  5. SGLang support is incomplete
  6. Commit history cleanup

Thank you for the comments. Let me fix them and give feedback:

  1. Fixed the type mismatch.
  2. Added a warning for frontend users about the untracked GPU memory usage. This memory stores the fetched KV cache for the extend path (chunked prefill). Accounting for it in vLLM's memory budget is not easy, because attention is entirely bypassed during vLLM's profiler run; vLLM assumes attention will not consume any extra memory.
  3. Yes, it makes sense. We considered the side effects of the decorators but still plan to use them. The purpose is to isolate plugin-mode behavior from server-mode behavior while reusing the same component class, and the decorators AiterBackendDecoratorForPluginMode and FusedMoEDecoratorForPluginMode are quite simple: they just add some methods or modify the class name to follow the vLLM calling convention. The backbone code for FusedMoE and AiterBackend is shared between server mode and plugin mode.
  4. I fully agree. By design principle, it is a bit unsafe to maintain this state across the core components. We plan to use an abstract interface for checking the current mode instead of is_vllm / is_sglang in future PRs.
  5. For SGLang, this PR only enables basic functionality and falls back attention to SGLang mainline. We have another PR (based on this PR) that integrates radix attention into ATOM for the SGLang+ATOM mode.
  6. Done!

Comment thread atom/plugin/attention_mha.py Outdated


wuhuikx previously approved these changes Mar 3, 2026
ganyi1996ppo previously approved these changes Mar 3, 2026
framework

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.



Comment thread atom/model_ops/base_attention.py
Comment thread atom/model_ops/base_attention.py
Comment thread atom/plugin/attention_mha.py
Comment thread atom/plugin/vllm/register.py
Comment thread recipes/SGLang-ATOM-Model-Impl-Backend.md
valarLip (Collaborator) left a comment

cool, nice work

zejunchen-zejun (Collaborator, Author) replied

Thank you @valarLip

ChuanLi1101 (Collaborator) left a comment

LGTM
