[1/N][feat] Make ATOM work with vLLM and SGLang#126
Conversation
Pull request overview
This pull request enables ATOM to work as a model implementation backend for vLLM and SGLang, allowing users to specify --model-impl atom when launching these frameworks. The implementation follows an official registry mechanism and combines framework-level features from vLLM/SGLang with model-level fusion kernels from ATOM/AITER.
Changes:
- Adds plugin infrastructure to register ATOM models and attention backends with vLLM and SGLang
- Implements attention metadata builders and handlers for plugin mode
- Refactors model implementations (Qwen3, Qwen3MoE, etc.) to support both server and plugin modes
- Adds documentation recipe with setup instructions and known limitations
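For a quick illustration of the user-facing entry point described in the overview, here is a hedged sketch of selecting the ATOM implementation through vLLM's Python API. The model name is arbitrary, and the availability of the "atom" value for model_impl assumes this plugin is installed and registered; this is not taken from the PR itself.

```python
# Hedged usage sketch: pick ATOM as the model-implementation backend when launching vLLM.
# "Qwen/Qwen3-8B" and the "atom" value are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-8B", model_impl="atom")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```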
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 43 comments.
Show a summary per file
| File | Description |
|---|---|
| recipes/Model-Impl-Backend.md | Documentation and setup guide for using ATOM with vLLM and SGLang |
| atom/plugin/*.py | Core plugin infrastructure including registration, config generation, and attention handling |
| atom/models/*.py | Model implementations updated to support plugin mode with consistent APIs |
| atom/model_ops/*.py | Attention operations refactored with base classes and plugin-specific implementations |
| atom/model_loader/loader.py | Weight loading updated to support plugin mode |
| atom/config.py | Configuration extended with plugin-specific settings |
| atom/utils/*.py | Utilities updated for plugin mode support |
Pull request overview
Copilot reviewed 29 out of 29 changed files in this pull request and generated 33 comments.
Pull request overview
Copilot reviewed 29 out of 29 changed files in this pull request and generated 20 comments.
Pull request overview
Copilot reviewed 36 out of 36 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 36 out of 36 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
atom/models/qwen3_moe.py:218
- Inconsistent naming of parameter 'alibi_slopes' in attention class instantiations. In qwen3_moe.py at line 214, the parameter is passed positionally (fourth parameter), but in qwen3.py line 113 and other models it's passed as a named argument 'alibi_slopes=None'. This inconsistency could lead to maintenance issues. Consider using named arguments consistently across all instantiations for clarity.
```python
self.attn = ops.ATTN_CLS(
    self.num_heads,
    self.head_dim,
    self.scaling,
    self.num_kv_heads,
```
```python
# as unified_attention
from atom.plugin import is_vllm

if is_vllm() and "unified_attention" in node.name:
```
I recall that attention is always a splitting op via the @mark_spliting_op decorator; why do we need this additional condition?
Hi, @ZhangLirong-amd
I moved the original attention forward into a class named PagedAttention. Here is its forward code with this PR:
```python
def forward(
    self,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    positions: torch.Tensor = None,
    q_scale: Optional[torch.Tensor] = None,
    qkv: torch.Tensor = None,
    **kwargs,
):
    if is_vllm():
        output = unified_attention_with_output_base_for_plugin_mode(
            query,
            q_scale,
            key,
            value,
            positions,
            layer_name=self.layer_name,
            use_mla=self.use_mla,
            qkv=qkv,
        )
        return output
    # for atom server mode
    output = torch.ops.aiter.unified_attention_with_output_base(
        query, q_scale, key, value, positions, self.layer_name, self.use_mla, qkv
    )
    return output
```
For ATOM plugin mode (ATOM works as an out-of-tree platform of vLLM), unified_attention_with_output_base_for_plugin_mode is called on this path, and it is not decorated by mark_spliting_op. It is not a registered op for torch.compile because it calls the official vLLM Attention interface. The official vLLM Attention forward calls a registered op, torch.ops.vllm.unified_attention, so the compiled graph here already contains a standalone node named unified_attention. We just need to pick out this node and split it from the root graph on the ATOM side.
For ATOM server mode (ATOM works as a standalone engine), the logic is unchanged and unified_attention_with_output_base is called; it is a registered op because it is decorated by mark_spliting_op.
Based on the above, I developed a heuristic split-judging function _split_judge_func to handle the different scenarios.
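To make the heuristic concrete, here is a rough sketch of what such a split-judging function could look like. The SPLITTING_OPS set and the exact node attributes are illustrative assumptions, not the actual ATOM implementation.

```python
# Rough sketch (assumptions marked): decide whether a torch.fx node should be
# split out of the compiled graph.
import torch.fx

from atom.plugin import is_vllm

# Hypothetical registry populated by the mark_spliting_op decorator.
SPLITTING_OPS: set[str] = {"unified_attention_with_output_base"}

def _split_judge_func(node: torch.fx.Node) -> bool:
    # Plugin mode: vLLM's registered op appears as a standalone graph node.
    if is_vllm() and "unified_attention" in node.name:
        return True
    # Server mode: split on ops explicitly marked via mark_spliting_op.
    return node.name in SPLITTING_OPS
```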
Thanks, zejun. Got it. I also notice that vLLM marks these ops as splitting_ops: https://github.com/vllm-project/vllm/blob/main/vllm/config/compilation.py#L665-L679. In this version, we only split unified_attention, right?
> vllm mark these ops as splitting_ops

Yes, vLLM has marked these attention-related ops as splitting ops, but that does not affect the ATOM side: when ATOM works as an out-of-tree platform, the compilation is owned by ATOM. ATOM calls torch.compile to compile the model and decides how to split the graph. The only thing vLLM does is call torch.ops.vllm.unified_attention.
> we only splitting unified_attention right?

Yes, for now we only split unified_attention. We can extend the split heuristic here for future usage.
Pull request overview
Copilot reviewed 36 out of 36 changed files in this pull request and generated 1 comment.
```python
# this method will just be called by vLLM and there is no logic in this method
# as ATOM handles the process after loading weights for all ops by itself
def process_weights_after_loading(self, act_dtype: torch.dtype = torch.bfloat16):
    pass
```
Maybe raise NotImplementedError here?
The process_weights_after_loading method of PagedAttentionImpl has no logic because we genuinely have no post-processing to do after weight loading finishes. However, even though there is no logic here, the method is still called by vLLM here: https://github.com/vllm-project/vllm/blob/f5d1281c9d1b96cb4f046f1ec2c53a525f319098/vllm/model_executor/layers/attention/attention.py#L503, so we add this method to PagedAttentionImpl to satisfy that call.
It is only used for the OOT path, so we removed the method from here and add it to the class through the decorator.
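For illustration, a minimal sketch of how such a decorator could inject the no-op hook only for the OOT path; the decorator name and structure are assumptions, not the actual ATOM code.

```python
import torch

def add_plugin_weight_hook(cls):
    """Hypothetical decorator: attach the no-op hook vLLM expects on attention impls."""
    def process_weights_after_loading(self, act_dtype: torch.dtype = torch.bfloat16):
        # No post-processing needed: ATOM handles post-load processing itself.
        pass

    cls.process_weights_after_loading = process_weights_after_loading
    return cls
```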
```python
try:
    from vllm.attention.layer import Attention, AttentionType
except ImportError:
    from vllm.model_executor.layers.attention import Attention
```
What's the reason for using the vLLM/SGLang attention class?
The attention op is a little different from other ops (Linear/FusedMoE). We need to register the attention backend with vLLM and call the official vLLM attention op in ATOM, because vLLM owns two things when ATOM runs in plugin mode (see the registration sketch after this list):
- triggering the attention metadata build in the vLLM model runner. If we do not register the attention backend with vLLM, the metadata cannot be built before the model forward runs
- triggering get_impl_cls to find the attention implementation class. This method is called in the official attention op: https://github.com/vllm-project/vllm/blob/f5d1281c9d1b96cb4f046f1ec2c53a525f319098/vllm/model_executor/layers/attention/attention.py#L315. If we do not call the official attention op, the impl class will never be invoked by vLLM
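As a hedged sketch of the registration side, here is the general shape of how an out-of-tree plugin typically registers a model with vLLM's ModelRegistry. ATOM's real registration code may differ; the architecture-to-path mapping ("atom.models.qwen3:Qwen3ForCausalLM") is an assumption for illustration.

```python
# Hedged sketch of out-of-tree model registration with vLLM.
from vllm import ModelRegistry

def register():
    # Map an architecture name to ATOM's implementation, so vLLM loads ATOM's
    # model class (which internally constructs vLLM's official Attention layer,
    # letting vLLM build attention metadata and resolve get_impl_cls).
    ModelRegistry.register_model(
        "Qwen3ForCausalLM", "atom.models.qwen3:Qwen3ForCausalLM"
    )
```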
```python
)
from atom.model_ops.activation import SiluAndMul
from atom.model_ops.attention_mla import MLAModules, is_rocm_aiter_fp4bmm_enabled
from atom.model_ops.base_attention import Attention
```
Can we find a better way... instead of modifying all the model files?
Yes, that makes sense. Fixed. I changed the code here and hid the attention-selection logic in the base_attention file; as a result, there are no code changes to the model files. Developers can construct the model and call self.attn = Attention(...) as before.
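A minimal sketch of what the hidden selection in base_attention might look like, assuming a factory-style Attention wrapper; the module path for ATOM's PagedAttention and the constructor handling are assumptions.

```python
# Hedged sketch: pick the attention class once, so model files keep writing
# self.attn = Attention(...).
from atom.plugin import is_vllm

def _select_attention_cls():
    if is_vllm():
        # Plugin mode: delegate to vLLM's official attention layer.
        from vllm.attention.layer import Attention as VllmAttention
        return VllmAttention
    # Server mode: ATOM's own paged attention (import path assumed for illustration).
    from atom.model_ops.attention import PagedAttention
    return PagedAttention

class Attention:
    def __new__(cls, *args, **kwargs):
        # Forward construction to whichever backend class was selected.
        return _select_attention_cls()(*args, **kwargs)
```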
```python
    attn_type=AttentionType.DECODER,
    kv_sharing_target_layer_name=None,
    **extra_impl_args,
)
```
Maybe we can override process_weights_after_loading here instead of the if...else in loader.py for the following logic:
```python
if isinstance(module, Attention):
    module.process_weights_after_loading(act_dtype=act_dtype)
```
That makes perfect sense. We have overridden the process_weights_after_loading method; as a result, the if-else logic in loader.py is removed in this PR.
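A short sketch of the simplified loader loop this enables, assuming every op that needs post-load work exposes the same hook; the surrounding loader.py structure is an assumption.

```python
import torch
import torch.nn as nn

def run_post_load_hooks(model: nn.Module, act_dtype: torch.dtype = torch.bfloat16) -> None:
    # With process_weights_after_loading overridden on the Attention class itself,
    # the loader no longer needs an isinstance(module, Attention) branch.
    for module in model.modules():
        hook = getattr(module, "process_weights_after_loading", None)
        if callable(hook):
            hook(act_dtype=act_dtype)
```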
Pull request overview
Copilot reviewed 36 out of 36 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 31 out of 31 changed files in this pull request and generated 6 comments.
Pull request overview
Copilot reviewed 31 out of 31 changed files in this pull request and generated no new comments.
Hi, @valarLip @ChuanLi1101 @sunway513 @wuhuikx I have addressed the significant comments. Could you help review the PR again? Thank you!
Pull request overview
Copilot reviewed 32 out of 32 changed files in this pull request and generated 2 comments.
ChuanLi1101
left a comment
Thanks for the hard work. It took me a while to review the PR. I’ve left some comments on a few more serious issues that may cause bugs, for your reference.
Thank you for the significant suggestions. I will resolve them soon!
Pull request overview
Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.
PR #126 Review:
| Test File | Tests | Coverage |
|---|---|---|
| `tests/test_plugin_prepare.py` | 7 | `is_vllm()`, `is_sglang()`, `is_plugin_mode()`, `_set_framework_backbone()`, invalid framework, case insensitivity |
| `tests/test_plugin_config.py` | 6 | `PluginConfig` defaults, vllm/sglang mode fields, field completeness |
| `tests/test_plugin_vllm_register.py` | 6 | `register_platform()` enable/disable, `register_model()` skip, model registry overrides, `set_attn_cls()` → PagedAttention/RadixAttention |
| `tests/test_plugin_vllm_platform.py` | 4 | ATOMPlatform None when disabled, inherits RocmPlatform when enabled, returns ATOM backend, fallback when attention disabled |
Also included on the branch:
- `.github/workflows/atom-plugin-test.yaml`: new workflow (CPU unit tests on every PR + GPU smoke test)
- `.github/scripts/atom_plugin_test.sh`: vLLM/SGLang plugin launch + inference + accuracy script
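For reference, a hedged sketch of what one of the CPU unit tests listed above might look like; the import path of `_set_framework_backbone` and the exact error behavior for an invalid framework are assumptions.

```python
import pytest

from atom.plugin import is_plugin_mode, is_sglang, is_vllm
# Assumed import path and error behavior for the framework setter.
from atom.plugin import _set_framework_backbone

def test_framework_detection_vllm():
    _set_framework_backbone("vllm")
    assert is_vllm() and is_plugin_mode() and not is_sglang()

def test_invalid_framework_rejected():
    with pytest.raises(Exception):
        _set_framework_backbone("not-a-framework")
```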
GPU Tests — Concept Only (For Follow-Up Development)
The following GPU test levels are proposed but not yet implemented — they require actual GPU hardware and full vLLM + AITER stack:
| Level | Description | GPU | Est. Time | Priority |
|---|---|---|---|---|
| L1: Plugin wiring | Decorator application, method injection, plugin discovery | 1× MI355 | ~5 min | P0 |
| L2: Kernel dispatch | Verify correct attention kernel selected per config (fusion/triton/asm paths, sliding window, FP8 vs BF16) | 1× MI355 | ~15 min | P1 |
| L3: E2E correctness | Plugin mode vs server mode output consistency, accuracy (gsm8k), multi-turn | 8× MI355 | ~30 min | P1 |
| L4: Perf regression | Throughput comparison plugin vs server mode (>= 95% baseline) | 8× MI355 | ~60 min | P2 (nightly) |
Positive Aspects
- Leverages vLLM's official OOT mechanism — zero upstream code changes needed
- Sound attention abstraction hierarchy (BaseAttention → PagedAttention / RadixAttention)
- 6-20% performance uplift backed by benchmark data
- CI passing
- Recipe documentation and RFC provided
- Good responsiveness to review feedback — multiple issues from earlier reviews already addressed
Updated after double-checking all findings against the latest commit on this branch. CPU tests validated locally — 23/23 passing.
Thank you for the comments. Let me fix them and provide feedback:
framework Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Pull request overview
Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.
Thank you @valarLip
This PR makes ATOM work with vLLM and SGLang, keeping the out-of-the-box experience of popular frameworks while providing ATOM's optimizations.
For vLLM, this PR uses the official vLLM out-of-tree mechanism and has ATOM provide the platform, model, and attention backend to vLLM. Here is the design diagram and performance snapshot. Compared to vLLM, vLLM+ATOM shows a 6-20% performance uplift.


Here is the RFC:
For SGLang, this PR uses the official model impl backend mechanism. Here is the design diagram.

For attention, this PR constructs BaseAttention and makes paged attention/radix attention inherit from this base class. The implementation details of ATOM server mode and plugin mode have been moved into PagedAttentionImpl.
