
[Feat][Plugin] Enable MTP for vLLM Plugin #557

Draft

whx-sjtu wants to merge 10 commits into main from whx-sjtu/atom-support-vllm-glm5-mtp

Conversation

@whx-sjtu
Contributor

whx-sjtu commented Apr 14, 2026

Motivation

This PR enables the MTP (multi-token prediction) feature for running DeepSeek V3 and GLM5 with vLLM + atom.

Technical Details

  1. Fix atom_config-related bugs.
  2. Fix the wrong full_cls_name of the different MLA sparse attention backends.
  3. Register the model architecture and model class for the DeepSeek V3 and GLM5 MTP draft models (a rough registration sketch follows this list).
  4. Add an index_buffer for DeepseekMTP.
  5. Adapt the full-graph mode of the main model to work with MTP enabled.
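As a rough sketch of what the registration in item 3 amounts to in a vLLM plugin (illustrative only: the architecture strings and module paths below are placeholders, not necessarily the names used in this PR):

```python
# Illustrative sketch of out-of-tree MTP model registration with vLLM.
# Architecture names and module paths here are assumptions for this example.
from vllm import ModelRegistry


def register_mtp_models() -> None:
    # Lazy "module:Class" registration avoids importing the model classes
    # at plugin load time.
    ModelRegistry.register_model(
        "DeepSeekMTPModel",
        "atom.plugin.vllm.models.deepseek_mtp:DeepseekMTP",
    )
    ModelRegistry.register_model(
        "Glm5MTPModel",
        "atom.plugin.vllm.models.glm5_mtp:Glm5MTP",
    )
```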

Test Plan

Coming soon.

Test Result

  1. zai-org/GLM-5.1-FP8

Accuracy test commands:

lm_eval --model local-completions \
        --model_args model=/home/models/GLM-5.1-FP8,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3 \
        --tasks gsm8k \
        --num_fewshot 20

Accuracy test result with mtp=3:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|↑  |0.9454|±  |0.0063|
|     |       |strict-match    |    20|exact_match|↑  |0.9462|±  |0.0062|

  2. deepseek-ai/DeepSeek-R1-0528

Accuracy test commands:

lm_eval --model local-completions \
        --model_args model=/home/models/DeepSeek-R1-0528,base_url=http://localhost:8000/v1/completions,num_concurrent=16,max_retries=3,tokenized_requests=False \
        --tasks gsm8k \
        --num_fewshot 3

Accuracy test result with mtp=3:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.9492|±  |0.0060|
|     |       |strict-match    |     3|exact_match|↑  |0.9469|±  |0.0062|

Submission Checklist

whx-sjtu marked this pull request as ready for review on April 14, 2026 14:52
whx-sjtu changed the title from "[Feat][Plugin] Enable spec decoding for GLM5 in atom (vLLM Plugin)" to "[Feat][Plugin] Enable spec decoding for GLM5 (vLLM Plugin)" on April 14, 2026
whx-sjtu force-pushed the whx-sjtu/atom-support-vllm-glm5-mtp branch from 17446a6 to 9015568 on April 15, 2026 03:37
@wuhuikx
Collaborator

wuhuikx commented Apr 15, 2026

Could you please help attach the accuracy test results on gsm8k? Do we support MTP=1 or MTP=1/2/3? How about the acceptance ratio?

wuhuikx marked this pull request as draft on April 15, 2026 09:16
@wuhuikx
Collaborator

wuhuikx commented Apr 15, 2026

I will turn this PR into a draft and run it through CI after the code review is done.

@whx-sjtu
Contributor Author

> Could you please help attach the accuracy test results on gsm8k? Do we support MTP=1 or MTP=1/2/3? How about the acceptance ratio?

Sure, I will attach the accuracy results later. We currently support MTP=1/2/3, but the acceptance rate is low (about 20% for the first draft token and 0 for the other tokens) and I'm working on it.

whx-sjtu changed the title from "[Feat][Plugin] Enable spec decoding for GLM5 (vLLM Plugin)" to "[Feat][Plugin] Enable spec decoding for vLLM Plugin" on April 17, 2026
whx-sjtu changed the title from "[Feat][Plugin] Enable spec decoding for vLLM Plugin" to "[Feat][Plugin] Enable MTP for vLLM Plugin" on April 21, 2026
Comment thread atom/plugin/vllm/model_wrapper.py Outdated
self.model_arch = model_arch
# if self.forced_model_arch is not None:
#     model_arch = self.forced_model_arch
#     logger.info(f"Using forced model arch: {model_arch} for vLLM plugin mode")
Collaborator

should be removed?

Contributor Author

done

Comment thread atom/config.py Outdated
# can coexist in one process. Resolve per-forward config first to avoid
# reading a stale global singleton.
if not is_vllm():
    return None
Collaborator

There should be an assertion here to prevent a non-vllm backend from calling this method; it should only be called by the atom-vllm backend.
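Something along these lines, for example (just a sketch; I'm assuming the check sits at the top of the helper that reads the forward context):

```python
# Fail loudly if a non-vllm backend reaches this code path instead of
# silently returning None.
assert is_vllm(), (
    "atom config lookup via the vLLM forward context is only supported "
    "by the atom-vllm backend"
)
```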

Contributor Author

done

Comment thread atom/config.py


def get_current_atom_config() -> Config:
    forward_atom_config = _get_current_atom_config_from_vllm_forward_context()
Collaborator

This may be a little risky. If forward_atom_config is None and there is no assertion, it silently falls back to the global singleton _current_atom_config. Can we add some logging here, or make it safer?
Ideally, the lifecycle and ownership of forward_atom_config belong to the model itself: the main model gets its own atom config and the draft model gets its own. But if the draft model cannot get its config, it falls back to _current_atom_config, which may not be correct.

Contributor Author

Do you mean that we should always obtain forward_atom_config from vllm_forward_context? Are there scenarios where we need to return the default global _current_atom_config?

Collaborator

Let's add a warning here for the case where there is no atom config in the forward context and the default global config is returned instead. With this warning, we can mitigate consistency issues between the local value and its global twin.
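For example, something like this (a sketch only; the helper names are the ones visible in the diff above, everything beyond them is assumed):

```python
import logging

logger = logging.getLogger(__name__)


def get_current_atom_config() -> Config:
    # Prefer the per-forward config from vLLM's forward context so that the
    # main model and the MTP draft model each resolve the config they own.
    forward_atom_config = _get_current_atom_config_from_vllm_forward_context()
    if forward_atom_config is not None:
        return forward_atom_config
    # No atom config was found in the forward context; fall back to the
    # global singleton and warn, since for the draft model this may not be
    # the config it actually owns.
    logger.warning(
        "No atom config in the vLLM forward context; falling back to the "
        "global _current_atom_config."
    )
    return _current_atom_config
```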

Comment thread atom/plugin/vllm/model_wrapper.py Outdated
position_offset = getattr(self.model, "vllm_draft_position_offset", 0)
if position_offset == 0:
    return positions
return positions + position_offset
Collaborator

Can we leave some comments here explaining the position offset?

Contributor Author

not needed anymore. removed.

wuhuikx marked this pull request as ready for review on April 22, 2026 12:39
wuhuikx marked this pull request as draft on April 22, 2026 12:40
ganyi1996ppo previously approved these changes on April 23, 2026
whx-sjtu force-pushed the whx-sjtu/atom-support-vllm-glm5-mtp branch from 2c1db99 to 90aa06b on April 23, 2026 10:49
