Support Mimo-v2.5-Pro #654

Draft
wufann wants to merge 2 commits into ROCm:main from wufann:mimov2pro

Conversation

wufann (Contributor) commented Apr 28, 2026

Motivation

Support Mimo-v2.5-Pro

Technical Details

  1. Fused QKV Loading Hook (Core Technical Change)

Mimo-V2.5-Pro checkpoints store a single fused qkv_proj weight in TP-interleaved layout, while Flash checkpoints use separate q_proj/k_proj/v_proj. To avoid modifying the QKVParallelLinear.weight_loader interface, a model-level hook mechanism
was introduced:

  • loader.py: Before the packed_modules_mapping loop in the weight loading iteration, the loader calls model.load_fused_qkv_hook(). If the hook returns True, the weight is considered handled and the rest of the loading logic is skipped
    via continue.
  • mimo_v2.py / mimo_v2_mtp.py: Both model classes implement load_fused_qkv_hook — when the weight name contains qkv_proj and exists in params_dict, it chunks the weight tensor by TP rank and writes it directly, bypassing the
    shard-based split logic.

Auto-adaptation: Flash checkpoint weight names are q_proj/k_proj/v_proj → the hook never fires → normal packed_modules_mapping path. Mimo-V2.5-Pro checkpoint weight names are qkv_proj → the hook intercepts and handles them directly. No model-type branching is needed. ref: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/mimo_v2.py#L1106
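The hook mechanism above can be sketched as follows. This is a minimal illustration, not the actual implementation: the method name `load_fused_qkv_hook` and the `params_dict` lookup follow the PR description, but the class name, TP coordinates, layer name, and list-based "tensors" are hypothetical stand-ins.

```python
# Hypothetical tensor-parallel coordinates for the sketch.
TP_RANK, TP_SIZE = 0, 4

class MiMoV2ModelSketch:
    def __init__(self):
        # One fused qkv_proj parameter; the real model holds torch tensors.
        self.params_dict = {"layers.0.self_attn.qkv_proj.weight": None}

    def load_fused_qkv_hook(self, name, loaded_weight):
        """Return True if the weight was handled; the loader then continues."""
        if "qkv_proj" not in name or name not in self.params_dict:
            return False  # fall through to the normal packed_modules_mapping path
        # Chunk the TP-interleaved fused weight by rank and write it directly,
        # bypassing the shard-based q/k/v split logic.
        chunk = len(loaded_weight) // TP_SIZE
        self.params_dict[name] = loaded_weight[TP_RANK * chunk:(TP_RANK + 1) * chunk]
        return True

def load_weights(model, weights):
    """Sketch of the loader.py side: the hook runs before the packed-modules loop."""
    handled_normally = []
    for name, w in weights:
        if model.load_fused_qkv_hook(name, w):
            continue  # handled by the hook; skip the rest of the loading logic
        handled_normally.append(name)  # normal packed_modules_mapping path (elided)
    return handled_normally

model = MiMoV2ModelSketch()
```

Because Flash checkpoints carry q_proj/k_proj/v_proj names, the hook returns False for them and the normal path runs unchanged; only the fused qkv_proj names of Mimo-V2.5-Pro are intercepted.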

  2. File & Class Renaming (Unified Naming)
  • mimo_v2_flash.py → mimo_v2.py
  • mimo_v2_flash_mtp.py → mimo_v2_mtp.py

All class names drop the Flash suffix: MiMoV2FlashForCausalLM → MiMoV2ForCausalLM, MiMoV2FlashMTP → MiMoV2MTP, MiMoV2FlashDecoderLayer → MiMoV2DecoderLayer, etc.

  3. Model Registration (Flash + Pro Compatibility)
  • model_runner.py: Added "MiMoV2ForCausalLM" architecture entry (used by Mimo-V2.5-Pro weights), kept "MiMoV2FlashForCausalLM" for backward compatibility. is_mimo_v2() now recognizes both "mimo_v2" and "mimo_v2_flash" model types.
  • eagle.py: Added "MiMoV2MTPModel" MTP architecture entry alongside the existing "MiMoV2FlashMTPModel".
  • config.py: Added "mimo_v2" → "mimo_v2_mtp" to _MTP_TYPE_MAP. Changed MTP config override check from "mimo_v2_flash_mtp" to "mimo_v2_mtp".
  4. max_position_embeddings Adaptation

Mimo-V2.5-Pro's HF config uses context_len instead of max_position_embeddings. Both model files now use getattr(config, "context_len", None) or getattr(config, "max_position_embeddings", 32768), preferring context_len when present.
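The fallback chain reads as follows; the wrapper function name and the example config values are illustrative, but the `getattr` chain mirrors the one described above.

```python
from types import SimpleNamespace

def get_max_position_embeddings(config):
    # Prefer context_len (Mimo-V2.5-Pro HF config), then fall back to
    # max_position_embeddings (Flash), then the 32768 default.
    return (getattr(config, "context_len", None)
            or getattr(config, "max_position_embeddings", 32768))

# Hypothetical configs standing in for the two checkpoint families.
pro_cfg = SimpleNamespace(context_len=65536)
flash_cfg = SimpleNamespace(max_position_embeddings=131072)
```

Note that a `context_len` of 0 or None falls through to the next option, since `or` skips falsy values.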

Test Plan

TP8 + FP8KV:

```shell
python -m atom.entrypoints.openai_server --model /data/MiMo-V2.5-Pro -tp 8 --trust-remote-code --kv_cache_dtype fp8
```

TP8 + FP8KV + MTP1:

```shell
python -m atom.entrypoints.openai_server --model /data/MiMo-V2.5-Pro -tp 8 --trust-remote-code --kv_cache_dtype fp8 --method mtp
```

Serving benchmark and accuracy evaluation:

```shell
python -m atom.benchmarks.benchmark_serving \
  --model=/data/MiMo-V2.5-Pro --backend=vllm --base-url=http://localhost:8000 \
  --dataset-name=random \
  --random-input-len=1024 --random-output-len=1024 \
  --random-range-ratio=0.8 \
  --num-prompts=1280 --max-concurrency=128 \
  --request-rate=inf --ignore-eos \
  --save-result --percentile-metrics="ttft,tpot,itl,e2el"
```

```shell
lm_eval --model local-completions \
  --model_args model=/data/MiMo-V2.5-Pro,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k --num_fewshot 5
```

Test Result

Acc
TP8 + FP8KV

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9386 | ± 0.0066 |
|       |         | strict-match     | 5      | exact_match | 0.9348 | ± 0.0068 |

TP4 + FP8KV + MTP1

| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9401 | ± 0.0065 |
|       |         | strict-match     | 5      | exact_match | 0.9386 | ± 0.0066 |

Perf
TP8 + FP8KV

```
============ Serving Benchmark Result ============
Successful requests:                     1280
Benchmark duration (s):                  378.80
Total input tokens:                      1180188
Total generated tokens:                  1177601
Request throughput (req/s):              3.38
Output token throughput (tok/s):         3108.74
Total Token throughput (tok/s):          6224.30
---------------Time to First Token----------------
Mean TTFT (ms):                          312.11
Median TTFT (ms):                        127.69
P99 TTFT (ms):                           3232.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          39.85
Median TPOT (ms):                        40.38
P99 TPOT (ms):                           42.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.83
Median ITL (ms):                         34.22
P99 ITL (ms):                            121.85
----------------End-to-end Latency----------------
Mean E2EL (ms):                          36953.47
Median E2EL (ms):                        36939.84
P99 E2EL (ms):                           42933.85
==================================================
```
TP8 + FP8KV + MTP1
```
============ Serving Benchmark Result ============
Successful requests:                     1280
Benchmark duration (s):                  344.32
Total input tokens:                      1180188
Total generated tokens:                  1175988
Request throughput (req/s):              3.72
Output token throughput (tok/s):         3415.39
Total Token throughput (tok/s):          6842.98
---------------Time to First Token----------------
Mean TTFT (ms):                          339.56
Median TTFT (ms):                        135.65
P99 TTFT (ms):                           3382.80
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.61
Median TPOT (ms):                        33.81
P99 TPOT (ms):                           49.84
---------------Inter-token Latency----------------
Mean ITL (ms):                           48.19
Median ITL (ms):                         40.22
P99 ITL (ms):                            117.61
----------------End-to-end Latency----------------
Mean E2EL (ms):                          33013.86
Median E2EL (ms):                        31428.81
P99 E2EL (ms):                           49556.51
==================================================
```

cc: @billishyahao

@wufann wufann marked this pull request as draft April 28, 2026 06:53