[Reverted][RL] Support Rollout Routing Replay #5321

Merged
Jiang-Jia-Jun merged 22 commits into PaddlePaddle:develop from gongshaotian:dev_r3
Dec 5, 2025

Conversation

@gongshaotian
Collaborator

@gongshaotian gongshaotian commented Dec 1, 2025

Reverted.
New PR: #5405

@gongshaotian gongshaotian marked this pull request as ready for review December 1, 2025 14:28
@paddle-bot

paddle-bot bot commented Dec 1, 2025

Thanks for your contribution!

@codecov-commenter

codecov-commenter commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 60.24590% with 97 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@b5a7abe). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...model_executor/layers/moe/routing_indices_cache.py | 56.75% | 59 Missing and 5 partials ⚠️ |
| ...el_executor/layers/moe/fused_moe_triton_backend.py | 11.11% | 6 Missing and 2 partials ⚠️ |
| ...del_executor/layers/moe/fused_moe_wint2_backend.py | 0.00% | 5 Missing ⚠️ |
| fastdeploy/model_executor/layers/moe/moe.py | 76.19% | 4 Missing and 1 partial ⚠️ |
| ...l_executor/layers/moe/fused_moe_cutlass_backend.py | 42.85% | 4 Missing ⚠️ |
| fastdeploy/engine/args_utils.py | 70.00% | 2 Missing and 1 partial ⚠️ |
| ...el_executor/layers/moe/fused_moe_marlin_backend.py | 0.00% | 3 Missing ⚠️ |
| ...odel_executor/layers/moe/fused_moe_backend_base.py | 50.00% | 2 Missing ⚠️ |
| ..._executor/layers/moe/fused_moe_deepgemm_backend.py | 0.00% | 2 Missing ⚠️ |
| fastdeploy/worker/gpu_model_runner.py | 94.11% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5321   +/-   ##
==========================================
  Coverage           ?   59.05%           
==========================================
  Files              ?      326           
  Lines              ?    40464           
  Branches           ?     6131           
==========================================
  Hits               ?    23895           
  Misses             ?    14726           
  Partials           ?     1843           
Flag Coverage Δ
GPU 59.05% <60.24%> (?)

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a Rollout Routing Replay (R3) feature for RL training tasks, enabling the recording of routing information during inference and its direct utilization in the training process to address consistency issues between training and inference in MOE models.

Key Changes

  • Adds RoutingReplayManager class for request-level routing table management within FastDeploy
  • Adds RoutingStore abstraction with local file system implementation (RDMA implementation is work-in-progress)
  • Integrates routing replay into MOE layers and all quantization backends via hook functions
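
The design summarized in these bullets can be sketched in plain Python. This is a minimal illustration under stated assumptions, not FastDeploy's implementation: `LocalRoutingStore`, `RoutingRecorder`, `make_hook`, and the JSON file layout are hypothetical names invented for the example. Only the general idea comes from the PR description: each MoE layer calls a hook with its top-k expert ids, a request-level table collects them, and a store backed by the local file system persists the table.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical type for illustration: one list of expert ids per token.
TopkIds = List[List[int]]


class LocalRoutingStore:
    """Persists one request's routing table as a JSON file on local disk."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, request_id: str, table: Dict[int, TopkIds]) -> None:
        # JSON object keys must be strings, so layer indices are stringified.
        payload = {str(layer): ids for layer, ids in table.items()}
        (self.root / f"{request_id}.json").write_text(json.dumps(payload))

    def get(self, request_id: str) -> Dict[int, TopkIds]:
        payload = json.loads((self.root / f"{request_id}.json").read_text())
        return {int(layer): ids for layer, ids in payload.items()}


class RoutingRecorder:
    """Collects top-k expert ids per layer for a single request."""

    def __init__(self, request_id: str, store: LocalRoutingStore) -> None:
        self.request_id = request_id
        self.store = store
        self.table: Dict[int, TopkIds] = {}

    def make_hook(self, layer_idx: int) -> Callable[[TopkIds], None]:
        # An MoE layer would call this hook with the ids chosen by its gate.
        def hook(topk_ids: TopkIds) -> None:
            self.table.setdefault(layer_idx, []).extend(topk_ids)

        return hook

    def flush(self) -> None:
        self.store.put(self.request_id, self.table)


def run_demo() -> Dict[int, TopkIds]:
    with tempfile.TemporaryDirectory() as tmp:
        store = LocalRoutingStore(tmp)
        recorder = RoutingRecorder("req-0", store)
        hook = recorder.make_hook(layer_idx=3)
        hook([[1, 5], [0, 2]])  # prefill step: two tokens, top-2 experts each
        hook([[4, 7]])  # decode step: one more token
        recorder.flush()
        return store.get("req-0")
```

Keeping the store behind a small `put`/`get` interface is what would let a different transport (the PR mentions a work-in-progress RDMA implementation) replace the file-system version without touching the recording side.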

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| fastdeploy/config.py | Adds RoutingReplayConfig class with configuration options for routing replay feature |
| fastdeploy/engine/args_utils.py | Adds CLI argument parsing and config creation for routing replay |
| fastdeploy/engine/engine.py | Passes routing replay config to worker service |
| fastdeploy/worker/worker_process.py | Adds routing replay config argument and initializes it in FDConfig |
| fastdeploy/worker/gpu_model_runner.py | Integrates routing replay manager with model runner, adds is_chunk_step tracking |
| fastdeploy/model_executor/forward_meta.py | Adds routing_replay_table field to ForwardMeta for passing routing table through model layers |
| fastdeploy/model_executor/layers/moe/routing_indices_cache.py | Core implementation: RoutingReplayManager, RoutingStore classes, and Triton kernel for saving routing data |
| fastdeploy/model_executor/layers/moe/moe.py | Adds forward_meta parameter and topk_ids_hookfunc to enable routing capture in MOE layers |
| fastdeploy/model_executor/layers/moe/*.py | Updates all MOE backend implementations to support routing replay hook function |
| fastdeploy/model_executor/models/glm4_moe.py | Passes forward_meta to MOE layer forward calls |
| fastdeploy/rl/rollout_config.py | Adds routing_replay_config parameter to rollout config |
Comments suppressed due to low confidence (1)

fastdeploy/config.py:1491

  • Unnecessary 'pass' statement.
        pass

"--routing_replay_config",
type=json.loads,
default=None,
help="Configation of Rollout Routing Replay.",

Copilot AI Dec 4, 2025

Typo in help text: 'Configation' should be 'Configuration'.

Suggested change
- help="Configation of Rollout Routing Replay.",
+ help="Configuration of Rollout Routing Replay.",
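
For context, the argument under review passes `json.loads` as the argparse `type`, so the flag's value arrives as a parsed Python object rather than a string. The sketch below shows that behavior with the standard library only. The JSON keys `enable` and `store` are hypothetical and chosen for illustration; the PR excerpt does not show which keys the configuration actually accepts.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--routing_replay_config",
        type=json.loads,  # argparse calls this on the raw string value
        default=None,  # a None default is passed through unchanged
        help="Configuration of Rollout Routing Replay.",
    )
    return parser


parser = build_parser()

# Hypothetical keys, for illustration only.
args = parser.parse_args(
    ["--routing_replay_config", '{"enable": true, "store": "local"}']
)
config = args.routing_replay_config

# Omitting the flag leaves the attribute at its default.
default_config = parser.parse_args([]).routing_replay_config
```

If the value is not valid JSON, `json.loads` raises a `ValueError` subclass, which argparse turns into a usage error and exits.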

gongshaotian and others added 4 commits December 4, 2025 20:31
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- def forward_chunked_moe(self, x: paddle.Tensor, gate: nn.Layer, forward_meta: ForwardMeta):
+ def forward_chunked_moe(
+     self, x: paddle.Tensor, gate: nn.Layer, forward_meta: ForwardMeta, topk_ids_hookfunc: Callable = None
Collaborator Author

Add a default value for forward_meta.
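
One way to read this request is sketched below. It is a simplified stand-in, not the FastDeploy code: `ForwardMeta` is reduced to a hypothetical dataclass with a single field, the `self` and `gate` parameters are dropped, tensors are replaced by plain lists, and the gate's choice is hard-coded. The point is only the signature: giving `forward_meta` a `None` default lets existing callers that pass neither argument keep working.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ForwardMeta:
    """Hypothetical stand-in; the real class carries many more fields."""

    routing_replay_table: Optional[List[List[int]]] = None


def forward_chunked_moe(
    x: List[float],
    forward_meta: Optional[ForwardMeta] = None,
    topk_ids_hookfunc: Optional[Callable[[List[int]], None]] = None,
) -> List[float]:
    # Stand-in for the gate's top-k selection.
    topk_ids = [0, 1]
    if topk_ids_hookfunc is not None:
        topk_ids_hookfunc(topk_ids)
    # Record routing only when a table was actually supplied.
    if forward_meta is not None and forward_meta.routing_replay_table is not None:
        forward_meta.routing_replay_table.append(topk_ids)
    return x


# Caller that predates routing replay: no extra arguments needed.
plain = forward_chunked_moe([1.0, 2.0])

# Caller that opts in to recording.
meta = ForwardMeta(routing_replay_table=[])
seen: List[List[int]] = []
recorded = forward_chunked_moe([3.0], forward_meta=meta, topk_ids_hookfunc=seen.append)
```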

yuanlehome previously approved these changes Dec 5, 2025
Collaborator

@yuanlehome yuanlehome left a comment

LGTM

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 96d2d48 into PaddlePaddle:develop Dec 5, 2025
12 of 18 checks passed
Jiang-Jia-Jun added a commit that referenced this pull request Dec 5, 2025
Jiang-Jia-Jun added a commit that referenced this pull request Dec 5, 2025
gongshaotian added a commit that referenced this pull request Dec 5, 2025
@gongshaotian gongshaotian changed the title [RL] Support Rollout Routing Replay [Reverted][RL] Support Rollout Routing Replay Dec 5, 2025
EmmonsCurse pushed a commit that referenced this pull request Dec 5, 2025
* [RL] Support Rollout Routing Replay

* add routing indices cache

* fix config bug and moe forward bug

* R3 Support GLM

* support eb4.5

* fix merge bug

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add routing replay ci

* support glm topk

* support orther top_k

* fix ci bug

* pre-commit

* only support chatcmpl

* Revert "Revert "[RL] Support Rollout Routing Replay (#5321)" (#5402)"

This reverts commit c45e064.

* Fix XPU and NPU bug

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yuanle Liu <yuanlehome@163.com>
sunlei1024 pushed a commit that referenced this pull request Dec 6, 2025
* [RL] Support Rollout Routing Replay

* add routing indices cache

* fix config bug and moe forward bug

* R3 Support GLM

* support eb4.5

* fix merge bug

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add routing replay ci

* support glm topk

* support orther top_k

* fix ci bug

* pre-commit

* only support chatcmpl

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yuanle Liu <yuanlehome@163.com>
sunlei1024 pushed a commit that referenced this pull request Dec 6, 2025
sunlei1024 pushed a commit that referenced this pull request Dec 6, 2025
Jiang-Jia-Jun pushed a commit that referenced this pull request Dec 8, 2025
* [RL] Support Rollout Routing Replay

* add routing indices cache

* fix config bug and moe forward bug

* R3 Support GLM

* support eb4.5

* fix merge bug

* Apply suggestion from @Copilot



* Apply suggestion from @Copilot



* Apply suggestion from @Copilot



* Apply suggestion from @Copilot



* add routing replay ci

* support glm topk

* support orther top_k

* fix ci bug

* pre-commit

* only support chatcmpl

* Revert "Revert "[RL] Support Rollout Routing Replay (#5321)" (#5402)"

This reverts commit c45e064.

* Fix XPU and NPU bug

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yuanle Liu <yuanlehome@163.com>
Jiang-Jia-Jun added a commit that referenced this pull request Dec 8, 2025
… tools (#5418)

* feat(fmq): add ZMQ-based FMQ implementation and benchmark tools

* move FMQ_CONFIG_JSON to envs

* fix top_p_candidates (#5400)

Co-authored-by: freeliuzc <lzc842650834@gmail.com>

* [RL] Support Rollout Routing Replay (#5321)

* [RL] Support Rollout Routing Replay

* add routing indices cache

* fix config bug and moe forward bug

* R3 Support GLM

* support eb4.5

* fix merge bug

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add routing replay ci

* support glm topk

* support orther top_k

* fix ci bug

* pre-commit

* only support chatcmpl

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yuanle Liu <yuanlehome@163.com>

* [Bug fix] Fix the multi-input accuracy issue in the pooling model. (#5374)

* fix multi-inputs

* fix threshold

* fix threshold

* fix

* [BugFix]remove _execute_empty_input (#5396)

* Revert "[RL] Support Rollout Routing Replay (#5321)" (#5402)

This reverts commit 96d2d48.

* [New][RL] Support Rollout Routing Replay (#5405)

* [RL] Support Rollout Routing Replay

* add routing indices cache

* fix config bug and moe forward bug

* R3 Support GLM

* support eb4.5

* fix merge bug

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* add routing replay ci

* support glm topk

* support orther top_k

* fix ci bug

* pre-commit

* only support chatcmpl

* Revert "Revert "[RL] Support Rollout Routing Replay (#5321)" (#5402)"

This reverts commit c45e064.

* Fix XPU and NPU bug

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yuanle Liu <yuanlehome@163.com>

* bf16 deepseek (#5379)

* fix deepseek (#5410)

* Update tests/inter_communicator/test_fmq_factory.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update benchmarks/benchmark_fmq.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update fastdeploy/inter_communicator/fmq.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: GoldPancake <56388518+Deleter-D@users.noreply.github.com>
Co-authored-by: freeliuzc <lzc842650834@gmail.com>
Co-authored-by: RAM <gstian5555@outlook.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Yuanle Liu <yuanlehome@163.com>
Co-authored-by: lizexu123 <39205361+lizexu123@users.noreply.github.com>
Co-authored-by: 周周周 <39978853+zhoutianzi666@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Co-authored-by: bukejiyu <52310069+bukejiyu@users.noreply.github.com>
5 participants