[WIP] feat: add router replay for megatron engine #1207
TaoZex wants to merge 89 commits into inclusionAI:main
Conversation
Code Review
This pull request implements Router Replay (R3) to align Mixture-of-Experts (MoE) routing decisions between rollout inference and training, preventing performance degradation caused by weight staleness in RL. The changes include monkey-patches for Megatron-Core components, engine-level wrappers for micro-batch scheduling, and workflow integrations to propagate routing indices from SGLang. Feedback focuses on critical architectural issues regarding global state and thread safety, specifically the risks of patching class-level iterators and using global lists for router instances. Additionally, there are recommendations to fix potential data loss in uneven batch splitting and to optimize performance by removing GPU-CPU synchronization points in the data processing pipeline.
Metric Comparison: Router Replay (R3)

Using the moonlight_16b_a3b_gsm8k_grpo_megatron_h20.yaml configuration, compare the metric results with router replay (R3) enabled versus disabled:
[figure: metric curves with R3 enabled vs. disabled]
Description
This PR implements Rollout Routing Replay (R3) for MoE models, addressing training instability caused by inference-training routing discrepancy in asynchronous RL training. R3 records expert routing indices from the inference engine and replays them during training, ensuring consistent expert selection regardless of weight staleness.
Key Changes
Core MoE Patch (router_replay_patch.py):

- `RouterReplay` class (one per MoE layer) with `RECORD`/`REPLAY_FORWARD`/`REPLAY_BACKWARD` actions
- `patched_routing`: replaces `TopKRouter.routing`; uses `scores.gather(1, target_topk_idx)` in replay mode instead of `torch.topk`, preserving gradient flow (see the sketch after this list)
- Monkey-patched entry points: `TransformerConfig.__init__`, `TopKRouter.__init__`, `TopKRouter.routing`, `MoEAlltoAllTokenDispatcher.preprocess`
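A minimal sketch of the replay-mode selection, assuming `scores` is the `[num_tokens, num_experts]` router probability matrix and `target_topk_idx` holds the indices recorded at rollout time (function name and shapes are illustrative; the real patch lives in `router_replay_patch.py`):

```python
import torch

def replay_topk(scores: torch.Tensor, target_topk_idx: torch.Tensor):
    """Replay-mode expert selection: pick recorded indices instead of torch.topk.

    scores:          [num_tokens, num_experts] router probabilities (grad-tracked)
    target_topk_idx: [num_tokens, topk] int64 expert ids recorded during rollout
    """
    # gather keeps the autograd graph through `scores`, so the router weights
    # still receive gradients even though the expert selection itself is fixed.
    probs = scores.gather(1, target_topk_idx)
    return probs, target_topk_idx
```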
Data Distribution (router_replay_utils.py):

- `set_router_replay_data`: 4-step pipeline: right-pad→left-align → TP/SP scatter → PP layer slice → Dense/MoE mapping (first step sketched below)
- `RouterReplayHelper`: locates `RouterReplay` instances by `(pp_rank, vp_stage)`
- `get_num_layers_to_build`, `get_moe_num_layers_to_build` (PP/VP aware)
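A hedged sketch of the first pipeline step, assuming the routing indices arrive left-padded as `[batch, seq, topk]` with known per-row valid lengths (names and layout are assumptions, not the exact `set_router_replay_data` signature):

```python
import torch

def left_align(routed: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
    """Move each row's valid tokens to the front so padding sits on the right.

    routed:  [batch, seq, topk] left-padded routing indices
    lengths: [batch] number of valid tokens per row
    """
    out = torch.zeros_like(routed)
    seq = routed.size(1)
    for i, n in enumerate(lengths.tolist()):
        out[i, :n] = routed[i, seq - n:]  # valid tokens were right-aligned (left-padded)
    return out
```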
MegatronEngine Integration (megatron_engine_r3_patch.py):

- `forward_backward_batch`: retrieves `routed_experts` via side-channel, splits per micro-batch, injects replay setup via per-instance class swap, toggles forward/backward replay mode, cleans up in `finally` (control flow sketched below)
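An illustrative-only sketch of that control flow (the wrapper, `run_micro_batch`, and the batch layout are hypothetical; only the side-channel attribute name comes from this PR):

```python
def forward_backward_with_replay(engine, micro_batches, run_micro_batch):
    """Fetch routed_experts from the side-channel, split per micro-batch, clean up."""
    routed = getattr(engine, "_r3_pending_routed_experts", None)
    if routed is None:                       # R3 disabled: unchanged fast path
        return [run_micro_batch(mb, None) for mb in micro_batches]
    sizes = [mb["input_ids"].size(0) for mb in micro_batches]
    chunks = routed.split(sizes, dim=0)      # assumed layout: [batch, seq, layers, topk]
    try:
        return [run_micro_batch(mb, c) for mb, c in zip(micro_batches, chunks)]
    finally:
        engine._r3_pending_routed_experts = None  # never leak stale routing indices
```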
Actor & Workflow Integration (actor_r3_patch.py, rlvr_r3_patch.py):

- Slices `routed_experts` per mini-batch and delivers it via the engine side-channel (bypasses `pack_tensor_dict` 4D incompatibility)
- `resolve_r3_moe_config` auto-resolves `num_moe_layers`/`topk` from the HF config; `extract_routed_experts` converts SGLang numpy output to a left-padded torch tensor (sketched below)
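A hedged sketch of the numpy→tensor conversion, assuming SGLang returns one array of shape `[num_tokens, num_moe_layers, topk]` per sequence (the real `extract_routed_experts` may differ):

```python
import numpy as np
import torch

def extract_routed_experts(per_seq: list, max_len: int) -> torch.Tensor:
    """per_seq: one numpy array per sequence, shape [num_tokens, num_moe_layers, topk]."""
    layers, topk = per_seq[0].shape[1], per_seq[0].shape[2]
    out = torch.zeros(len(per_seq), max_len, layers, topk, dtype=torch.long)
    for i, arr in enumerate(per_seq):
        n = arr.shape[0]
        out[i, max_len - n:] = torch.from_numpy(np.ascontiguousarray(arr, dtype=np.int64))
    return out  # left-padded: valid tokens occupy the rightmost n positions
```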
SGLang Integration (sglang_r3_patch.py, sglang_remote.py):

- Serializes `routed_experts` as base64 in `TokenizerManager._handle_batch_output` (fixes `jsonable_encoder` silently flattening `torch.Tensor` to `{}` when `skip_tokenizer_init=True`); a round-trip sketch follows this list
- Validates `num_sgl_token` divisibility
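One plausible encoding for that workaround (the exact wire format used by `sglang_r3_patch` is an assumption; the point is that bytes survive `jsonable_encoder` where a raw `torch.Tensor` does not):

```python
import base64
import io
import torch

def encode_routed_experts(t: torch.Tensor) -> str:
    # Serialize the tensor to bytes, then base64 so it travels safely in JSON.
    buf = io.BytesIO()
    torch.save(t.cpu(), buf)
    return base64.b64encode(buf.getvalue()).decode("ascii")

def decode_routed_experts(s: str) -> torch.Tensor:
    return torch.load(io.BytesIO(base64.b64decode(s)))
```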
Orchestrator & Config (rl_trainer.py, cli_args.py):

- `return_routed_experts=True` → auto-sets `enable_router_replay`, resolves MoE config, forces `skip_tokenizer_init=True`, validates SGLang-only support (see the sketch below)

Supported Parallelism
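A hypothetical sketch of that auto-wiring (argument and HF-config attribute names such as `rollout_backend` and `num_experts_per_tok` are assumptions about this codebase):

```python
def apply_r3_config(args, hf_config):
    """Auto-wire R3 settings when the user opts in via return_routed_experts."""
    if not args.return_routed_experts:
        return                                  # R3 fully inactive by default
    if args.rollout_backend != "sglang":
        raise ValueError("return_routed_experts currently requires the SGLang backend")
    args.enable_router_replay = True
    args.skip_tokenizer_init = True             # needed for the base64 side-channel
    # Simplification: the real resolve_r3_moe_config also handles dense/MoE layer mixes.
    args.num_moe_layers = hf_config.num_hidden_layers
    args.moe_topk = hf_config.num_experts_per_tok
```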
- TP/SP: `scatter_to_sequence_parallel_region` + `seq_align_to` by `tp_size`
- PP: `get_current_rank_layer_info` slices per PP rank's MoE layers
- VPP: `vp_stage` in `RouterReplayHelper`
- CP: `seq_align_to = tp_size * cp_size * 2` when `cp_size > 1` (alignment rule sketched after this list)

New Metrics
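The alignment rule above, written out as a small sketch (helper names are hypothetical; the `2 * cp_size` factor matches Megatron-style context parallelism, which shards each sequence into `2 * cp_size` chunks for load balancing):

```python
def resolve_seq_align_to(tp_size: int, cp_size: int) -> int:
    # CP shards each sequence into 2*cp_size chunks, hence the extra factor of 2.
    return tp_size * cp_size * 2 if cp_size > 1 else tp_size

def padded_seq_len(seq_len: int, align: int) -> int:
    return (seq_len + align - 1) // align * align  # round up to the alignment
```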
- `rollout_train_logprobs_abs_diff_mean`: Mean absolute difference between rollout and training log-probs over response tokens. Reflects routing-inconsistency-induced policy deviation; R3 should reduce this to only weight-update drift.
- `rollout_train_logprobs_abs_diff_std`: Standard deviation of the above differences. High values indicate extreme outliers from completely inconsistent routing, the primary cause of PPO training collapse via disproportionate gradients.
Both computed in `torch.no_grad()` on detached tensors; negligible overhead (see the sketch below).
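A minimal sketch of how the two metrics can be computed (tensor names are assumed; only the metric definitions come from this PR):

```python
import torch

@torch.no_grad()
def logprob_diff_metrics(rollout_logprobs, train_logprobs, response_mask):
    diff = (rollout_logprobs.detach() - train_logprobs.detach()).abs()
    diff = diff[response_mask.bool()]           # restrict to response tokens
    return {
        "rollout_train_logprobs_abs_diff_mean": diff.mean().item(),
        "rollout_train_logprobs_abs_diff_std": diff.std().item(),
    }
```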
Related Paper

Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers (Ma et al., arXiv:2510.11370, 2025) proposes R3 to reduce training-inference policy KL divergence and prevent MoE RL training collapse.
Related Issue
Fixes #(issue)
Type of Change
Checklist
- `pre-commit run --all-files`
- `./docs/build_all.sh`
- `main` / `/review-pr` command / `/create-pr`

Breaking Change Details (if applicable):
N/A
Additional Context
- `return_routed_experts=False` (default) → all R3 code inactive, zero overhead
- Only SGLang rollout supports `return_routed_experts`; config validation raises an explicit error otherwise
- `routed_experts` delivered via `engine._r3_pending_routed_experts` to bypass `pack_tensor_dict` 4D incompatibility
- `sglang_r3_patch` must be installed on the inference server to fix `torch.Tensor` serialization when `skip_tokenizer_init=True`