
Conversation

@rainyfly (Collaborator) commented Oct 27, 2025

Motivation

Support EPLB (Expert Parallelism Load Balancer).

To keep the load balanced across the different experts in the MoE layers, the shared experts and heavily loaded fine-grained experts are replicated on different GPUs across the cluster, so that the GPUs can serve more of the hot traffic (the tokens routed to the shared experts).

EPLB keeps the load balanced across GPUs by replicating heavily loaded experts (the redundant-experts strategy) and heuristically adjusting the expert-to-GPU assignment. This addresses the waste of compute caused by uneven expert load under expert parallelism. The hierarchical load-balancing policy can also be used in the prefill phase, where the expert-parallel size is smaller.
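As a rough illustration of the redundant-experts idea, here is a minimal, hypothetical Python sketch that replicates the hottest logical experts into the spare physical slots and then places the replicas greedily across ranks. The function name and signature are invented for this example; only `logical_to_physical_map` mirrors a variable name seen in the hunk reviewed later in this thread, and this is not the actual FastDeploy/EPLB implementation.

```python
# Hypothetical sketch of the redundant-experts placement, not this PR's code.
import numpy as np

def build_logical_to_physical_map(expert_load, num_physical_experts, num_ranks):
    """Replicate the hottest logical experts into spare physical slots, then
    spread the physical replicas across ranks so per-rank load is even."""
    expert_load = np.asarray(expert_load, dtype=float)
    num_logical = len(expert_load)
    num_redundant = num_physical_experts - num_logical
    assert num_redundant >= 0 and num_physical_experts % num_ranks == 0

    # 1) Decide how many replicas each logical expert gets: one each, plus
    #    extra replicas for the experts with the highest per-replica load.
    replicas = np.ones(num_logical, dtype=int)
    for _ in range(num_redundant):
        replicas[np.argmax(expert_load / replicas)] += 1

    # 2) Greedily place physical replicas onto ranks, heaviest first,
    #    always choosing the currently least-loaded rank with a free slot.
    slots_per_rank = num_physical_experts // num_ranks
    rank_load = np.zeros(num_ranks)
    rank_slots = [[] for _ in range(num_ranks)]
    logical_to_physical_map = [[] for _ in range(num_logical)]

    physical = []  # (logical_id, per-replica load)
    for e in range(num_logical):
        physical += [(e, expert_load[e] / replicas[e])] * replicas[e]
    physical.sort(key=lambda x: -x[1])

    for logical_id, load in physical:
        candidates = [r for r in range(num_ranks) if len(rank_slots[r]) < slots_per_rank]
        r = min(candidates, key=lambda k: rank_load[k])
        physical_id = r * slots_per_rank + len(rank_slots[r])
        rank_load[r] += load
        rank_slots[r].append(logical_id)
        logical_to_physical_map[logical_id].append(physical_id)

    return logical_to_physical_map, rank_slots
```

For example, with 8 logical experts, 12 physical slots and 4 ranks, the four experts with the highest observed load each get a second replica, and the replicas are spread so that no rank ends up with all of the hot experts.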

paddle-bot (bot) commented Oct 27, 2025

Thanks for your contribution!

kevincheng2 previously approved these changes Oct 31, 2025
rank_expert_list, logical_to_physical_map, expert_count
)
# TO BE FIXED
self.worker.get_model().update_state_dict(state_dicts)
Collaborator:

Note: in a follow-up this should use copy_ to overwrite the original weights in place, rather than replacing the original Tensor objects.

Collaborator:

Got it, will change it in a follow-up.
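As a rough illustration of the reviewer's suggestion, here is a minimal, hypothetical sketch of an in-place weight update that overwrites existing parameter storage instead of rebinding new Tensor objects. The helper name and the use of Paddle's Tensor.copy_(value, blocking) are assumptions made for this example; only update_state_dict appears in the reviewed hunk, and this is not the code that was merged.

```python
# Hypothetical sketch: overwrite parameter storage in place with copy_,
# instead of replacing the Tensor objects, as the reviewer suggests.
# The helper name and exact Paddle APIs used here are assumptions.
import paddle

def update_state_dict_inplace(model, new_state_dict):
    current = model.state_dict()
    for name, new_value in new_state_dict.items():
        param = current[name]
        new_value = paddle.to_tensor(new_value)
        assert param.shape == new_value.shape, f"shape mismatch for {name}"
        # copy_ writes into the existing allocation, so anything that already
        # holds a reference to `param` (optimizer slots, captured graphs,
        # fused buffers) keeps seeing the updated weights.
        param.copy_(new_value, False)  # assuming a copy_(value, blocking) signature
```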

@yuanlehome (Collaborator) left a comment:

Please make the PR title and description more detailed.

@Jiang-Jia-Jun merged commit f83d0cf into PaddlePaddle:develop on Nov 3, 2025
24 of 28 checks passed
@kevincheng2 mentioned this pull request on Nov 3, 2025