@rainyfly rainyfly commented Oct 27, 2025

Motivation

Support TP-N for PD and make it robust.

  1. For TP-N, each TP shard in a P DP instance should send its cache to the corresponding TP shard in D. The previous implementation did not check for a failure in a single TP shard, which could affect correctness and stability. To solve this problem:
  • Add synchronization when adding a cache transfer task.
  • Add synchronization when checking the result across all TP shards.
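
To make the result check above concrete, here is a minimal sketch of the idea: a transfer counts as successful only once every TP rank has reported and all of them succeeded. The class `TransferResultAggregator` and its `report` method are hypothetical names chosen for illustration; they are not taken from this PR or from the FastDeploy code base.

```python
from dataclasses import dataclass, field


@dataclass
class TransferResultAggregator:
    """Collects per-TP-rank cache transfer results (hypothetical helper)."""

    tp_size: int
    # request_id -> {tp_rank: success flag}
    results: dict = field(default_factory=dict)

    def report(self, request_id: str, tp_rank: int, ok: bool):
        """Record one rank's result for a request.

        Returns None while some ranks have not reported yet. Once every
        rank has reported, returns True only if all shards succeeded.
        """
        if not 0 <= tp_rank < self.tp_size:
            raise ValueError(f"tp_rank {tp_rank} out of range for tp_size {self.tp_size}")
        per_rank = self.results.setdefault(request_id, {})
        per_rank[tp_rank] = ok
        if len(per_rank) < self.tp_size:
            return None
        # All ranks reported: drop the bookkeeping and return the verdict.
        del self.results[request_id]
        return all(per_rank.values())
```

With `tp_size=2`, the first call to `report` for a request returns `None`, and the second returns the combined verdict, so a failure in one shard marks the whole transfer as failed.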

Because engine_worker_queue is used to deliver cache tasks and cache transfer results, TP results are synchronized in it. We therefore add several functions to support the following patterns:

  1. N senders deliver data -> 1 receiver receives data.
  2. 1 sender delivers data -> N receivers receive data.
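
The two patterns above can be sketched with a pair of small in-process buffers. This is only an illustration under stated assumptions: `FanInBuffer` and `FanOutBuffer` are hypothetical names, they use a plain `threading.Lock` rather than the cross-process machinery of engine_worker_queue, and they do not reflect the functions actually added in this PR.

```python
import threading


class FanInBuffer:
    """N producers -> 1 consumer (hypothetical): release only when all N delivered."""

    def __init__(self, num_producers: int):
        self.num_producers = num_producers
        self._lock = threading.Lock()
        self._items = {}  # producer_id -> item

    def put(self, producer_id: int, item) -> None:
        with self._lock:
            self._items[producer_id] = item

    def get_all(self):
        """Return items ordered by producer id, or None if any producer is missing."""
        with self._lock:
            if len(self._items) < self.num_producers:
                return None
            items = [self._items[i] for i in sorted(self._items)]
            self._items.clear()
            return items


class FanOutBuffer:
    """1 producer -> N consumers (hypothetical): keep the item until all N read it."""

    def __init__(self, num_consumers: int):
        self.num_consumers = num_consumers
        self._lock = threading.Lock()
        self._item = None
        self._pending = set()  # consumer ids that have not read the item yet

    def put(self, item) -> bool:
        with self._lock:
            if self._pending:
                return False  # previous item not yet read by every consumer
            self._item = item
            self._pending = set(range(self.num_consumers))
            return True

    def get(self, consumer_id: int):
        """Return the item once per consumer; None if nothing is pending for it."""
        with self._lock:
            if consumer_id not in self._pending:
                return None
            self._pending.discard(consumer_id)
            item = self._item
            if not self._pending:
                self._item = None  # last reader frees the slot
            return item
```

In the fan-in case the single receiver sees nothing until every sender has delivered; in the fan-out case the sender cannot overwrite the item until every receiver has read it, which keeps all TP ranks working on the same task.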

paddle-bot bot commented Oct 27, 2025

Thanks for your contribution!

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 25498ef into PaddlePaddle:develop Nov 3, 2025
12 of 14 checks passed