
Conversation

@zeroRains (Contributor) commented Sep 24, 2025

Problem description:
With PD disaggregation + cudagraph, stress-test inference produces abnormally long outputs for some queries. The over-long outputs cause a surge in the number of cache blocks in use, leading to frequent cache swap-in/swap-out, so the QPS with cudagraph enabled is significantly lower than without cudagraph.

Root cause:
In the PD-disaggregation scenario, the KV Cache exchange between the P (prefill) node and the D (decode) node is managed by cache_messager. Inside cache_messager, the signal prefilled_step_idx participates in the transfer logic, and this signal is incremented only during atten_backend initialization. During cudagraph capture, atten_backend is initialized multiple times, whereas without cudagraph there is no atten_backend initialization before real inference begins. As a result, the initial value of prefilled_step_idx differs between the cudagraph-enabled and cudagraph-disabled cases, which corrupts subsequent KV Cache transfers, produces diffs in the model's computed results, and causes the over-long outputs for some queries when cudagraph is enabled.
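The divergence described above can be illustrated with a minimal toy model (all names here are hypothetical stand-ins; the real FastDeploy signal lives in shared memory managed by cache_messager):

```python
# Hypothetical sketch of how prefilled_step_idx diverges between the
# cudagraph and non-cudagraph paths. Not the real FastDeploy code.

class AttnBackend:
    """Stand-in for the attention backend; its init bumps the shared signal."""
    def __init__(self, signal):
        signal["prefilled_step_idx"] += 1  # incremented on every init


def signal_at_serving_time(with_cudagraph, capture_batch_sizes=(1, 2, 4)):
    """Return the value of prefilled_step_idx when real inference starts."""
    signal = {"prefilled_step_idx": 0}
    if with_cudagraph:
        # Graph capture initializes the backend once per captured shape,
        # before any real request is served.
        for _ in capture_batch_sizes:
            AttnBackend(signal)
    # Without cudagraph, the first init happens only at real inference time,
    # so the counter is still 0 when serving starts.
    return signal["prefilled_step_idx"]


print(signal_at_serving_time(False), signal_at_serving_time(True))  # 0 3
```

Since the P and D nodes use this index to coordinate KV Cache transfers, starting from different values on the two paths desynchronizes the transfer logic.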

Solution:
After cudagraph capture completes, reset the prefilled_step_idx signal so that the cache_messager signal is identical whether cudagraph is enabled or not.
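The fix can be sketched as follows (a minimal illustration with hypothetical class and method names, not the actual FastDeploy implementation):

```python
# Hedged sketch of the fix: after all capture-time backend initializations,
# zero the signal so serving starts from the same state as the
# no-cudagraph path. All names here are hypothetical.

class ModelRunner:
    def __init__(self):
        self.signal = {"prefilled_step_idx": 0}

    def _init_attn_backend(self):
        # Each backend init increments the shared transfer-coordination signal.
        self.signal["prefilled_step_idx"] += 1

    def capture_cudagraphs(self, batch_sizes=(1, 2, 4)):
        for _ in batch_sizes:
            self._init_attn_backend()  # capture re-initializes the backend
        # The fix: reset after capture so the counter matches the
        # no-cudagraph path when real inference begins.
        self.signal["prefilled_step_idx"] = 0


runner = ModelRunner()
runner.capture_cudagraphs()
print(runner.signal["prefilled_step_idx"])  # 0, same as without cudagraph
```

Resetting after capture (rather than skipping the increment during capture) keeps the capture path itself untouched, which avoids perturbing the shapes and state recorded in the graph.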

Original PR: #4235
pcard-71500

@paddle-bot bot commented Sep 24, 2025

Thanks for your contribution!

@gongshaotian (Collaborator) commented:

Please also link the original PR here~

@gongshaotian gongshaotian (Collaborator) left a comment

LGTM

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 07db281 into PaddlePaddle:feature/experimental_feature_20250908 Oct 13, 2025
14 of 15 checks passed
@zeroRains zeroRains deleted the pd_0908 branch October 13, 2025 02:26