[Cherry-Pick][BugFix]fix the bug for prefilled_step_idx signal of cache_messager in cudagraph and PD #4252
Problem description:
With PD disaggregation plus cudagraph enabled, stress-testing inference showed some queries producing abnormally long outputs. The overlong results sharply increased the number of KV-cache blocks in use, causing frequent cache swap-in/swap-out, so QPS with cudagraph enabled was significantly lower than without it.
Root cause:
In the PD-disaggregation scenario, KV-cache exchange between the P (prefill) node and the D (decode) node is managed by cache_messager, whose transfer logic uses the prefilled_step_idx signal. This signal is only incremented when the attention backend (atten_backend) is initialized. During cudagraph capture the attention backend is initialized multiple times, whereas without cudagraph there is no such initialization before real inference begins. As a result, the initial value of prefilled_step_idx differs between the cudagraph and non-cudagraph paths, which corrupts subsequent KV-cache transfers; the model's computed results then diverge, and with cudagraph enabled some queries produce overlong outputs.
Solution:
After cudagraph capture completes, reset the prefilled_step_idx signal so that cache_messager's signal state is identical whether or not cudagraph is enabled.
Original PR: #4235
pcard-71500