[Cherry-Pick][BugFix]fix the bug for prefilled_step_idx signal of cache_messager in cudagraph and PD #4252
Problem description:
With PD disaggregation plus cudagraph enabled, stress-testing inference showed some queries producing abnormally long outputs. The overlong results sharply increased the number of KV-cache blocks in use, causing frequent cache swap-in/swap-out, so QPS with cudagraph enabled was significantly lower than without it.
Root cause:
In the PD-disaggregation scenario, KV-cache exchange between the P (prefill) node and the D (decode) node is managed by cache_messager, whose transfer logic uses the prefilled_step_idx signal. This signal is only incremented when the attention backend (atten_backend) is initialized. During cudagraph capture the attention backend is initialized multiple times, whereas without cudagraph there is no such initialization before real inference begins. As a result, the initial value of prefilled_step_idx differs between the cudagraph and non-cudagraph paths, which corrupts subsequent KV-cache transfers; the model's computed results then diverge, and with cudagraph enabled some queries produce overlong outputs.
Solution:
After cudagraph capture completes, reset the prefilled_step_idx signal so that cache_messager's signal state is identical whether or not cudagraph is enabled.
Original PR: #4235
pcard-71500