[Cherry-Pick] [BugFix] fix instability after clearing weight (#5493) #5487
Motivation
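The current event_loop_normal uses two status signals: model_weight_status, which is a SharedMemory, and model_weights_signal, which is a numpy array. In each loop iteration, the numpy array first reads the value from the shm, performs a broadcast, and then writes its own value back to the shm.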
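Meanwhile, the API server also writes to the shm: when it receives an /update_model_weight call, it sets the shm status to UPDATING. If the API server writes the shm while the worker is between reading the shm into the numpy array and writing it back, the numpy array still holds the CLEARED status it read and overwrites the UPDATING signal in the shm, which causes the weight update to time out.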
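The interleaving described above can be replayed deterministically in a single thread. This is only a sketch: the numeric status codes and the function name are placeholders chosen for illustration, and two plain numpy arrays stand in for the real SharedMemory and signal array.

```python
import numpy as np

# Placeholder status codes; the actual values used by the project are an assumption here.
CLEARED, UPDATING = -2, 1


def simulate_lost_update():
    """Replay the problematic ordering and return the final shared status."""
    shm = np.array([CLEARED], dtype=np.int32)   # stands in for model_weight_status
    local = np.zeros(1, dtype=np.int32)         # stands in for model_weights_signal

    # Worker, step 1: copy the shared value into the local array.
    local[0] = shm[0]
    # (the broadcast across ranks would happen here)

    # API server, interleaved: an /update_model_weight call arrives.
    shm[0] = UPDATING

    # Worker, step 2: write the now stale local value back.
    shm[0] = local[0]

    return int(shm[0])


final_status = simulate_lost_update()
print("UPDATING signal lost:", final_status == CLEARED)
```

The final status is CLEARED again, so a server waiting for the worker to act on UPDATING would never see progress.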
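If the worker sleeps after clearing the weights for as long as it detects that the shm value is CLEARED, the numpy array is guaranteed not to read or write the shm value while the parameters are offloaded. The worker unfreezes only when the API server receives the next update signal, so the next value the numpy array reads is the correct UPDATING signal rather than the CLEARED signal left over from the previous round.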
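A minimal sketch of that idea follows, assuming a polling loop. The function name worker_iteration, the poll_interval parameter, the injectable sleep callback, and the status codes are all hypothetical and are not taken from the actual patch.

```python
import time

import numpy as np

# Placeholder status codes; the actual values used by the project are an assumption here.
CLEARED, UPDATING = -2, 1


def worker_iteration(shm, local, sleep=time.sleep, poll_interval=0.01):
    """One loop iteration that parks while the shared status is CLEARED."""
    # While weights are cleared, only poll the shared value; never write it back.
    while shm[0] == CLEARED:
        sleep(poll_interval)

    local[0] = shm[0]      # read the shared status
    # (the broadcast across ranks would happen here)
    shm[0] = local[0]      # write the agreed status back
    return int(local[0])


# Usage: a fake sleep plays the API server and flips the status on the third poll.
shm = np.array([CLEARED], dtype=np.int32)
local = np.zeros(1, dtype=np.int32)
polls = {"count": 0}


def fake_sleep(_seconds):
    polls["count"] += 1
    if polls["count"] == 3:
        shm[0] = UPDATING  # simulated /update_model_weight call


status = worker_iteration(shm, local, sleep=fake_sleep)
print("worker resumed with UPDATING:", status == UPDATING)
```

Because the worker never copies CLEARED into the local array while it is parked, there is no stale value left to write over the server's UPDATING signal.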
Modifications
Usage or Command
Accuracy Tests
Checklist
Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Run pre-commit before commit.
For a PR to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.