During a GRPO experiment (Qwen2.5-7B, max-len=6144, 8*H100), I encountered OOM (Out of Memory) issues when waking up vLLM.
Additionally, I noticed that the latest vLLM release, 0.8.3, updated the wake-up API to accept a `tags` parameter (PR). As shown in the figure below, we can first wake up only the weights, then update the parameters, and finally load the KV cache, which reduces peak memory usage during the refit_policy_generation phase. This logic has already been implemented in veRL.
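A minimal sketch of the staged wake-up described above, assuming vLLM >= 0.8.3 with sleep mode enabled; the model name, sleep level, and weight-update step are illustrative placeholders, not the exact training-loop code:

```python
# Sketch of the tag-based staged wake-up available since vLLM 0.8.3.
from vllm import LLM

# enable_sleep_mode lets the engine offload/free GPU memory between rollouts.
llm = LLM(model="Qwen/Qwen2.5-7B", enable_sleep_mode=True)

# Put the engine to sleep during the training phase.
llm.sleep(level=2)

# ... run the training step, producing updated policy weights ...

# Stage 1: wake up only the weight buffers; the KV cache stays released,
# so the weight refit does not collide with the KV-cache allocation.
llm.wake_up(tags=["weights"])

# ... copy the updated policy weights into the engine here ...

# Stage 2: now allocate the KV cache and resume generation.
llm.wake_up(tags=["kv_cache"])
```

Splitting the wake-up this way keeps the transient weight-update buffers and the KV cache from being resident at the same time, which is what drives the OOM at max-len=6144 on 8*H100.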
