During a GRPO experiment (Qwen2.5-7B, max-len=6144, 8*H100), I encountered OOM (Out of Memory) issues when waking up vLLM.
Additionally, I noticed that the latest vLLM release, 0.8.3, updated the wake-up API to accept a `tags` parameter (PR). As shown in the figure below, we can first wake up only the weights, then update the parameters, and finally load the KV cache, which reduces peak memory usage during the refit_policy_generation phase. This logic has already been implemented in veRL.
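A minimal sketch of the staged wake-up described above, assuming vLLM >= 0.8.3 with sleep mode enabled; the model name, sleep level, and weight-update step are illustrative placeholders, not the exact training-loop code:

```python
# Sketch of the tag-based staged wake-up available since vLLM 0.8.3.
from vllm import LLM

# enable_sleep_mode lets the engine offload/free GPU memory between rollouts.
llm = LLM(model="Qwen/Qwen2.5-7B", enable_sleep_mode=True)

# Put the engine to sleep during the training phase.
llm.sleep(level=2)

# ... run the training step, producing updated policy weights ...

# Stage 1: wake up only the weight buffers; the KV cache stays released,
# so the weight refit does not collide with the KV-cache allocation.
llm.wake_up(tags=["weights"])

# ... copy the updated policy weights into the engine here ...

# Stage 2: now allocate the KV cache and resume generation.
llm.wake_up(tags=["kv_cache"])
```

Splitting the wake-up this way keeps the transient weight-update buffers and the KV cache from being resident at the same time, which is what drives the OOM at max-len=6144 on 8*H100.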
