vLLM native OpenAI compatible server with weight syncing #63
BjarniHaukur wants to merge 2 commits into PrimeIntellect-ai:main
Conversation
I commented out the data parallel tests. I'm baffled as to why they don't work here. I have an open PR to trl where they pass without fail (trl pr). If this is something you want to merge, I can take another look.
I ended up implementing my own dynamic batching server, which builds around LLM() but exposes an async endpoint. I never found AsyncLLM to be reliable for weight syncing in training runs, and the general guidance seems to be that it isn't fully supported without a custom build config/container (e.g. how it's handled in veRL and related libraries). At some point we'll migrate to AsyncLLM, but the current solution works well for now (perf tests are fairly close in my measurements) and offers a bit more control for error handling.
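For readers unfamiliar with the pattern, here is a minimal sketch of what such a dynamic batching server can look like: it wraps vLLM's synchronous `LLM()` engine, collects concurrent requests for a short window, and serves them through an async FastAPI endpoint. The class name, route, batching window, and model are illustrative assumptions, not the code referenced in this thread.

```python
# Minimal sketch (not this PR's implementation): dynamic batching around the
# synchronous vLLM LLM() engine, exposed through an async FastAPI endpoint.
import asyncio
from dataclasses import dataclass

from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams


@dataclass
class _Pending:
    prompt: str
    params: SamplingParams
    future: asyncio.Future


class BatchingEngine:
    """Collects concurrent requests for a short window and runs one LLM.generate call."""

    def __init__(self, model: str, window_s: float = 0.02):
        self.llm = LLM(model=model)  # synchronous vLLM engine
        self.window_s = window_s
        self.queue: asyncio.Queue[_Pending] = asyncio.Queue()

    async def run(self) -> None:
        while True:
            batch = [await self.queue.get()]
            # Give other coroutines a short window to enqueue more requests.
            await asyncio.sleep(self.window_s)
            while not self.queue.empty():
                batch.append(self.queue.get_nowait())
            # Run the blocking generate call in a worker thread so the event loop stays free.
            outputs = await asyncio.to_thread(
                self.llm.generate,
                [p.prompt for p in batch],
                [p.params for p in batch],
            )
            for pending, out in zip(batch, outputs):
                pending.future.set_result(out.outputs[0].text)

    async def generate(self, prompt: str, params: SamplingParams) -> str:
        pending = _Pending(prompt, params, asyncio.get_running_loop().create_future())
        await self.queue.put(pending)
        return await pending.future


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


app = FastAPI()
engine = BatchingEngine(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model


@app.on_event("startup")
async def start_engine() -> None:
    asyncio.create_task(engine.run())


@app.post("/generate")
async def generate(req: GenerateRequest) -> dict:
    text = await engine.generate(req.prompt, SamplingParams(max_tokens=req.max_tokens))
    return {"text": text}
```

Run with e.g. `uvicorn server:app` (assuming the file is named `server.py`); keeping the engine synchronous means errors surface as ordinary exceptions in one place rather than inside AsyncLLM's background loop.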
Out of curiosity, what kind of errors did you hit with AsyncLLM?
* Implement HTTPMonitor to send node status and training progress (PrimeIntellect-ai#17)
  * Implement HTTPMonitor to send node status and training progress to generic HTTP Server
  * Address PR comments
  * Track stage of training in monitor
  * Fix logger in http monitor
  * Make default monitor wandb
  * Send IP information to HTTPMonitor
  * Fix ruff issues
  * Separate metric logger and monitors
  * Minor bug fix
  * Revert metric_logger setup to initial impl
  * Update monitor config setup

  Co-authored-by: Sami Jaghouar <sami.jaghouar@hotmail.fr>

* Fix bug where HTTP Monitor wasn't handling async funcs properly (PrimeIntellect-ai#38)
  * Fix minor rebase issue
  * Fix ruff issues
  * Fix rebase issues

  Co-authored-by: Sami Jaghouar <sami.jaghouar@hotmail.fr>
This PR adds a new `vllm_serve_async.py` script to verifiers. It adds:
- `vllm_serve.py`
- `vllm.entrypoints.openai.api_server`
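Since the new script exposes vLLM's native OpenAI-compatible API, a usage sketch with the official `openai` client would look roughly like this; the host, port, and model name are placeholders that depend on how the server is launched.

```python
# Usage sketch: querying a vLLM OpenAI-compatible server (such as the one this PR adds)
# with the official openai client. Host, port, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # address where vllm_serve_async.py is assumed to listen
    api_key="EMPTY",                      # vLLM's server does not check the key by default
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # must match the model the server was launched with
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```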