
vLLM native OpenAI compatible server with weight syncing #63

Closed

BjarniHaukur wants to merge 2 commits into PrimeIntellect-ai:main from BjarniHaukur:main

Conversation

@BjarniHaukur

This PR adds a new vllm_serve_async.py script to verifiers. It provides:

  • a fully featured OpenAI-compatible endpoint
  • the same weight-syncing logic as vllm_serve.py
  • delegation of endpoint complexity to vllm.entrypoints.openai.api_server

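The weight-syncing half of the script amounts to bookkeeping around the trainer pushing updated parameters into the inference server. The toy sketch below illustrates that idea only; all names (`WeightSyncState`, `update_named_param`) are hypothetical stand-ins, and the real script moves tensors between processes rather than tracking names in a set.

```python
# Toy illustration of weight-sync bookkeeping (hypothetical names, not vLLM API):
# the trainer pushes parameters one at a time; once a full set has arrived,
# the server considers a new weight version live.
from dataclasses import dataclass, field

@dataclass
class WeightSyncState:
    """Tracks which named parameters the trainer has pushed this round."""
    expected: set            # parameter names the model exposes
    received: set = field(default_factory=set)
    version: int = 0         # bumped once a complete sync lands

    def update_named_param(self, name: str) -> None:
        if name not in self.expected:
            raise KeyError(f"unknown parameter: {name}")
        self.received.add(name)
        if self.received == self.expected:   # full checkpoint received
            self.version += 1
            self.received.clear()            # ready for the next sync round

state = WeightSyncState(expected={"lm_head.weight", "embed.weight"})
state.update_named_param("lm_head.weight")
state.update_named_param("embed.weight")
print(state.version)  # 1 — a full set of updates bumps the version
```

The point of keeping this logic separate from the endpoint code is exactly what the PR describes: the OpenAI-compatible surface can be delegated wholesale to vllm.entrypoints.openai.api_server while the sync state lives alongside it.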

@CLAassistant

CLAassistant commented May 26, 2025

CLA assistant check
All committers have signed the CLA.

@BjarniHaukur
Author

I commented out the data-parallel tests; I'm baffled as to why they don't work here. I have an open PR to trl where they pass without fail (trl pr)

If this is something you want to merge, I can take another look.

@willccbb
Member

willccbb commented Jun 2, 2025

i ended up implementing my own dynamic batching server which builds around LLM() but exposes an async endpoint. i never found AsyncLLM to be reliable for weight-syncing in training runs, and it seems like the general guidance is that it isn't fully supported without a custom build config/container (e.g. how it's handled in veRL and related libraries)

at some point we'll migrate to AsyncLLM but the current solution works well for now (perf tests are fairly close in my measurements) and offers a bit more control for error handling.
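The dynamic-batching pattern described above — an async endpoint in front of a synchronous batched engine — can be sketched in pure asyncio. This is a toy illustration, not the actual server: `MicroBatcher` and its parameters are hypothetical, and a lambda stands in for a batched `LLM.generate` call.

```python
# Toy sketch of dynamic batching: collect requests that arrive close together,
# run them through one synchronous batch call (standing in for LLM.generate),
# and resolve each caller's future individually.
import asyncio

class MicroBatcher:
    def __init__(self, batch_fn, max_batch=8, max_wait=0.01):
        self.batch_fn = batch_fn      # sync callable: list of prompts -> list of outputs
        self.max_batch = max_batch    # flush when this many requests are queued
        self.max_wait = max_wait      # ...or when the oldest request has waited this long
        self.queue = asyncio.Queue()

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut              # resolved by the worker loop below

    async def run(self):
        while True:
            batch = [await self.queue.get()]           # block for the first request
            deadline = asyncio.get_running_loop().time() + self.max_wait
            while len(batch) < self.max_batch:         # opportunistically gather more
                timeout = deadline - asyncio.get_running_loop().time()
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), max(timeout, 0)))
                except asyncio.TimeoutError:
                    break
            outputs = self.batch_fn([p for p, _ in batch])   # one batched engine call
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)

async def main():
    batcher = MicroBatcher(lambda prompts: [p.upper() for p in prompts])
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(p) for p in ["a", "b", "c"]))
    worker.cancel()
    return results

print(asyncio.run(main()))  # ['A', 'B', 'C']
```

Keeping the engine call synchronous is the design trade mentioned above: you give up some concurrency machinery but gain straightforward error handling, since a failed batch call surfaces in one place rather than inside AsyncLLM's internals.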

@willccbb willccbb closed this Jun 2, 2025
@lewtun

lewtun commented Jul 21, 2025

i never found AsyncLLM to be reliable for weight-syncing in training runs

Out of curiosity, what kind of errors did you hit with AsyncLLM?

ronaldnetawat pushed a commit to ronaldnetawat/verifiers that referenced this pull request Nov 13, 2025
