Currently we use `vllm.distributed.device_communicators.pynccl` to broadcast and update weights in our non-colocated implementation (#489), since `ray.util.collective` does not yet work well when vLLM tp-size > 1.
This couples our training worker to a specific inference backend, so it would be better to find a way to use native Ray collectives to decouple them.
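For reference, a minimal sketch of what the decoupled path might look like with `ray.util.collective`, assuming the train worker and every vLLM TP worker can join one NCCL collective group; the group name, rank layout, and helper functions here are illustrative only, not the actual implementation:

```python
# Hypothetical sketch: weight broadcast over a native Ray collective group.
import torch
import ray.util.collective as collective

GROUP_NAME = "weight_update_group"  # illustrative name

def init_group(world_size: int, rank: int) -> None:
    # Called inside each Ray actor: train worker as rank 0,
    # vLLM TP workers as ranks 1..world_size-1.
    collective.init_collective_group(
        world_size=world_size, rank=rank, backend="nccl", group_name=GROUP_NAME
    )

def broadcast_weight(tensor: torch.Tensor) -> None:
    # Same call on every rank: rank 0 sends, the others receive in place.
    collective.broadcast(tensor, src_rank=0, group_name=GROUP_NAME)
```

If this worked for tp-size > 1, the train worker would only depend on `ray.util.collective` rather than on vLLM's internal pynccl communicator.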