Error:
Same behavior as #564. The run fails with:
    raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
Repro:
commit 3f6d52f
uv run python examples/run_grpo_math.py \
policy.generation.colocated.enabled=false \
policy.generation.colocated.resources.gpus_per_node=2 \
policy.generation.vllm_cfg.tensor_parallel_size=2 \
checkpointing.enabled=false \
cluster.gpus_per_node=4