Hi there,
I've been trying to run multi-turn GRPO. The training set has 1,027 data points and the validation set has 4,715. Saving checkpoints and HF models worked fine until step 72, where the checkpoint save took over 15 minutes; when the trainer then tried to save a HuggingFace model at step 73, it failed with the following error message:
(skyrl_entrypoint pid=3034109) 2025-09-08 22:36:45.019 | INFO | skyrl_train.trainer:train:359 - Finished: 'save_hf_model', time cost: 601.62s
Traceback (most recent call last):
File "/home/ubuntu/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 288, in main
ray.get(skyrl_entrypoint.remote(cfg))
File "/home/ubuntu/.cache/uv/builds-v0/.tmpu9oQzg/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.cache/uv/builds-v0/.tmpu9oQzg/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.cache/uv/builds-v0/.tmpu9oQzg/lib/python3.12/site-packages/ray/_private/worker.py", line 2858, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.cache/uv/builds-v0/.tmpu9oQzg/lib/python3.12/site-packages/ray/_private/worker.py", line 958, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::skyrl_entrypoint() (pid=3034109, ip=172.31.18.65)
File "/home/ubuntu/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 279, in skyrl_entrypoint
exp.run()
File "/home/ubuntu/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 272, in run
trainer.train()
File "/tmp/ray/session_2025-09-08_15-43-10_928650_3021217/runtime_resources/working_dir_files/_ray_pkg_18ecd9287266148b/skyrl_train/trainer.py", line 360, in train
self.save_models()
File "/tmp/ray/session_2025-09-08_15-43-10_928650_3021217/runtime_resources/working_dir_files/_ray_pkg_18ecd9287266148b/skyrl_train/trainer.py", line 1376, in save_models
ray.get(
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: FSDPPolicyRayActorBase
actor_id: d23854da87fe7662fa512c1b01000000
pid: 3041415
namespace: 2b8a18f5-2d6c-40b4-b2b6-6ee8d3bf69df
ip: ...
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
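The exit detail lists the OOM killer as potential cause (1). After the next crash I plan to confirm this against the kernel log; a minimal sketch of the check I have in mind (assuming dmesg is readable on this box; some systems need sudo for it):

import subprocess

# Scan the kernel ring buffer for OOM-killer activity around the crash time,
# then match the killed PID against the dead actor's pid (here 3041415).
result = subprocess.run(["dmesg", "-T"], capture_output=True, text=True)
for line in result.stdout.splitlines():
    if any(k in line for k in ("Out of memory", "Killed process", "oom-kill")):
        print(line)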
Here is the logged curve of timing/save_checkpoints:
I suspect this error is related to the size of my training set: I ran the exact same config on a ~2,000-data-point training set, and there it couldn't even save the HF model at the first save interval (step 18). I've been hitting this error consistently.
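For context, here's my rough back-of-the-envelope on why the save path could trip the host OOM killer (a sketch assuming a ~7B-parameter model, bf16 weights, fp32 Adam states, and a full-state-dict gather to one rank; I don't know SkyRL's exact save implementation, and the checkpoint may well be sharded):

params = 7e9  # customized_qwen2.5-code-7B-instruct, ~7B parameters

hf_save_gb = params * 2 / 1e9   # bf16 full state dict gathered to one rank: ~14 GB
ckpt_gb = params * 4 * 3 / 1e9  # fp32 weights + Adam exp_avg + exp_avg_sq: ~84 GB
print(f"HF save (bf16 full state dict): ~{hf_save_gb:.0f} GB host RAM")
print(f"Checkpoint (fp32 weights + Adam moments): ~{ckpt_gb:.0f} GB if unsharded")
# If the gathered state dict plus the CPU-offloaded ref model plus dataloader
# buffers exceed host RAM, the kernel OOM killer SIGKILLs the actor mid-save.

Full launch command and config for reference: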
NUM_GPUS=8
NUM_INFERENCE_ENGINES=2
TP_SIZE=4
MAX_INPUT_LENGTH=29000
MAX_GENERATE_LENGTH=3000
TRAIN_BATCH_SIZE=128
uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
  trainer.algorithm.advantage_estimator="grpo" \
  data.train_data="['/home/ubuntu/s3_in/skyrl/rl_training_set_1027.parquet']" \
  data.val_data="['/home/ubuntu/s3_in/skyrl/rl_validation_set_4715.parquet']" \
  trainer.policy.model.path="customized_qwen2.5-code-7B-instruct" \
  trainer.epochs=20 \
  trainer.placement.colocate_all=true \
  trainer.strategy=fsdp2 \
  trainer.policy.fsdp_config.cpu_offload=false \
  trainer.ref.fsdp_config.cpu_offload=true \
  trainer.policy.optimizer_config.max_grad_norm=0.5 \
  trainer.policy.sequence_parallel_size=1 \
  trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
  trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
  generator.num_inference_engines=$NUM_INFERENCE_ENGINES \
  generator.inference_engine_tensor_parallel_size=$TP_SIZE \
  trainer.train_batch_size=$TRAIN_BATCH_SIZE \
  trainer.micro_forward_batch_size_per_gpu=8 \
  trainer.micro_train_batch_size_per_gpu=1 \
  trainer.max_prompt_length=6000 \
  generator.max_input_length=$MAX_INPUT_LENGTH \
  generator.sampling_params.max_generate_length=$MAX_GENERATE_LENGTH \
  trainer.policy.optimizer_config.lr=1.0e-6 \
  trainer.policy_mini_batch_size=128 \
  trainer.algorithm.use_kl_loss=false \
  trainer.ckpt_interval=18 \
  trainer.max_ckpts_to_keep=1 \
  trainer.hf_save_interval=19 \
  trainer.dump_data_batch=false \
  generator.backend=vllm \
  generator.run_engines_locally=true \
  generator.weight_sync_backend=nccl \
  generator.async_engine=true \
  generator.batched=false \
  environment.env_class=text2sql \
  generator.use_conversation_multi_turn=true \
  generator.n_samples_per_prompt=6 \
  generator.gpu_memory_utilization=0.7 \
  generator.max_turns=10 \
  generator.sampling_params.temperature=0.5 \
  generator.sampling_params.top_p=0.95 \
  generator.sampling_params.stop='["", ""]' \
  generator.append_eos_token_after_stop_str_in_multi_turn=true \
  generator.eval_sampling_params.stop='["", ""]' \
  generator.eval_n_samples_per_prompt=6 \
  trainer.project_name="skyrlsql_p5" \
  trainer.run_name=$run_name \
  trainer.resume_mode=latest \
  trainer.ckpt_path=$HOME/ckpts/${run_name} \
  trainer.export_path=$HOME/skyrl_export/${run_name}/ \
  trainer.eval_batch_size=1024 \
  trainer.eval_before_train=false \
  trainer.eval_interval=18
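For the next run I'll also log host memory alongside training so I can see whether available RAM bottoms out during the save steps; a minimal sketch using psutil (a hypothetical helper I'd run as a separate process next to the trainer, stopped with Ctrl-C):

import time
import psutil

# Append a timestamped snapshot of host memory every 5 s; correlate the dips
# with the save_checkpoints / save_hf_model timestamps in the trainer log.
with open("mem_log.csv", "a") as f:
    f.write("time,available_gb,used_pct\n")
    while True:
        vm = psutil.virtual_memory()
        f.write(f"{time.time():.0f},{vm.available / 1e9:.1f},{vm.percent}\n")
        f.flush()
        time.sleep(5)

Has anyone seen saves fail like this, and is there a recommended way to keep HF export memory bounded with colocate_all=true?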