Hi there,
I've been trying to run multi-turn GRPO. The training set has 1,027 data points and the validation set has 4,715. Saving checkpoints and HF models worked fine until step 72, where the checkpoint save took over 15 minutes; when the trainer then tried to save a HuggingFace model at step 73, it failed with the following error message:
(skyrl_entrypoint pid=3034109) 2025-09-08 22:36:45.019 | INFO | skyrl_train.trainer:train:359 - Finished: 'save_hf_model', time cost: 601.62s
Traceback (most recent call last):
File "/home/ubuntu/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 288, in main
ray.get(skyrl_entrypoint.remote(cfg))
File "/home/ubuntu/.cache/uv/builds-v0/.tmpu9oQzg/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.cache/uv/builds-v0/.tmpu9oQzg/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.cache/uv/builds-v0/.tmpu9oQzg/lib/python3.12/site-packages/ray/_private/worker.py", line 2858, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.cache/uv/builds-v0/.tmpu9oQzg/lib/python3.12/site-packages/ray/_private/worker.py", line 958, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::skyrl_entrypoint() (pid=3034109, ip=172.31.18.65)
File "/home/ubuntu/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 279, in skyrl_entrypoint
exp.run()
File "/home/ubuntu/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 272, in run
trainer.train()
File "/tmp/ray/session_2025-09-08_15-43-10_928650_3021217/runtime_resources/working_dir_files/_ray_pkg_18ecd9287266148b/skyrl_train/trainer.py", line 360, in train
self.save_models()
File "/tmp/ray/session_2025-09-08_15-43-10_928650_3021217/runtime_resources/working_dir_files/_ray_pkg_18ecd9287266148b/skyrl_train/trainer.py", line 1376, in save_models
ray.get(
^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: FSDPPolicyRayActorBase
actor_id: d23854da87fe7662fa512c1b01000000
pid: 3041415
namespace: 2b8a18f5-2d6c-40b4-b2b6-6ee8d3bf69df
ip: ...
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
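The exit detail lists the OOM killer as potential cause (1). After the next crash I plan to confirm this against the kernel log; a minimal sketch of the check I have in mind (assuming dmesg is readable on this box; some systems need sudo for it):

import subprocess

# Scan the kernel ring buffer for OOM-killer activity around the crash time,
# then match the killed PID against the dead actor's pid (here 3041415).
result = subprocess.run(["dmesg", "-T"], capture_output=True, text=True)
for line in result.stdout.splitlines():
    if any(k in line for k in ("Out of memory", "Killed process", "oom-kill")):
        print(line)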
Here is the logged curve of timing/save_checkpoints:
I suspect this error is related to the size of my training set: I ran the exact same config on a ~2,000-data-point training set, and there it couldn't even save the HF model at the first save interval (step 18). I've been hitting this error consistently.
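For context, here's my rough back-of-the-envelope on why the save path could trip the host OOM killer (a sketch assuming a ~7B-parameter model, bf16 weights, fp32 Adam states, and a full-state-dict gather to one rank; I don't know SkyRL's exact save implementation, and the checkpoint may well be sharded):

params = 7e9  # customized_qwen2.5-code-7B-instruct, ~7B parameters

hf_save_gb = params * 2 / 1e9   # bf16 full state dict gathered to one rank: ~14 GB
ckpt_gb = params * 4 * 3 / 1e9  # fp32 weights + Adam exp_avg + exp_avg_sq: ~84 GB
print(f"HF save (bf16 full state dict): ~{hf_save_gb:.0f} GB host RAM")
print(f"Checkpoint (fp32 weights + Adam moments): ~{ckpt_gb:.0f} GB if unsharded")
# If the gathered state dict plus the CPU-offloaded ref model plus dataloader
# buffers exceed host RAM, the kernel OOM killer SIGKILLs the actor mid-save.

Full launch command and config for reference: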
NUM_GPUS=8
NUM_INFERENCE_ENGINES=2
TP_SIZE=4
MAX_INPUT_LENGTH=29000
MAX_GENERATE_LENGTH=3000
TRAIN_BATCH_SIZE=128
uv run --isolated --extra vllm -m skyrl_train.entrypoints.main_base \
  trainer.algorithm.advantage_estimator="grpo" \
  data.train_data="['/home/ubuntu/s3_in/skyrl/rl_training_set_1027.parquet']" \
  data.val_data="['/home/ubuntu/s3_in/skyrl/rl_validation_set_4715.parquet']" \
  trainer.policy.model.path="customized_qwen2.5-code-7B-instruct" \
  trainer.epochs=20 \
  trainer.placement.colocate_all=true \
  trainer.strategy=fsdp2 \
  trainer.policy.fsdp_config.cpu_offload=false \
  trainer.ref.fsdp_config.cpu_offload=true \
  trainer.policy.optimizer_config.max_grad_norm=0.5 \
  trainer.policy.sequence_parallel_size=1 \
  trainer.placement.policy_num_gpus_per_node=$NUM_GPUS \
  trainer.placement.ref_num_gpus_per_node=$NUM_GPUS \
  generator.num_inference_engines=$NUM_INFERENCE_ENGINES \
  generator.inference_engine_tensor_parallel_size=$TP_SIZE \
  trainer.train_batch_size=$TRAIN_BATCH_SIZE \
  trainer.micro_forward_batch_size_per_gpu=8 \
  trainer.micro_train_batch_size_per_gpu=1 \
  trainer.max_prompt_length=6000 \
  generator.max_input_length=$MAX_INPUT_LENGTH \
  generator.sampling_params.max_generate_length=$MAX_GENERATE_LENGTH \
  trainer.policy.optimizer_config.lr=1.0e-6 \
  trainer.policy_mini_batch_size=128 \
  trainer.algorithm.use_kl_loss=false \
  trainer.ckpt_interval=18 \
  trainer.max_ckpts_to_keep=1 \
  trainer.hf_save_interval=19 \
  trainer.dump_data_batch=false \
  generator.backend=vllm \
  generator.run_engines_locally=true \
  generator.weight_sync_backend=nccl \
  generator.async_engine=true \
  generator.batched=false \
  environment.env_class=text2sql \
  generator.use_conversation_multi_turn=true \
  generator.n_samples_per_prompt=6 \
  generator.gpu_memory_utilization=0.7 \
  generator.max_turns=10 \
  generator.sampling_params.temperature=0.5 \
  generator.sampling_params.top_p=0.95 \
  generator.sampling_params.stop='["", ""]' \
  generator.append_eos_token_after_stop_str_in_multi_turn=true \
  generator.eval_sampling_params.stop='["", ""]' \
  generator.eval_n_samples_per_prompt=6 \
  trainer.project_name="skyrlsql_p5" \
  trainer.run_name=$run_name \
  trainer.resume_mode=latest \
  trainer.ckpt_path=$HOME/ckpts/${run_name} \
  trainer.export_path=$HOME/skyrl_export/${run_name}/ \
  trainer.eval_batch_size=1024 \
  trainer.eval_before_train=false \
  trainer.eval_interval=18
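For the next run I'll also log host memory alongside training so I can see whether available RAM bottoms out during the save steps; a minimal sketch using psutil (a hypothetical helper I'd run as a separate process next to the trainer, stopped with Ctrl-C):

import time
import psutil

# Append a timestamped snapshot of host memory every 5 s; correlate the dips
# with the save_checkpoints / save_hf_model timestamps in the trainer log.
with open("mem_log.csv", "a") as f:
    f.write("time,available_gb,used_pct\n")
    while True:
        vm = psutil.virtual_memory()
        f.write(f"{time.time():.0f},{vm.available / 1e9:.1f},{vm.percent}\n")
        f.flush()
        time.sleep(5)

Has anyone seen saves fail like this, and is there a recommended way to keep HF export memory bounded with colocate_all=true?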