Issues reproducing Text2SQL tutorial - Watchdog caught collective operation timeout when running examples/text_to_sql/run_skyrl_sql.sh #757

@shanghongsim

Description

Hi, I’m trying to reproduce the text2sql example and am running into a NCCL watchdog timeout during policy_train with FSDP on an 8×H100 node.

Environment

  • Cluster: Lambda Labs, 1 node, 8× NVIDIA H100 GPUs
  • Container image: novaskyai/skyrl-train-ray-2.51.1-py3.12-cu12.8
  • CUDA: 12.8 (nvidia-smi confirms)
  • Ray: 2.51.1
  • Python: 3.12 (via uv venv --python 3.12)
  • Repository: NovaSky-AI/SkyRL
    • Branch: main
    • git status: clean, up to date with origin/main
Full pip list
(skyrl-train) (base) shanghong@oumi-compute004:~/SkyRL/skyrl-train$ pip list
Package                            Version
---------------------------------- --------------
adlfs                              2023.8.0
aiofiles                           22.1.0
aiohappyeyeballs                   2.6.1
aiohttp                            3.11.16
aiohttp-cors                       0.7.0
aiosignal                          1.3.1
aiosqlite                          0.19.0
amqp                               5.3.1
annotated-types                    0.6.0
anyio                              3.7.1
anyscale                           0.26.72
archspec                           0.2.5
argon2-cffi                        23.1.0
argon2-cffi-bindings               21.2.0
arrow                              1.3.0
asttokens                          2.4.1
attrs                              25.1.0
azure-common                       1.1.28
azure-core                         1.29.5
azure-datalake-store               0.0.53
azure-identity                     1.17.1
azure-storage-blob                 12.22.0
Babel                              2.13.1
backcall                           0.2.0
beautifulsoup4                     4.11.1
billiard                           4.2.1
bleach                             6.1.0
boltons                            24.0.0
boto3                              1.29.7
botocore                           1.32.7
Brotli                             1.1.0
cachetools                         5.5.2
celery                             5.5.3
certifi                            2025.1.31
cffi                               1.16.0
charset-normalizer                 3.3.2
click                              8.1.7
click-didyoumean                   0.3.1
click-plugins                      1.1.1.2
click-repl                         0.3.0
cloudpickle                        3.1.1
colorama                           0.4.6
colorful                           0.5.5
comm                               0.2.0
conda                              24.11.3
conda-libmamba-solver              24.9.0
conda-package-handling             2.4.0
conda_package_streaming            0.11.0
cryptography                       44.0.3
cupy-cuda12x                       13.1.0
debugpy                            1.8.0
decorator                          5.1.1
defusedxml                         0.7.1
distlib                            0.3.7
distro                             1.9.0
dm-tree                            0.1.8
entrypoints                        0.4
executing                          2.0.1
Farama-Notifications               0.0.4
fastapi                            0.115.12
fastjsonschema                     2.19.0
fastrlock                          0.8.2
filelock                           3.17.0
fqdn                               1.5.1
frozendict                         2.4.6
frozenlist                         1.4.1
fsspec                             2023.12.1
gitdb                              4.0.11
GitPython                          3.1.44
google-api-core                    2.24.2
google-auth                        2.23.4
google-cloud-core                  2.4.1
google-cloud-storage               2.14.0
google-crc32c                      1.5.0
google-resumable-media             2.6.0
googleapis-common-protos           1.61.0
grpcio                             1.74.0
gymnasium                          1.1.1
h11                                0.16.0
h2                                 4.1.0
hpack                              4.0.0
httplib2                           0.20.4
httptools                          0.7.1
humanize                           4.12.1
hyperframe                         6.0.1
idna                               3.7
importlib-metadata                 6.11.0
ipykernel                          6.27.1
ipython                            8.12.3
ipython-genutils                   0.2.0
ipywidgets                         8.1.3
isodate                            0.6.1
isoduration                        20.11.0
jedi                               0.19.1
Jinja2                             3.1.6
jmespath                           1.0.1
json5                              0.9.14
jsonpatch                          1.32
jsonpointer                        2.4
jsonschema                         4.23.0
jsonschema-specifications          2024.10.1
jupyter-client                     7.3.4
jupyter_core                       5.5.0
jupyter-events                     0.6.3
jupyter-server                     1.24.0
jupyter_server_fileid              0.9.0
jupyter_server_ydoc                0.6.1
jupyter-ydoc                       0.2.5
jupyterlab                         3.6.1
jupyterlab_pygments                0.3.0
jupyterlab_server                  2.24.0
jupyterlab_widgets                 3.0.11
kombu                              5.5.4
libmambapy                         1.5.12
log-symbols                        0.0.14
lxml                               4.9.4
lz4                                4.3.3
mamba                              1.5.12
markdown-it-py                     2.2.0
MarkupSafe                         2.1.3
matplotlib-inline                  0.1.6
mdurl                              0.1.2
memray                             1.10.0
menuinst                           2.2.0
mistune                            0.8.4
msal                               1.28.1
msal-extensions                    1.2.0b1
msgpack                            1.0.7
multidict                          6.0.5
nbclassic                          1.0.0
nbclient                           0.5.13
nbconvert                          6.5.4
nbformat                           5.9.2
nest-asyncio                       1.5.8
notebook                           6.5.7
notebook_shim                      0.2.3
numpy                              1.26.4
oauth2client                       4.1.3
opencensus                         0.11.4
opencensus-context                 0.1.3
opentelemetry-api                  1.34.1
opentelemetry-exporter-prometheus  0.55b1
opentelemetry-proto                1.27.0
opentelemetry-sdk                  1.34.1
opentelemetry-semantic-conventions 0.55b1
ormsgpack                          1.7.0
packaging                          23.0
pandas                             2.3.3
pandocfilters                      1.5.0
parso                              0.8.3
pathspec                           0.11.2
pexpect                            4.8.0
pickleshare                        0.7.5
pip                                24.3.1
platformdirs                       3.11.0
pluggy                             1.5.0
polars                             1.32.3
portalocker                        2.8.2
prometheus-client                  0.19.0
prompt-toolkit                     3.0.41
propcache                          0.3.0
proto-plus                         1.22.3
protobuf                           4.25.8
psutil                             5.9.6
ptyprocess                         0.7.0
pure-eval                          0.2.2
py-spy                             0.4.1
pyarrow                            19.0.1
pyasn1                             0.5.1
pyasn1-modules                     0.3.0
pycosat                            0.6.6
pycparser                          2.21
pycurl                             7.45.3
pydantic                           2.11.7
pydantic_core                      2.33.2
Pygments                           2.18.0
PyJWT                              2.8.0
pyOpenSSL                          25.0.0
pyparsing                          3.1.1
PySocks                            1.7.1
python-dateutil                    2.8.2
python-dotenv                      1.2.1
python-json-logger                 2.0.7
pytz                               2022.7.1
PyYAML                             6.0.1
pyzmq                              26.0.3
ray                                2.51.1
referencing                        0.36.2
requests                           2.32.3
rfc3339-validator                  0.1.4
rfc3986-validator                  0.1.1
rich                               13.3.2
rpds-py                            0.22.3
rsa                                4.7.2
ruamel.yaml                        0.18.10
ruamel.yaml.clib                   0.2.8
s3transfer                         0.8.0
scipy                              1.11.4
Send2Trash                         1.8.3
setuptools                         75.8.0
six                                1.16.0
smart-open                         6.2.0
smmap                              5.0.1
sniffio                            1.3.1
soupsieve                          2.5
spinners                           0.0.24
stack-data                         0.6.3
starlette                          0.46.2
supervisor                         4.3.0
tabulate                           0.9.0
tensorboardX                       2.6.2.2
termcolor                          2.4.0
terminado                          0.18.1
tinycss2                           1.3.0
tornado                            6.5.2
tqdm                               4.67.1
traitlets                          5.14.3
truststore                         0.10.0
types-python-dateutil              2.9.0.20240316
typing_extensions                  4.12.2
typing-inspection                  0.4.1
tzdata                             2025.2
tzlocal                            5.3
uri-template                       1.3.0
urllib3                            1.26.19
uvicorn                            0.22.0
uvloop                             0.21.0
vine                               5.1.0
virtualenv                         20.29.1
watchfiles                         0.19.0
wcwidth                            0.2.13
webcolors                          24.6.0
webencodings                       0.5.1
websocket-client                   1.8.0
websockets                         11.0.3
wheel                              0.45.1
widgetsnbextension                 4.0.11
wrapt                              1.14.1
y-py                               0.6.2
yarl                               1.18.3
ypy-websocket                      0.8.4
zipp                               3.19.2
zstandard                          0.23.0

Setup / Repro Steps

On the Lambda node:

srun --gpus=8 --time=72:00:00 \
  --container-writable \
  --container-image=novaskyai/skyrl-train-ray-2.51.1-py3.12-cu12.8 \
  --container-mounts=/home/$USER/data:/workspace/data \
  --container-save=/home/$USER/data/skyrl-img.sqsh \
  --pty bash

Inside the container:

nvidia-smi  # confirms CUDA 12.8

ray start --head
ray status  # confirms the head node is up

git clone --recurse-submodules https://github.com/NovaSky-AI/SkyRL
cd SkyRL/skyrl-train

uv venv --python 3.12
source .venv/bin/activate
uv sync --active --extra vllm

export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook
export WANDB_API_KEY=<wandb-api-key>

bash examples/text_to_sql/run_skyrl_sql.sh 

I did not modify the training configuration; the only change was the data path, pointed at where my data files are stored.
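For completeness, a sketch of what that change looks like on my side (DATA_DIR and the parquet file names below are placeholders for my local layout, not SkyRL's actual config keys):

```shell
# Placeholder paths: illustrates the only local change (data location).
# DATA_DIR and the file names are my layout, not SkyRL config keys.
DATA_DIR=/workspace/data/text_to_sql

# Quick pre-flight: make sure the files the script will read actually exist.
for f in train.parquet validation.parquet; do
  [ -f "$DATA_DIR/$f" ] || echo "missing: $DATA_DIR/$f"
done
```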

The error

Training proceeds through:

  • convert_to_training_input
  • fwd_logprobs_values_reward
  • compute_advantages_and_returns
  • dump_data_batch

Then, during:

train_critic_and_policy -> policy_train (FSDPPolicyWorkerBase)

the run hits an NCCL watchdog timeout and the Ray actor dies.
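If it helps, I can re-run with extra NCCL debugging enabled. A sketch of what I would set before the launch (the trace-buffer variable is the one the watchdog message itself suggests; the others are standard PyTorch/NCCL debug switches, and only increase logging, not behavior):

```shell
# Diagnostics for a re-run; set before `bash examples/text_to_sql/run_skyrl_sql.sh`.
export NCCL_DEBUG=INFO                      # verbose NCCL transport/topology logs
export TORCH_NCCL_TRACE_BUFFER_SIZE=20000   # enable FlightRecorder dumps (suggested by the watchdog log below)
export TORCH_DISTRIBUTED_DEBUG=DETAIL       # extra c10d collective consistency checks
```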

Key parts of the error log:

(skyrl_entrypoint pid=952112) 2025-12-06 12:36:56.105 | INFO     | skyrl_train.trainer:convert_to_training_input:597 - Number of sequences before padding: 1280
(skyrl_entrypoint pid=952112) 2025-12-06 12:36:56.105 | INFO     | skyrl_train.trainer:convert_to_training_input:599 - Number of sequences after padding: 1280
(skyrl_entrypoint pid=952112) 2025-12-06 12:36:56.105 | INFO     | skyrl_train.trainer:train:238 - Number of sequences: 1280
(skyrl_entrypoint pid=952112) 2025-12-06 12:36:56.105 | INFO     | skyrl_train.trainer:train:236 - Finished: 'convert_to_training_input', time cost: 171.64s
(skyrl_entrypoint pid=952112) 2025-12-06 12:36:56.105 | INFO     | skyrl_train.trainer:train:241 - Started: 'fwd_logprobs_values_reward'
(skyrl_entrypoint pid=952112) 2025-12-06 12:47:09.361 | INFO     | skyrl_train.trainer:train:241 - Finished: 'fwd_logprobs_values_reward', time cost: 613.26s
(skyrl_entrypoint pid=952112) 2025-12-06 12:47:09.362 | INFO     | skyrl_train.trainer:train:250 - Started: 'compute_advantages_and_returns'
(skyrl_entrypoint pid=952112) 2025-12-06 12:47:26.199 | INFO     | skyrl_train.trainer:compute_advantages_and_returns:781 - avg_final_rewards: -0.29609376192092896, avg_response_length: 904.32890625
(skyrl_entrypoint pid=952112) 2025-12-06 12:47:26.365 | INFO     | skyrl_train.trainer:train:250 - Finished: 'compute_advantages_and_returns', time cost: 17.00s
(skyrl_entrypoint pid=952112) 2025-12-06 12:47:26.386 | INFO     | skyrl_train.trainer:train:262 - Started: 'dump_data_batch'
(skyrl_entrypoint pid=952112) 2025-12-06 12:48:11.955 | INFO     | skyrl_train.trainer:train:262 - Finished: 'dump_data_batch', time cost: 45.57s
(skyrl_entrypoint pid=952112) 2025-12-06 12:48:11.955 | INFO     | skyrl_train.trainer:train:267 - Started: 'train_critic_and_policy'
(skyrl_entrypoint pid=952112) 2025-12-06 12:48:11.956 | INFO     | skyrl_train.trainer:train_critic_and_policy:1038 - Started: 'policy_train'
(FSDPPolicyWorkerBase pid=959049)
Policy Train epoch [1/1]:   0%|          | 0/160 [00:00<?, ?it/s]
(FSDPPolicyWorkerBase pid=959049) [rank0]:[E1206 13:01:11.084722089 ProcessGroupNCCL.cpp:685] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19962, OpType=_ALLGATHER_BASE, NumelIn=68125120, NumelOut=545000960, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
(FSDPPolicyWorkerBase pid=959049) [rank0]:[E1206 13:01:11.138792342 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 0]  failure detected by watchdog at work sequence id: 19962 PG status: last enqueued work: 19964, last completed work: 19961
(FSDPPolicyWorkerBase pid=959049) [rank0]:[E1206 13:01:11.138832442 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
(FSDPPolicyWorkerBase pid=959049) [rank0]:[E1206 13:01:11.138864267 ProcessGroupNCCL.cpp:2584] [PG ID 0 PG GUID 0(default_pg) Rank 0] First PG on this rank to signal dumping.
(FSDPPolicyWorkerBase pid=959298) [rank3]:[E1206 13:01:11.434681723 ProcessGroupNCCL.cpp:1806] [PG ID 0 PG GUID 0(default_pg) Rank 3] Observed flight recorder dump signal from another rank via TCPStore.
(FSDPPolicyWorkerBase pid=959298) [rank3]:[E1206 13:01:11.436152721 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0(default_pg) Rank 3] Received a dump signal due to a collective timeout from  rank 0 and we will try our best to dump the debug info. Last enqueued NCCL work: 19964, last completed NCCL work: 19961.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
(FSDPPolicyWorkerBase pid=959049) [rank0]:[E1206 13:01:11.438828600 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0(default_pg) Rank 0] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 19964, last completed NCCL work: 19961.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
(FSDPPolicyWorkerBase pid=959049) [rank0]:[E1206 13:01:11.454303498 ProcessGroupNCCL.cpp:1589] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
(FSDPPolicyWorkerBase pid=959301) [rank6]:[E1206 13:01:12.087674097 ProcessGroupNCCL.cpp:1935] [PG ID 0 PG GUID 0(default_pg) Rank 6] Could not acquire GIL within 300 ms on exit, possible GIL induced hang
(FSDPPolicyWorkerBase pid=959049) [rank0]:[E1206 13:02:11.138949062 ProcessGroupNCCL.cpp:746] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(FSDPPolicyWorkerBase pid=959049) [rank0]:[E1206 13:02:11.138971618 ProcessGroupNCCL.cpp:760] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
(FSDPPolicyWorkerBase pid=959303) [rank5]:[E1206 13:01:12.983163669 ProcessGroupNCCL.cpp:1806] [PG ID 0 PG GUID 0(default_pg) Rank 5] Observed flight recorder dump signal from another rank via TCPStore. [repeated 6x across cluster]
(FSDPPolicyWorkerBase pid=959303) [rank5]:[E1206 13:01:12.983490603 ProcessGroupNCCL.cpp:1870] [PG ID 0 PG GUID 0(default_pg) Rank 5] Received a dump signal due to a collective timeout from  rank 0 and we will try our best to dump the debug info. Last enqueued NCCL work: 19964, last completed NCCL work: 19961.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc. [repeated 6x across cluster]
(FSDPPolicyWorkerBase pid=959303) [rank5]:[E1206 13:01:12.983768157 ProcessGroupNCCL.cpp:1589] [PG ID 0 PG GUID 0(default_pg) Rank 5] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1 [repeated 7x across cluster]
(FSDPPolicyWorkerBase pid=959049) [rank0]:[E1206 13:02:11.186390648 ProcessGroupNCCL.cpp:2068] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19962, OpType=_ALLGATHER_BASE, NumelIn=68125120, NumelOut=545000960, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
(FSDPPolicyWorkerBase pid=959049) Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:688 (most recent call first):
(FSDPPolicyWorkerBase pid=959049) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7d1228146eb0 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libc10.so)
(FSDPPolicyWorkerBase pid=959049) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x247 (0x7ce2e4640147 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
(FSDPPolicyWorkerBase pid=959049) frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x1591 (0x7ce2e4643b61 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
(FSDPPolicyWorkerBase pid=959049) frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7ce2e4644ec2 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
(FSDPPolicyWorkerBase pid=959049) frame #4: <unknown function> + 0xd3b65 (0x7d122ccf1b65 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(FSDPPolicyWorkerBase pid=959049) frame #5: <unknown function> + 0x94ac3 (0x7d122f094ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(FSDPPolicyWorkerBase pid=959049) frame #6: <unknown function> + 0x1268c0 (0x7d122f1268c0 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(FSDPPolicyWorkerBase pid=959049)
(FSDPPolicyWorkerBase pid=959049) [2025-12-06 13:02:11,819 E 959049 961204] logging.cc:118: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19962, OpType=_ALLGATHER_BASE, NumelIn=68125120, NumelOut=545000960, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
(FSDPPolicyWorkerBase pid=959049) Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:688 (most recent call first):
(FSDPPolicyWorkerBase pid=959049) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7d1228146eb0 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libc10.so)
(FSDPPolicyWorkerBase pid=959049) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x247 (0x7ce2e4640147 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
(FSDPPolicyWorkerBase pid=959049) frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x1591 (0x7ce2e4643b61 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
(FSDPPolicyWorkerBase pid=959049) frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7ce2e4644ec2 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
(FSDPPolicyWorkerBase pid=959049) frame #4: <unknown function> + 0xd3b65 (0x7d122ccf1b65 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(FSDPPolicyWorkerBase pid=959049) frame #5: <unknown function> + 0x94ac3 (0x7d122f094ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(FSDPPolicyWorkerBase pid=959049) frame #6: <unknown function> + 0x1268c0 (0x7d122f1268c0 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(FSDPPolicyWorkerBase pid=959049)
(FSDPPolicyWorkerBase pid=959049) Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
(FSDPPolicyWorkerBase pid=959049) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7d1228146eb0 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libc10.so)
(FSDPPolicyWorkerBase pid=959049) frame #1: <unknown function> + 0xe1c1a1 (0x7ce2e461c1a1 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
(FSDPPolicyWorkerBase pid=959049) frame #2: <unknown function> + 0x9468e6 (0x7ce2e41468e6 in /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
(FSDPPolicyWorkerBase pid=959049) frame #3: <unknown function> + 0xd3b65 (0x7d122ccf1b65 in /home/ray/anaconda3/bin/../lib/libstdc++.so.6)
(FSDPPolicyWorkerBase pid=959049) frame #4: <unknown function> + 0x94ac3 (0x7d122f094ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(FSDPPolicyWorkerBase pid=959049) frame #5: <unknown function> + 0x1268c0 (0x7d122f1268c0 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(FSDPPolicyWorkerBase pid=959049)  /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/ray/_raylet.so(+0x15fd74a) [0x7d122e3fd74a] ray::operator<<()
(FSDPPolicyWorkerBase pid=959049) /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/ray/_raylet.so(+0x15fe1fc) [0x7d122e3fe1fc] ray::RayLog::operator<< <>()
(FSDPPolicyWorkerBase pid=959049) /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/ray/_raylet.so(+0x7f1389) [0x7d122d5f1389] ray::TerminateHandler()
(FSDPPolicyWorkerBase pid=959049) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xb64f2) [0x7d122ccd44f2] __cxxabiv1::__terminate()
(FSDPPolicyWorkerBase pid=959049) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(_ZSt10unexpectedv+0) [0x7d122ccce2f3] std::unexpected()
(FSDPPolicyWorkerBase pid=959049) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xb64eb) [0x7d122ccd44eb] __cxxabiv1::__terminate()
(FSDPPolicyWorkerBase pid=959049) /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so(+0x94696a) [0x7ce2e414696a] c10d::ProcessGroupNCCL::Watchdog::run()
(FSDPPolicyWorkerBase pid=959049) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xd3b65) [0x7d122ccf1b65] execute_native_thread_routine
(FSDPPolicyWorkerBase pid=959049) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7d122f094ac3]
(FSDPPolicyWorkerBase pid=959049) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1268c0) [0x7d122f1268c0]
(FSDPPolicyWorkerBase pid=959049)
(FSDPPolicyWorkerBase pid=959049) [2025-12-06 13:02:11,830 E 959049 961204] logging.cc:125: Stack trace:
(FSDPPolicyWorkerBase pid=959049)  /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/ray/_raylet.so(+0x15fd74a) [0x7d122e3fd74a] ray::operator<<()
(FSDPPolicyWorkerBase pid=959049) /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/ray/_raylet.so(+0x15fe1fc) [0x7d122e3fe1fc] ray::RayLog::operator<< <>()
(FSDPPolicyWorkerBase pid=959049) /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/ray/_raylet.so(+0x16007b8) [0x7d122e4007b8] ray::TerminateHandler()
(FSDPPolicyWorkerBase pid=959049) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xb64f2) [0x7d122ccd44f2] __cxxabiv1::__terminate()
(FSDPPolicyWorkerBase pid=959049) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(_ZSt10unexpectedv+0) [0x7d122ccce2f3] std::unexpected()
(FSDPPolicyWorkerBase pid=959049) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xb64eb) [0x7d122ccd44eb] __cxxabiv1::__terminate()
(FSDPPolicyWorkerBase pid=959049) /home/ray/.cache/uv/builds-v0/.tmph3Lvcm/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so(+0x94696a) [0x7ce2e414696a] c10d::ProcessGroupNCCL::Watchdog::run()
(FSDPPolicyWorkerBase pid=959049) /home/ray/anaconda3/bin/../lib/libstdc++.so.6(+0xd3b65) [0x7d122ccf1b65] execute_native_thread_routine
(FSDPPolicyWorkerBase pid=959049) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7d122f094ac3]
(FSDPPolicyWorkerBase pid=959049) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x1268c0) [0x7d122f1268c0]
(FSDPPolicyWorkerBase pid=959049)
(FSDPPolicyWorkerBase pid=959049)
(FSDPPolicyWorkerBase pid=959049) *** SIGABRT received at time=1765026131 on cpu 0 ***
(FSDPPolicyWorkerBase pid=959049) PC: @     0x7d122f0969fc  (unknown)  pthread_kill
(FSDPPolicyWorkerBase pid=959049)     @     0x7d122f042520  (unknown)  (unknown)
(FSDPPolicyWorkerBase pid=959049) [2025-12-06 13:02:11,832 E 959049 961204] logging.cc:474: *** SIGABRT received at time=1765026131 on cpu 0 ***
(FSDPPolicyWorkerBase pid=959049) [2025-12-06 13:02:11,832 E 959049 961204] logging.cc:474: PC: @     0x7d122f0969fc  (unknown)  pthread_kill
(FSDPPolicyWorkerBase pid=959049) [2025-12-06 13:02:11,833 E 959049 961204] logging.cc:474:     @     0x7d122f042520  (unknown)  (unknown)
(FSDPPolicyWorkerBase pid=959049) Fatal Python error: Aborted
(FSDPPolicyWorkerBase pid=959049)
(FSDPPolicyWorkerBase pid=959049) Extension modules: msgpack._cmsgpack, psutil._psutil_linux, google._upb._message, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, uvloop.loop, ray._raylet, numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, pyarrow.lib, pyarrow._json, regex._regex, markupsafe._speedups, PIL._imaging, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, _cyutility, scipy._cyutility, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._slsqplib, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._special_ufuncs, scipy.special._gufuncs, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb,
scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._hausdorff, scipy.spatial._distance_wrap, scipy.spatial.transform._rotation, scipy.spatial.transform._rigid_transform, scipy.optimize._direct, PIL._imagingft, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, _cffi_backend, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, xxhash._xxhash, pyarrow._acero, pyarrow._csv, pyarrow._substrait, pyarrow._dataset, pyarrow._dataset_orc, pyarrow._parquet_encryption, pyarrow._dataset_parquet_encryption, pyarrow._dataset_parquet, msgspec._core, _cbor2, setproctitle._setproctitle, zmq.backend.cython._zmq, pybase64._pybase64, cuda_utils, __triton_launcher (total: 161)
(FSDPPolicyWorkerBase pid=959294) [rank1]:[E1206 13:02:38.408721158 ProcessGroupNCCL.cpp:685] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19962, OpType=_ALLGATHER_BASE, NumelIn=68125120, NumelOut=545000960, Timeout(ms)=600000) ran for 600005 milliseconds before timing out.
(FSDPPolicyWorkerBase pid=959294) [rank1]:[E1206 13:02:38.410314398 ProcessGroupNCCL.cpp:2252] [PG ID 0 PG GUID 0(default_pg) Rank 1]  failure detected by watchdog at work sequence id: 19962 PG status: last enqueued work: 19964, last completed work: 19961
(FSDPPolicyWorkerBase pid=959294) [rank1]:[E1206 13:02:38.410399804 ProcessGroupNCCL.cpp:732] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
(skyrl_entrypoint pid=952112) 2025-12-06 13:03:10.478 | INFO     | skyrl_train.trainer:train_critic_and_policy:1038 - Finished: 'policy_train', time cost: 898.52s
(skyrl_entrypoint pid=952112) 2025-12-06 13:03:10.483 | INFO     | skyrl_train.trainer:train:267 - Finished: 'train_critic_and_policy', time cost: 898.53s
(skyrl_entrypoint pid=952112) 2025-12-06 13:03:10.486 | INFO     | skyrl_train.trainer:train:193 - Finished: 'step', time cost: 23727.05s

(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) INFO 12-06 06:25:05 [block_pool.py:378] Successfully reset prefix cache
(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) INFO 12-06 12:33:42 [block_pool.py:378] Successfully reset prefix cache
(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) INFO 12-06 12:33:42 [block_pool.py:378] Successfully reset prefix cache
(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) INFO 12-06 06:27:42 [executor_base.py:205] It took 0.587088 seconds to wake up tags ['kv_cache'].
(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) (RayWorkerWrapper pid=954392) INFO 12-06 12:33:45 [cumem.py:228] CuMemAllocator: sleep freed 52.75 GiB memory in total, of which 0.00 GiB is backed up in CPU and the rest 52.75 GiB is discarded directly.
(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) (RayWorkerWrapper pid=954392) INFO 12-06 05:11:17 [gpu_worker.py:117] Sleep mode freed 69.28 GiB memory, 5.57 GiB memory is still in use. [repeated 2x across cluster]
(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) (RayWorkerWrapper pid=954248) INFO 12-06 12:33:45 [cumem.py:228] CuMemAllocator: sleep freed 52.53 GiB memory in total, of which 0.00 GiB is backed up in CPU and the rest 52.53 GiB is discarded directly.
(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) (RayWorkerWrapper pid=954392) INFO 12-06 12:34:00 [gpu_worker.py:117] Sleep mode freed 61.54 GiB memory, 6.82 GiB memory is still in use.
(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) (RayWorkerWrapper pid=954388) INFO 12-06 12:33:46 [cumem.py:228] CuMemAllocator: sleep freed 52.53 GiB memory in total, of which 0.00 GiB is backed up in CPU and the rest 52.53 GiB is discarded directly. [repeated 2x across cluster]
(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) (RayWorkerWrapper pid=954248) INFO 12-06 12:34:01 [gpu_worker.py:117] Sleep mode freed 62.74 GiB memory, 8.56 GiB memory is still in use.
(AsyncVLLMInferenceEngine pid=953485) (EngineCore_DP0 pid=953841) INFO 12-06 12:34:02 [executor_base.py:189] It took 20.015521 seconds to fall asleep.
(AsyncVLLMInferenceEngine pid=953488) (EngineCore_DP0 pid=953826) INFO 12-06 12:33:43 [block_pool.py:378] Successfully reset prefix cache [repeated 2x across cluster]
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: 140000006e19258805c8555f61570e469502f4b732eeb39ab653a1d10c361ca6 Worker ID: 91dc7bc6b699be719f54e6992d98e61b930ffc3a90be0b9adce29202 Node ID: cd1213a83740a10a34cd25a11bb006b34272b8decaa0e0f50e7efd6b Worker IP address: 172.26.134.226 Worker port: 39485 Worker PID: 959049 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(AsyncVLLMInferenceEngine pid=953488) (EngineCore_DP0 pid=953826) (RayWorkerWrapper pid=954098) INFO 12-06 12:33:47 [cumem.py:228] CuMemAllocator: sleep freed 52.53 GiB memory in total, of which 0.00 GiB is backed up in CPU and the rest 52.53 GiB is discarded directly. [repeated 2x across cluster]
(AsyncVLLMInferenceEngine pid=953488) (EngineCore_DP0 pid=953826) (RayWorkerWrapper pid=954097) INFO 12-06 05:11:18 [gpu_worker.py:117] Sleep mode freed 68.47 GiB memory, 6.51 GiB memory is still in use. [repeated 2x across cluster]
(AsyncVLLMInferenceEngine pid=953488) (EngineCore_DP0 pid=953826) (RayWorkerWrapper pid=954097) INFO 12-06 12:34:02 [gpu_worker.py:117] Sleep mode freed 60.07 GiB memory, 7.78 GiB memory is still in use. [repeated 2x across cluster]
(AsyncVLLMInferenceEngine pid=953488) (EngineCore_DP0 pid=953826) (RayWorkerWrapper pid=954087) INFO 12-06 12:33:48 [cumem.py:228] CuMemAllocator: sleep freed 52.53 GiB memory in total, of which 0.00 GiB is backed up in CPU and the rest 52.53 GiB is discarded directly. [repeated 2x across cluster]
(AsyncVLLMInferenceEngine pid=953488) (EngineCore_DP0 pid=953826) INFO 12-06 12:34:04 [executor_base.py:189] It took 21.142570 seconds to fall asleep.
Error executing job with overrides: ['trainer.algorithm.advantage_estimator=grpo', "data.train_data=['/workspace/data/sql/train.parquet']", "data.val_data=['/workspace/data/sql/validation.parquet']", 'trainer.policy.model.path=Qwen/Qwen2.5-Coder-7B-Instruct', 'trainer.epochs=30', 'trainer.placement.colocate_all=true', 'trainer.strategy=fsdp2', 'trainer.policy.fsdp_config.cpu_offload=false', 'trainer.ref.fsdp_config.cpu_offload=true', 'trainer.policy.optimizer_config.max_grad_norm=0.5', 'trainer.policy.sequence_parallel_size=1', 'trainer.placement.policy_num_gpus_per_node=8', 'trainer.placement.ref_num_gpus_per_node=8', 'generator.num_inference_engines=2', 'generator.inference_engine_tensor_parallel_size=4', 'trainer.train_batch_size=256', 'trainer.micro_forward_batch_size_per_gpu=8', 'trainer.micro_train_batch_size_per_gpu=1', 'trainer.max_prompt_length=6000', 'generator.max_input_length=29000', 'generator.sampling_params.max_generate_length=3000', 'trainer.policy.optimizer_config.lr=1.0e-6', 'trainer.policy_mini_batch_size=256', 'trainer.algorithm.use_kl_loss=false', 'trainer.ckpt_interval=60', 'trainer.hf_save_interval=30', 'trainer.dump_data_batch=true', 'generator.backend=vllm', 'generator.run_engines_locally=true', 'generator.weight_sync_backend=nccl', 'generator.async_engine=true', 'generator.batched=false', 'environment.env_class=text2sql', 'generator.use_conversation_multi_turn=false', 'generator.n_samples_per_prompt=5', 'generator.gpu_memory_utilization=0.7', 'generator.max_turns=6', 'generator.sampling_params.temperature=0.6', 'generator.sampling_params.top_p=0.95', 'generator.sampling_params.stop=["</sql>", "</solution>"]', 'generator.eval_sampling_params.stop=["</sql>", "</solution>"]', 'environment.skyrl_gym.text2sql.db_path=/workspace/data/db/data', 'trainer.logger=wandb', 'trainer.project_name=skyrlsql', 'trainer.run_name=skyrlsql_repro', 'trainer.resume_mode=latest', 'trainer.ckpt_path=/workspace/data/ckpts/skyrl_sql_7B_ckpt', 
'trainer.eval_batch_size=1024', 'trainer.eval_before_train=true', 'trainer.eval_interval=5']
Traceback (most recent call last):
  File "/home/ray/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 331, in main
    ray.get(skyrl_entrypoint.remote(cfg))
  File "/home/ray/.cache/uv/builds-v0/.tmpi9PYmt/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmpi9PYmt/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmpi9PYmt/lib/python3.12/site-packages/ray/_private/worker.py", line 2961, in get
    values, debugger_breakpoint = worker.get_objects(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/.cache/uv/builds-v0/.tmpi9PYmt/lib/python3.12/site-packages/ray/_private/worker.py", line 1026, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ActorDiedError): ray::skyrl_entrypoint() (pid=952112, ip=172.26.134.226)
  File "/home/ray/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 322, in skyrl_entrypoint
    exp.run()
  File "/home/ray/SkyRL/skyrl-train/skyrl_train/entrypoints/main_base.py", line 315, in run
    trainer.train()
  File "/tmp/ray/session_2025-12-06_00-13-20_479069_938355/runtime_resources/working_dir_files/_ray_pkg_020c09aa138b4335/skyrl_train/trainer.py", line 268, in train
    status = self.train_critic_and_policy(training_input)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-12-06_00-13-20_479069_938355/runtime_resources/working_dir_files/_ray_pkg_020c09aa138b4335/skyrl_train/trainer.py", line 1040, in train_critic_and_policy
    policy_statuses = ray.get(self.policy_model.async_run_ray_method("mesh", "ppo_train", data))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: FSDPPolicyWorkerBase
	actor_id: 67075fffce7f3ff41a873b8201000000
	pid: 959049
	namespace: 73ab4044-8e64-4c73-a095-0aa288e76642
	ip: 172.26.134.226
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
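For what it's worth, the numbers in the watchdog line look internally consistent with an ordinary FSDP parameter all-gather across all 8 ranks (NumelOut / NumelIn = 8), so the collective itself doesn't look malformed; it just never completed. A quick sanity check, with the values copied from the log above:

```python
# Values copied from the ProcessGroupNCCL watchdog line above.
numel_in = 68_125_120     # NumelIn: elements contributed by this rank
numel_out = 545_000_960   # NumelOut: total elements after the all-gather

# If this is a plain _ALLGATHER_BASE, out/in should equal the world size.
world_size = numel_out // numel_in
print(world_size)  # 8 -> matches the 8 H100s on the node

# Rough per-rank payload assuming bf16 (2 bytes/element) -- an assumption,
# since the dtype isn't in the log:
print(numel_in * 2 / 2**20)  # ~130 MiB
```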

What I have tried

I’m now testing with smaller batch sizes and sequence lengths (e.g. reducing train_batch_size, policy_mini_batch_size, and/or max_prompt_length) to see whether this is actually an OOM-driven failure that surfaces as an NCCL timeout, while enabling NCCL debug environment variables such as TORCH_NCCL_TRACE_BUFFER_SIZE and NCCL_DEBUG=INFO.
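Concretely, I'm exporting these before launching the run script (a hypothetical wrapper; note that under Ray the variables may also need to be propagated to the worker processes, e.g. via the runtime environment, to take effect inside the actors):

```shell
# Enable the FlightRecorder that the ProcessGroupNCCL log line mentions,
# so the next timeout dumps a trace of in-flight collectives.
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000

# Verbose NCCL logging; SUBSYS narrows it to init + collective calls.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL

# then: bash examples/text_to_sql/run_skyrl_sql.sh
```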

Questions

  1. Has anyone run into this error before?
  2. Is reducing batch sizes / sequence lengths the right first thing to try?

Any help would be greatly appreciated! Thank you!
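As a stopgap while debugging, I also considered raising the watchdog timeout itself (the log shows the default Timeout(ms)=600000). In plain PyTorch this is the `timeout` argument of `init_process_group`; I haven't checked whether or how SkyRL exposes it in config, so the call below is only illustrative:

```python
from datetime import timedelta

# Default watchdog timeout seen in the log: Timeout(ms)=600000, i.e. 10 min.
default_timeout = timedelta(milliseconds=600_000)
print(default_timeout.total_seconds() / 60)  # 10.0

# A longer limit, to distinguish "slow but progressing" from truly hung:
longer_timeout = timedelta(minutes=30)

# In plain PyTorch this would be passed at process-group init time
# (illustrative only -- SkyRL/Ray set up the process group internally):
# import torch.distributed as dist
# dist.init_process_group("nccl", timeout=longer_timeout)
```

Of course this only masks the problem if the real cause is an OOM-killed rank, which is why I'm checking the worker logs and dmesg first.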
