Nsys profiling reports are not produced by NeMo-RL #762

@jiuqiant

Description

I followed NeMo-RL's official documentation "Profile GPU with Nsys" to run Nsys profiling on both a GCP VM with 8 H100 GPUs and a Slurm cluster on Google Cloud. According to the log, profiling is performed, but the Nsys reports cannot be collected because the process crashes on exit:

Steps/Code to reproduce bug

$ mkdir nemo_rl_training && cd nemo_rl_training && git clone https://github.com/NVIDIA-NeMo/RL

# Build the nemo-rl release image, then run it on a GCP VM with 8 H100 GPUs:

docker run --rm -it --net host --shm-size=16g --ulimit memlock=-1 --ulimit stack=67108864 --gpus device=all -v ~/nemo_rl_training:/workspace/nemo_rl_training nemo-rl:latest /bin/bash

# Run the following in the docker

export HF_TOKEN=<YOUR TOKEN> && cd /workspace/nemo_rl_training/RL

NRL_NSYS_PROFILE_STEP_RANGE=1:9 NRL_NSYS_WORKER_PATTERNS="*policy*,*vllm*" uv run python examples/run_sft.py --config examples/configs/recipes/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v2.yaml logger.wandb_enabled=false
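For context, the command above selects which Ray workers get profiled via `NRL_NSYS_WORKER_PATTERNS`, a comma-separated list of glob patterns. The sketch below illustrates the glob-matching behavior I expect these patterns to have; the worker names are made up for illustration, not taken from NeMo-RL:

```python
import fnmatch

# Comma-separated glob patterns, as passed in NRL_NSYS_WORKER_PATTERNS.
patterns = "*policy*,*vllm*".split(",")

# Hypothetical worker names; real names come from the Ray cluster.
workers = ["lm_policy_worker", "vllm_worker_0", "reward_worker"]

# A worker is profiled if it matches any of the patterns.
selected = [w for w in workers
            if any(fnmatch.fnmatchcase(w, p) for p in patterns)]
print(selected)  # reward_worker matches neither pattern, so it is skipped
```

With these patterns, only the policy and vLLM workers would be wrapped with Nsys, which matches the "Starting GPU profiling for ... Policy" lines in the log below.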

Log:

Installed 1 package in 1ms
Loaded configuration from: examples/configs/recipes/llm/sft-llama3.2-1b-1n8g-fsdp2tp1.v2.yaml
Overrides: ['logger.wandb_enabled=false']
Applied CLI overrides
Final config:
{'checkpointing': {'checkpoint_dir': 'results/sft-llama3.2-1b-1n8g-fsdp2tp1',
                   'enabled': True,
                   'higher_is_better': False,
                   'keep_top_k': 3,
                   'metric_name': 'val_loss',
                   'save_period': 10},

...

========================= Epoch 1/1 =========================

========================= Step 1/9 =========================
Starting GPU profiling for <nemo_rl.models.policy.lm_policy.Policy object at 0x7f92e2fd45f0> for step 1
▶ Preparing batch...
▶ Taking a training step...

📊 Training Results:
  • Loss: 1.6811

⏱️  Timing:
  • Total step time: 1.91s
  • data_processing: 0.00s (0.1%)

...
========================= Step 9/9 =========================
Stopping GPU profiling for <nemo_rl.models.policy.lm_policy.Policy object at 0x7f92e2fd45f0> for step 9
▶ Preparing batch...
▶ Taking a training step...
/workspace/nemo_rl_training/RL/nemo_rl/algorithms/sft.py:461: UserWarning: You asked to save checkpoints based on val_loss but the metric is not found in the save state. Saving most recent k checkpoints instead.
  warnings.warn(
Saving checkpoint for step 9...
/workspace/nemo_rl_training/RL/nemo_rl/utils/checkpoint.py:196: UserWarning: Metric val_loss not found in checkpoint history. Keeping most recent k checkpoints.
  warnings.warn(

📊 Training Results:
  • Loss: 0.4069

⏱️  Timing:
  • Total step time: 22.89s
  • checkpointing: 22.24s (97.2%)
  • data_processing: 0.00s (0.0%)
Stopping GPU profiling on exit for <nemo_rl.models.policy.lm_policy.Policy object at 0x7f92e2fd45f0> for step 1
(DTensorPolicyWorker pid=25927) No weights path provided. Starting from scratch (default policy init) [repeated 7x across cluster]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
... (the tokenizers warning above is repeated 4 more times)
2025-07-25 20:51:25,563 INFO worker.py:1879 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
[2025-07-25 20:51:27,695 C 366 366] actor_manager.cc:55:  Check failed: it != actor_handles_.end() Cannot find an actor handle of id, 633e19dda81ed59f3a0adb2101000000. This method should be called only when you ensure actor handles exists.
*** StackTrace Information ***
/app/nemo_rl_venv/lib/python3.12/site-packages/ray/_raylet.so(+0x14392da) [0x7f8f734b62da] ray::operator<<()
/app/nemo_rl_venv/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x479) [0x7f8f734b8d59] ray::RayLog::~RayLog()
/app/nemo_rl_venv/lib/python3.12/site-packages/ray/_raylet.so(_ZNK3ray4core12ActorManager14GetActorHandleERKNS_7ActorIDE+0x19c) [0x7f8f72a61bbc] ray::core::ActorManager::GetActorHandle()
/app/nemo_rl_venv/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray4core12ActorManager13OnActorKilledERKNS_7ActorIDE+0x21) [0x7f8f72a62e91] ray::core::ActorManager::OnActorKilled()
/app/nemo_rl_venv/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray4core10CoreWorker9KillActorERKNS_7ActorIDEbb+0x33c) [0x7f8f72992d4c] ray::core::CoreWorker::KillActor()
/app/nemo_rl_venv/lib/python3.12/site-packages/ray/_raylet.so(+0x803ce7) [0x7f8f72880ce7] __pyx_pw_3ray_7_raylet_10CoreWorker_89kill_actor()
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(_PyEval_EvalFrameDefault+0x31aeb) [0x7f938ba803bb] _PyEval_EvalFrameDefault
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(PyObject_CallOneArg+0x61) [0x7f938b9e1dd1] PyObject_CallOneArg
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(+0x4c05b8) [0x7f938bb2d5b8] slot_tp_finalize
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(+0x3baf75) [0x7f938ba27f75] subtype_dealloc
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(+0x376900) [0x7f938b9e3900] cell_dealloc
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(+0x3b10ea) [0x7f938ba1e0ea] tupledealloc
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(+0x37d835) [0x7f938b9ea835] func_clear
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(+0x37d3c7) [0x7f938b9ea3c7] func_dealloc
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(+0x32d6ae) [0x7f938b99a6ae] atexit_delete_cb.llvm.3312319065286271768
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(+0x32d61d) [0x7f938b99a61d] atexit_cleanup.llvm.3312319065286271768
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(Py_FinalizeEx+0x61) [0x7f938bbb2991] Py_FinalizeEx
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(Py_RunMain+0x183) [0x7f938bbd52e3] Py_RunMain
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(+0x56898f) [0x7f938bbd598f] pymain_main
/root/.local/share/uv/python/cpython-3.12.10-linux-x86_64-gnu/bin/../lib/libpython3.12.so.1.0(Py_BytesMain+0x2d) [0x7f938bbd5a4d] Py_BytesMain
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca) [0x7f938b37e1ca]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7f938b37e28b] __libc_start_main
/app/nemo_rl_venv/bin/python3(_start+0x29) [0x6000a9] _start

Expected behavior

The program should run without crashing, and the Nsys profiling reports should be available in "/tmp/ray/session_latest/logs/nsight/".
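After a run, I check for reports with a snippet like the one below. The path is Ray's default per-session log location as named in the docs; the `.nsys-rep` extension is the usual Nsight Systems report suffix, assumed here rather than confirmed from NeMo-RL:

```python
from pathlib import Path

# Ray symlinks session_latest to the most recent session directory.
report_dir = Path("/tmp/ray/session_latest/logs/nsight")

# Collect any Nsight Systems reports; empty if the run crashed before flushing.
reports = sorted(report_dir.glob("*.nsys-rep")) if report_dir.is_dir() else []
print(reports if reports else "no nsight reports found")
```

On both the GCP VM and the Slurm cluster this finds nothing, because the crash happens during interpreter finalization before the reports are written.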

Environment overview

Environment location: A docker container running on a GCP VM
