Skip to content

docs: End-to-end timeline view with nsys#1114

Merged
chtruong814 merged 2 commits intoNVIDIA-NeMo:mainfrom
youngeunkwon0405:update_nsys_doc
Sep 18, 2025
Merged

docs: End-to-end timeline view with nsys#1114
chtruong814 merged 2 commits intoNVIDIA-NeMo:mainfrom
youngeunkwon0405:update_nsys_doc

Conversation

@youngeunkwon0405
Copy link
Copy Markdown
Contributor

@youngeunkwon0405 youngeunkwon0405 commented Sep 10, 2025

What does this PR do ?

This PR shows how to create a single timeline view of the End-to-end RL loop in the Nsight Systems.

The example figure is an LLaMa 3 8B Async GRPO run that the generation and the training are overlapped to each other.

image

FYI: @parthchadha @guyueh1

Issues

List issues that this PR closes (syntax):

CLOSES: #569

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Documentation
    • Added guidance on analyzing the end-to-end RL loop using Nsight Systems’ multi-report view.
    • Explains how to open and view profiles from multiple workers in a single timeline to correlate policy and generation activity.
    • Included repeated placement of this guidance to make profiling instructions more discoverable across docs.

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 10, 2025

Walkthrough

Added two identical documentation blocks to docs/nsys-profiling.md explaining how to analyze the end-to-end RL loop by loading multiple Nsight Systems reports (e.g., *policy_worker*.nsys-rep, *generation_worker*.nsys-rep) into a single multi-report timeline. No code or API changes.

Changes

Cohort / File(s) Summary of Changes
Documentation updates
docs/nsys-profiling.md
Inserted two duplicate sections titled “How to Analyze the End-to-End RL Loop All at Once,” describing Nsight Systems multi-report view and examples (\policy\_worker\.nsys-rep, \generation\_worker\.nsys-rep). No logic or API changes.

Sequence Diagram(s)

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

Thump-thump, my paws tap docs in sync,
Two mirrored notes—no code to rethink.
Profiles align, timelines entwine,
Policy and generation share one line.
I nibble carrots, charts in view—one hop, two hops—multi-reports accrue! 🥕✨


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0bf6f5e and ce909fe.

⛔ Files ignored due to path filters (1)
  • docs/assets/nsys-multi-report-view.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • docs/nsys-profiling.md (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/nsys-profiling.md
✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The current pull request title concisely and accurately describes the primary change by indicating that documentation has been added for an end-to-end timeline view using Nsight Systems, matching the main content of the changeset.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

@github-actions github-actions Bot added the Documentation Improvements or additions to documentation label Sep 10, 2025
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 915c79c and 0bf6f5e.

📒 Files selected for processing (1)
  • docs/nsys-profiling.md (1 hunks)

Comment thread docs/nsys-profiling.md
Copy link
Copy Markdown
Collaborator

@terrykong terrykong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is awesome @youngeunkwon0405

does this close #569 @guyueh1 ?

if so, @youngeunkwon0405 can you add "Closes #569" in the PR description since it only closes issues if that language is added to the description

@terrykong
Copy link
Copy Markdown
Collaborator

@youngeunkwon0405 also can you add the figure to our docs, i think it's a helpful visual

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
@youngeunkwon0405
Copy link
Copy Markdown
Contributor Author

youngeunkwon0405 commented Sep 10, 2025

@terrykong Added the figure to the doc. The visualization in the doc will look like this.

image

For closing the issue #569, let me wait for @guyueh1 's confirmation.

@terrykong
Copy link
Copy Markdown
Collaborator

Also, was that unified profile created from an async run since the training and generation overlap? If so, did you also solve #922?

@youngeunkwon0405
Copy link
Copy Markdown
Contributor Author

Also, was that unified profile created from an async run since the training and generation overlap? If so, did you also solve #922?

Yes, the run is an async run. For the NVTX marker, we still need a fix. If you look at the profile screenshot, you can see that only the training side has the nvtx marker.

@guyueh1
Copy link
Copy Markdown
Contributor

guyueh1 commented Sep 16, 2025

#569 should be closed. The remained issue in 569 (verifying if nsys works for AsyncLLM) is followed-up in another issue.
approved

@youngeunkwon0405
Copy link
Copy Markdown
Contributor Author

Also, was that unified profile created from an async run since the training and generation overlap? If so, did you also solve #922?

Hi @terrykong , can we merge this MR? For the #922, I am working on it from a different branch, since it is a different issue.

@terrykong terrykong added this pull request to the merge queue Sep 17, 2025
@chtruong814 chtruong814 removed this pull request from the merge queue due to the queue being cleared Sep 18, 2025
@chtruong814 chtruong814 merged commit 61e235c into NVIDIA-NeMo:main Sep 18, 2025
30 checks passed
yfw pushed a commit that referenced this pull request Sep 23, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Holistic Profiling Tool in RL Workflow

5 participants