draft: feat: fused loss and logit to logprob conversion #994

Open
jiemingz wants to merge 2 commits into main from jiemingz/loss_funcs

Conversation

@jiemingz
Contributor

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
@terrykong
Collaborator

Thanks! What gains can we expect from this? Also, is this related to #496?

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
@jiemingz
Contributor Author

Unrelated to #496, but we can expect the memory spikes seen at the loss functions to go away.
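For context, a minimal sketch of the kind of fused logit-to-logprob conversion being described, in the single-GPU (non-vocab-parallel) case; the function and tensor names here are illustrative, not this PR's actual API:

import torch

# Illustrative only: compute per-token log-probs without keeping a separate
# full [batch, seq, vocab] log_softmax buffer around.
# torch.compile(fullgraph=True) lets the normalization and gather fuse into one graph.
@torch.compile(fullgraph=True)
def fused_token_logprobs(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    logits = logits.to(torch.float32)                  # [batch, seq, vocab]
    logz = torch.logsumexp(logits, dim=-1)             # [batch, seq]
    target_logits = torch.gather(
        logits, dim=-1, index=targets.unsqueeze(-1)    # targets: [batch, seq] int64
    ).squeeze(-1)                                      # [batch, seq]
    return target_logits - logz                        # per-token log-probs

In the vocab-parallel path touched by this diff (the vocab_start_index masking and distributed_logprob_forward below), the logsumexp and gather would additionally need reductions across the vocab shards.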

@euronymous-aithal
Contributor

@guyueh1 can you please review this?

Contributor

@guyueh1 left a comment


Overall I like the idea of running a full-graph torch compile on the logprob function, but since the softmax output is still kept, memory use will still be large; we may need another path to completely remove the overhead. Is this only saving the memory for the mask tensor and the log_softmax tensor? Can you provide data on how much it saves?

Also, please resolve the comments.
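For a rough sense of scale, a back-of-envelope estimate of the log_softmax buffer being discussed (assumed shapes, not measurements from this PR):

# Illustrative estimate only; batch/seq/vocab sizes are assumptions.
batch, seq, vocab = 1, 4096, 131072
bytes_per_fp32 = 4
log_softmax_bytes = batch * seq * vocab * bytes_per_fp32
print(f"full fp32 log_softmax tensor: {log_softmax_bytes / 2**30:.1f} GiB")  # ~2.0 GiB per sequence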

from torch.distributed.tensor import DTensor, distribute_tensor


@torch.no_grad()

Is it necessary to remove the @torch.no_grad()?

masked_target_chunk = target - vocab_start_index
masked_target_chunk[target_mask_chunk] = 0

distributed_logprob_forward(

Why is this not returning anything? And are the subsequent lines 196-217 necessary, or should they be removed?

logits = logits.to(dtype=torch.float32)

softmax_output = _compute_distributed_log_softmax(
softmax_output = _distributed_logprob_forward(

This name is not found; should it be distributed_logprob_forward?

