
[Don't review] ChunkedCELoss #2937

Draft
wwwjn wants to merge 4 commits into main from chunked-loss

Conversation

@wwwjn (Contributor) commented Apr 10, 2026

Not ready for review

wwwjn added 4 commits April 10, 2026 15:40
Implements chunked cross-entropy loss that splits the sequence dimension
into N chunks, computing lm_head projection and CE loss per-chunk to avoid
materializing the full [B, L, V] logits tensor at once.

Key components:
- ChunkedCELoss: wraps lm_head + ce_loss with chunked forward/backward
- GradAccumulator: pre-allocated buffer for assembling chunk gradients
- _no_reshard_after_backward: FSDP2 context to avoid N all-gathers
- skip_lm_head kwarg on Decoder.forward() for the detach boundary
- ChunkedCELossFactory: deferred initialization (model not available at build time)
- Trainer integration with dedicated forward_backward_step branch
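The per-chunk pattern the commit message describes can be sketched roughly as follows. This is a minimal, FSDP-free illustration; `chunked_ce_loss` and its signature are illustrative, not the PR's actual API:

```python
import torch
import torch.nn.functional as F

def chunked_ce_loss(hidden, labels, lm_weight, num_chunks=4):
    """Split the sequence dim into num_chunks, computing the lm_head
    projection and CE loss per chunk so the full [B, L, V] logits tensor
    is never materialized at once."""
    B, L, _ = hidden.shape
    total = hidden.new_zeros(())
    for h_chunk, y_chunk in zip(
        hidden.chunk(num_chunks, dim=1), labels.chunk(num_chunks, dim=1)
    ):
        # Only a [B, L/num_chunks, V] logits slice is live at any time.
        logits = F.linear(h_chunk, lm_weight)
        total = total + F.cross_entropy(
            logits.flatten(0, 1), y_chunk.flatten(), reduction="sum"
        )
    return total / (B * L)  # mean over all tokens, matching unchunked CE
```

Summing per-chunk losses with `reduction="sum"` and dividing by the total token count keeps the result identical to computing cross-entropy over the full logits tensor.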
…CELoss

- Add loss_num_chunks to TrainingConfig (default 1, no-op)
- Trainer auto-wraps loss_fn in ChunkedCELossFactory when loss_num_chunks > 1
- Integration tests for FSDP, FSDP+TP(SP), FSDP+CP, FSDP+TP+CP, FSDP+compile
FSDP2's backward hooks are one-shot per forward pass. The previous approach
of calling self.lm_head(h_chunk) triggered FSDP2's backward hooks during
chunk backward, leaving no hooks for the decoder backward (h.backward(grad)),
causing zero gradients on model parameters.

Fix: Use F.linear(h_chunk, lm_weight) to bypass FSDP2 module hooks during
chunk computation. Use (h * accumulated_grad).sum().backward() instead of
h.backward(grad) to properly trigger FSDP2's hooks in a single backward pass.
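The two-phase fix above can be sketched without FSDP2 as follows (function name and signature are illustrative): chunk backwards run against a detached copy of the hidden states, and a single `(h * grad).sum().backward()` then carries the accumulated gradient through whatever produced `h`:

```python
import torch
import torch.nn.functional as F

def forward_backward(h, labels, lm_weight, num_chunks=2):
    """Two-phase chunked loss: returns the scalar loss value and leaves
    gradients on the producers of h (the decoder, in the PR's terms)."""
    B, L, _ = h.shape
    h_det = h.detach().requires_grad_(True)
    total_loss = 0.0
    # Phase 1: per-chunk projection + CE backward against the detached
    # copy. F.linear avoids module forward/backward hooks, so an
    # FSDP2-wrapped lm_head's one-shot hooks would not be consumed here.
    for h_chunk, y_chunk in zip(
        h_det.chunk(num_chunks, dim=1), labels.chunk(num_chunks, dim=1)
    ):
        logits = F.linear(h_chunk, lm_weight)
        chunk_loss = F.cross_entropy(
            logits.flatten(0, 1), y_chunk.flatten(), reduction="sum"
        ) / (B * L)
        chunk_loss.backward()  # accumulates into h_det.grad
        total_loss += chunk_loss.item()
    # Phase 2: one backward through the producer of h carrying the
    # accumulated gradient -- the (h * grad).sum().backward() form that
    # fires the one-shot backward hooks exactly once.
    (h * h_det.grad.detach()).sum().backward()
    return total_loss
```

Because `dL/dh` is fully assembled before phase 2, the single backward pass produces the same parameter gradients as an unchunked `loss.backward()`.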
Replace bare function + build_fn pattern with proper loss classes.
CrossEntropyLoss and MSELoss encapsulate compilation logic internally.
The old function names (cross_entropy_loss, mse_loss) remain as public API
for backward compatibility. build_cross_entropy_loss and build_mse_loss
now return class instances.
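The class pattern this commit describes might look roughly like the sketch below. The class and builder names come from the commit message; the constructor signature and internals are assumptions:

```python
import torch
import torch.nn.functional as F

class CrossEntropyLoss:
    """Callable loss object that owns its compilation state, replacing the
    bare-function + build_fn pattern."""

    def __init__(self, compile: bool = False):
        # Compilation is encapsulated here instead of at the call site.
        self._fn = torch.compile(self._loss) if compile else self._loss

    @staticmethod
    def _loss(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return F.cross_entropy(pred.flatten(0, 1).float(), labels.flatten())

    def __call__(self, pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        return self._fn(pred, labels)

def build_cross_entropy_loss(compile: bool = False) -> CrossEntropyLoss:
    # Per the commit message, the builder now returns a class instance.
    return CrossEntropyLoss(compile=compile)
```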
meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Apr 10, 2026
@wwwjn wwwjn changed the title ChunkedCELoss [Don't review] ChunkedCELoss Apr 10, 2026
"chunked_loss_fsdp+tp+cp",
ngpu=8,
),
OverrideDefinitions(
wwwjn (author) commented:
Need to consolidate these into a single compound test.



def cross_entropy_loss(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
class CrossEntropyLoss:
wwwjn (author) commented:
Need a cleaner refactor for the loss part.

"""Initialize the gradient accumulator.

Args:
reference: Reference tensor to get shape, device, and dtype from.
wwwjn (author) commented:
Pass the shape directly.
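The accumulator the quoted docstring describes might look like the sketch below, initialized from a reference tensor for shape and device as the docstring says (the review comment suggests passing the shape directly instead). `result()` appears elsewhere in the diff; `add` and the fixed float32 buffer are assumptions based on the PR's `accumulated_grad.dtype == torch.float32` assertion:

```python
import torch

class GradAccumulator:
    """Pre-allocated buffer for assembling per-chunk gradients."""

    def __init__(self, reference: torch.Tensor):
        # Allocate once, in float32, so per-chunk adds don't reallocate.
        self._buf = torch.zeros(
            reference.shape, device=reference.device, dtype=torch.float32
        )

    def add(self, start: int, end: int, grad: torch.Tensor) -> None:
        # Accumulate one chunk's gradient into its sequence-dim slice.
        self._buf[:, start:end] += grad.to(torch.float32)

    def result(self) -> torch.Tensor:
        return self._buf
```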

h_chunk.grad = None
del scaled_chunk_loss, chunk_loss, logits

# Get the accumulated gradient and backward through the decoder.
wwwjn (author) commented:
Revisit this FSDP trigger bug

# Use F.linear instead of self.lm_head(h_chunk) to bypass FSDP2's
# module forward/backward hooks. This ensures FSDP2's one-shot
# backward hooks remain available for the decoder backward below.
logits = torch.nn.functional.linear(h_chunk, lm_weight)
wwwjn (author) commented:
Refactor: avoid using F.linear here, but still trigger the FSDP hooks at the correct time.

accumulated_grad = grad_accumulator.result()
assert accumulated_grad.dtype == torch.float32

decoder_loss = (hidden_states * accumulated_grad.to(hidden_states.dtype)).sum()
wwwjn (author) commented:
This is too tricky; find a way to trigger FSDP's hooks explicitly.

"""Factory for creating ChunkedCELoss after model construction.

Since ChunkedCELoss needs the model's lm_head, and the model is not available
at loss builder time, this factory is returned by build_chunked_cross_entropy_loss
wwwjn (author) commented:
Remove the build function
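The deferred-initialization idea from the quoted docstring can be sketched as follows: the factory is returned at loss-builder time, and only materializes the loss once the constructed model (and its lm_head) exists. The `ChunkedCELoss` stand-in here is a minimal placeholder, not the PR's implementation, and the factory's `__call__(model)` signature is an assumption:

```python
import torch
import torch.nn.functional as F

class ChunkedCELoss:
    """Minimal stand-in: per-chunk lm_head projection + CE loss."""

    def __init__(self, lm_head: torch.nn.Linear, num_chunks: int):
        self.lm_head = lm_head
        self.num_chunks = num_chunks

    def __call__(self, hidden: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        B, L, _ = hidden.shape
        loss = hidden.new_zeros(())
        for h_c, y_c in zip(
            hidden.chunk(self.num_chunks, 1), labels.chunk(self.num_chunks, 1)
        ):
            logits = F.linear(h_c, self.lm_head.weight)
            loss = loss + F.cross_entropy(
                logits.flatten(0, 1), y_c.flatten(), reduction="sum"
            )
        return loss / (B * L)

class ChunkedCELossFactory:
    """Deferred initialization: the model isn't available at build time,
    so the trainer invokes the factory after model construction."""

    def __init__(self, num_chunks: int):
        self.num_chunks = num_chunks

    def __call__(self, model: torch.nn.Module) -> ChunkedCELoss:
        return ChunkedCELoss(model.lm_head, self.num_chunks)
```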


Labels: ciflow/8gpu, CLA Signed