
TP support for reverse KL loss #400

Merged
oleksost merged 11 commits into main from rev_kl_tp on Dec 4, 2025

Conversation

Contributor

@oleksost oleksost commented Dec 2, 2025

TP support for reverse KL loss.

  • Adds support for vocabulary-parallel reverse KL loss calculation using torch (no fused implementation); a rough sketch of the idea is shown after this list.
  • Sequence-parallel loss calculation is not supported, to keep things simple (I don't think we use sequence-parallel embeddings/head).
  • This also fixes a small bug in the CE loss when it is used for distillation.
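For context, the rough shape of a vocabulary-parallel reverse KL, i.e. KL(student || teacher), computed in plain torch is sketched below. This is not the PR's code (the actual implementation is _torch_reverse_kl_forward_backward in fast_llm/functional/cross_entropy.py, which also computes the gradient manually); all names here are made up for illustration, and the sketch only computes the loss value.

```python
import torch
import torch.distributed as dist


def reverse_kl_vocab_parallel(
    student_logits: torch.Tensor,  # (tokens, local_vocab): this rank's vocab shard
    teacher_logits: torch.Tensor,  # (tokens, local_vocab): same shard of the teacher
    group: dist.ProcessGroup | None = None,
) -> torch.Tensor:
    """Reverse KL, KL(student || teacher), with vocabulary-parallel logits.

    Loss value only: inputs are assumed detached, since plain collectives are
    not autograd-aware (the real implementation handles the backward manually).
    The softmax normalizers are obtained with all-reduces over the vocab
    dimension, so each rank only ever materializes its own vocab shard.
    """

    def _log_softmax_sharded(logits: torch.Tensor) -> torch.Tensor:
        # Global max over the full vocabulary, for numerical stability.
        logits_max = logits.max(dim=-1, keepdim=True).values
        if group is not None:
            dist.all_reduce(logits_max, op=dist.ReduceOp.MAX, group=group)
        shifted = logits - logits_max
        # Global normalizer: sum of exp over the full vocabulary.
        sum_exp = shifted.exp().sum(dim=-1, keepdim=True)
        if group is not None:
            dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=group)
        return shifted - sum_exp.log()

    student_log_probs = _log_softmax_sharded(student_logits)
    teacher_log_probs = _log_softmax_sharded(teacher_logits)

    # Local contribution: sum over this rank's vocab shard of p_s * (log p_s - log p_t).
    local_kl = (student_log_probs.exp() * (student_log_probs - teacher_log_probs)).sum(dim=-1)
    if group is not None:
        # The vocab sum is split across ranks, so the partial sums are added up.
        dist.all_reduce(local_kl, op=dist.ReduceOp.SUM, group=group)
    return local_kl.mean()
```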

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

  • added _torch_reverse_kl_forward_backward in cross_entropy.py
  • added test_rkl_loss (a rough sketch of the kind of consistency check involved is shown below)
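Loosely, a single-process consistency check against torch.nn.functional.kl_div could look like the sketch below, reusing the reverse_kl_vocab_parallel helper sketched in the description above with group=None. The real test_rkl_loss presumably also exercises the tensor-parallel path with spawned workers; names and tolerances here are illustrative.

```python
import torch
import torch.nn.functional as F


def test_reverse_kl_matches_reference():
    torch.manual_seed(0)
    student_logits = torch.randn(8, 32)
    teacher_logits = torch.randn(8, 32)

    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so putting the student in the "target" slot gives the reverse KL.
    ref = F.kl_div(
        F.log_softmax(teacher_logits, dim=-1),
        F.log_softmax(student_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )

    # With group=None the sharded sketch reduces to a plain single-device reverse KL.
    loss = reverse_kl_vocab_parallel(student_logits, teacher_logits, group=None)
    torch.testing.assert_close(loss, ref, atol=1e-6, rtol=1e-6)
```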

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

@oleksost oleksost marked this pull request as ready for review December 2, 2025 19:52
Comment thread fast_llm/functional/cross_entropy.py Outdated
Comment thread fast_llm/functional/config.py Outdated
Comment thread fast_llm/models/gpt/conversion/llama.py
Comment thread tests/test_distillation_loss.py Outdated
# then we average: 1/K sum_ranks (log Z - sum_i t_i * z_i)
# = log Z - 1/K sum_ranks (sum_i t_i * z_i)
# but sum_ranks (sum_i t_i * z_i) = sum_i t_i * z_i (over all vocab)
predicted_logits = predicted_logits * group.size()
Collaborator

This looks wrong, see previous comment. The previous version was tested and confirmed to work.

Contributor Author

@oleksost oleksost Dec 3, 2025

Was it also tested with soft labels (i.e. when the targets are logits)? Without this scaling the new test does not pass.

The reason is that when we average the loss over ranks here, we effectively compute 1/K * sum_ranks (log Z - sum_i t_i * z_i), where sum_i t_i * z_i is the local predicted_logits and K is the number of ranks. That gives (1/K) * K * log Z - (1/K) * predicted_logits_global, so the 1/K factor on the global predicted_logits does not cancel out unless predicted_logits is scaled by K beforehand.
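To make the argument concrete, here is a toy check with made-up numbers (K = 2 ranks, a 4-entry vocab split two entries per rank); this is not code from the PR:

```python
import torch

t = torch.tensor([0.1, 0.4, 0.3, 0.2])   # soft-label weights over the full vocab
z = torch.tensor([2.0, -1.0, 0.5, 1.5])  # logits over the full vocab
log_z = torch.logsumexp(z, dim=0)        # global log-normalizer, identical on every rank

true_loss = log_z - (t * z).sum()        # what a single rank holding the full vocab computes

K = 2
local_sums = [(t[:2] * z[:2]).sum(), (t[2:] * z[2:]).sum()]  # per-rank partial predicted_logits

# Averaging the per-rank losses without scaling under-counts the cross term by 1/K.
unscaled = sum(log_z - s for s in local_sums) / K
# Scaling each local partial sum by K before the average recovers the exact loss.
scaled = sum(log_z - K * s for s in local_sums) / K

assert not torch.isclose(unscaled, true_loss)
assert torch.isclose(scaled, true_loss)
```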

Collaborator

Sorry I didn't realize this was for distillation only. This one is less robustly tested so errors are possible. But if I understand correctly we just need to replace the mean reduction below with a sum reduction on predicted_logits only?

Contributor Author

@oleksost oleksost Dec 4, 2025

Yeah, either of the two works (sketched below):

  • scale predicted_logits by the group size and keep everything as is (i.e. still an AVG reduction on the loss)
  • or do a SUM reduction on predicted_logits instead of the AVG reduction on the loss below
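Roughly, with hypothetical names (log_z is the global log-normalizer, already identical on every rank, and local_predicted_logits is this rank's partial sum_i t_i * z_i), the two options look like this; note that dist.ReduceOp.AVG needs the NCCL backend:

```python
import torch
import torch.distributed as dist


def reduce_distillation_loss(
    log_z: torch.Tensor,
    local_predicted_logits: torch.Tensor,
    group: dist.ProcessGroup,
    scale_then_avg: bool = True,
) -> torch.Tensor:
    if scale_then_avg:
        # Option 1: scale the local partial sum by the group size, then AVG-reduce the loss.
        loss = log_z - local_predicted_logits * group.size()
        dist.all_reduce(loss, op=dist.ReduceOp.AVG, group=group)
    else:
        # Option 2: SUM-reduce the partial sum itself; the loss then needs no further reduction.
        dist.all_reduce(local_predicted_logits, op=dist.ReduceOp.SUM, group=group)
        loss = log_z - local_predicted_logits
    return loss
```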

@oleksost oleksost requested a review from jlamypoirier December 3, 2025 14:20
Collaborator

@jlamypoirier jlamypoirier left a comment

Looks good, but some suggestions on improving the tests

@@ -0,0 +1,185 @@
import os
Collaborator

Please move to tests/functional

Also consider renaming it to test_cross_entropy (to match the implementation file) and moving the existing test_cross_entropy tests here.

torch.testing.assert_close(loss, ref_loss, atol=1e-6, rtol=1e-6)


def _ce_vocab_tp_worker(rank: int, group: dist.ProcessGroup, use_mask: bool):
Collaborator

We might want to match the implementation and parametrization from test_cross_entropy

@oleksost oleksost merged commit cc90338 into main Dec 4, 2025
3 of 4 checks passed
@oleksost oleksost deleted the rev_kl_tp branch December 4, 2025 23:23