
feat(trainer): log individual losses from loss_dict#45558

Closed
Abdeltoto wants to merge 2 commits into huggingface:main from Abdeltoto:feat/trainer-log-loss-dict-31081

Conversation

@Abdeltoto

What does this PR do?

Fixes #31081.

When a model returns auxiliary losses alongside the main loss (e.g. via a
loss_dict field in its ModelOutput), the Trainer currently only logs the
combined loss. That makes debugging multi-term objectives painful: you can
see total loss going down without knowing which term is actually moving.

This PR teaches the Trainer to also log each scalar term it finds in
outputs.loss_dict, plus any top-level *_loss scalar attribute on the
output, under namespaced keys like loss_dict_<name> and loss_<name>.
Behaviour is effectively opt-in, controlled by the model itself: if no extra
losses are returned, nothing changes. The main loss value and all existing
log entries are untouched.

Implementation notes

  • New buffer self._aux_losses_accumulator on the Trainer, mirroring how
    _total_loss_scalar already works.
  • In training_step, after compute_loss(..., return_outputs=True), scalar
    tensor entries from outputs.loss_dict and from outputs.<name>_loss are
    detached, gradient-accumulation-scaled, and accumulated. Same DP mean and
    same num_items_in_batch normalization as the main loss, so the numbers
    are comparable.
  • In _maybe_log_save_evaluate, accumulators are gathered across processes
    with nested_gather (consistent with the main tr_loss path), averaged
    over the logging window, and added to logs. Then the buffers are reset.
  • Non-tensor / non-scalar values are silently ignored, so models that put
    arbitrary metadata in loss_dict won't crash the Trainer.

No public API change. No new dependency. The diff is mostly localized to two
methods in trainer.py.

Tests

tests/trainer/test_trainer.py::TrainerIntegrationTest::test_trainer_logs_auxiliary_losses_from_loss_dict

A small RegressionPreTrainedModelWithLossDict (added to
tests/trainer/trainer_test_utils.py) returns loss, a loss_dict with 'mse'
and 'l1' entries, and a top-level extra_loss. The test runs a tiny
Trainer.train() and asserts the resulting log entries contain the expected
loss_dict_mse, loss_dict_l1, and loss_extra keys with finite,
positive values. Ran locally:

Ran 1 test in 0.245s
OK
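
For reference, a minimal model of the kind the test relies on could look like this. It is a hypothetical sketch, not the actual RegressionPreTrainedModelWithLossDict from the PR, and it returns a plain namespace rather than a real ModelOutput.

```python
import torch
from torch import nn
from types import SimpleNamespace


class TinyRegressionWithLossDict(nn.Module):
    """Toy regression model returning a loss_dict plus a top-level extra_loss.

    Hypothetical stand-in for the PR's RegressionPreTrainedModelWithLossDict.
    """

    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(1))
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, input_x, labels=None):
        pred = input_x * self.a + self.b
        mse = nn.functional.mse_loss(pred, labels)
        l1 = nn.functional.l1_loss(pred, labels)
        return SimpleNamespace(
            loss=mse + 0.1 * l1,              # combined objective
            loss_dict={"mse": mse, "l1": l1},  # logged as loss_dict_mse / loss_dict_l1
            extra_loss=0.01 * (self.a ** 2).sum(),  # logged as loss_extra
        )
```

A test can then run a short training loop and assert that the log history contains loss_dict_mse, loss_dict_l1, and loss_extra.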

Code Agent Policy

  • I confirm that this is not a pure code agent PR.

I used Cursor as a coding assistant (the commit trailer says
Made-with: Cursor), but I read every diff, ran the tests locally, wrote
the description myself, and own the change. Happy to iterate on review
feedback.

Before submitting

Who can review?

@SunMarc — Trainer maintainer per the template.

- Accumulate scalar terms from outputs.loss_dict and optional top-level *_loss fields
- Apply same DP mean and GA scaling as the main training loss before logging
- Clear auxiliary buffers each log step; add integration test with RegressionPreTrainedModelWithLossDict

Made-with: Cursor
@Rocketknight1
Member

Sorry, we really don't want code agent PRs on old issues like this!



Development

Successfully merging this pull request may close these issues.

Log multiple losses used along with the combined losses when a model returns a dictionary of losses.
