🐛 Describe the bug
Fixed it for this one specific case here, sadly don't have the time to put in a proper pr for all models for all cases: BitPhinix@c455526
TLDR:
Ever since huggingface/transformers#34191, Transformers expects every model that accepts `**kwargs` to scale its loss by `num_items_in_batch`. Right now, the loss is effectively multiplied by the number of gradient accumulation steps when using Liger kernels.
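A minimal sketch of why the scaling matters (illustrative numbers only, not the real Liger or Trainer code): with gradient accumulation, the Trainer sums the micro-batch losses without dividing by the number of accumulation steps, so a model that returns a mean-reduced loss ends up inflated by that factor, while a model that divides by `num_items_in_batch` does not.

```python
# Hypothetical token losses illustrating the scaling convention.
def loss_mean(token_losses):
    # Mean reduction: what a model does when it ignores num_items_in_batch.
    return sum(token_losses) / len(token_losses)

def loss_scaled(token_losses, num_items_in_batch):
    # Post-#34191 convention: sum, then divide by the token count
    # of the *whole* accumulated batch.
    return sum(token_losses) / num_items_in_batch

# Two micro-batches forming one accumulated batch of 8 tokens total.
micro_batches = [[1.0] * 4, [1.0] * 4]
num_items = sum(len(mb) for mb in micro_batches)

# The Trainer sums micro-batch losses across accumulation steps.
accumulated_mean = sum(loss_mean(mb) for mb in micro_batches)
accumulated_scaled = sum(loss_scaled(mb, num_items) for mb in micro_batches)

print(accumulated_mean)    # 2.0 — inflated by the number of accumulation steps
print(accumulated_scaled)  # 1.0 — the intended per-token average
```

With more accumulation steps the gap grows proportionally, which matches the behaviour described above.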
Reproduce
No response
Versions
all