🐛 Describe the bug
Fixed it for this one specific case here, sadly don't have the time to put in a proper pr for all models for all cases: BitPhinix@c455526
TLDR:
Ever since huggingface/transformers#34191, Transformers expects every model that accepts `**kwargs` to scale its loss by `num_items_in_batch`. Right now, the loss is effectively multiplied by the number of gradient accumulation steps when using Liger kernels.
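A minimal sketch of why the scaling matters (illustrative numbers only, not the real Liger or Trainer code): with gradient accumulation, the Trainer sums the micro-batch losses without dividing by the number of accumulation steps, so a model that returns a mean-reduced loss ends up inflated by that factor, while a model that divides by `num_items_in_batch` does not.

```python
# Hypothetical token losses illustrating the scaling convention.
def loss_mean(token_losses):
    # Mean reduction: what a model does when it ignores num_items_in_batch.
    return sum(token_losses) / len(token_losses)

def loss_scaled(token_losses, num_items_in_batch):
    # Post-#34191 convention: sum, then divide by the token count
    # of the *whole* accumulated batch.
    return sum(token_losses) / num_items_in_batch

# Two micro-batches forming one accumulated batch of 8 tokens total.
micro_batches = [[1.0] * 4, [1.0] * 4]
num_items = sum(len(mb) for mb in micro_batches)

# The Trainer sums micro-batch losses across accumulation steps.
accumulated_mean = sum(loss_mean(mb) for mb in micro_batches)
accumulated_scaled = sum(loss_scaled(mb, num_items) for mb in micro_batches)

print(accumulated_mean)    # 2.0 — inflated by the number of accumulation steps
print(accumulated_scaled)  # 1.0 — the intended per-token average
```

With more accumulation steps the gap grows proportionally, which matches the behaviour described above.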
Reproduce
No response
Versions
all