Loss does not drop when using Liger Kernel with Qwen2.5 #257

### 🐛 Describe the bug

I am trying to instruction-tune Qwen2.5-14B-Instruct with [Liger Kernel](https://github.com/linkedin/Liger-Kernel).

I know that Liger Kernel is supported in the dev version of Hugging Face Transformers. However, when I train the Qwen2.5 model with Liger Kernel enabled, the loss does not decrease. Is Qwen2.5 not supported yet?

### Reproduce

Python code example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer

model_name = "Qwen/Qwen2.5-14B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

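# Dataset loading and TrainingArguments setup are omitted here;
# `training_args` and `train_dataset` are defined in this elided part.
...

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)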
trainer.train()
```

Run command:
```bash
deepspeed --include localhost:0,1 --master_port 61000 train.py \
    --learning_rate=1e-5 \
    --lr_scheduler_type=cosine \
    --max_length=8192 \
    --per_device_train_batch_size=4 \
    --gradient_accumulation_steps=1 \
    --evaluation_strategy=no \
    --num_train_epochs=3 \
    --save_strategy=epoch \
    --logging_strategy=steps \
    --logging_steps=1 \
    --save_total_limit=1 \
    --remove_unused_columns=False \
    --dataloader_num_workers=16 \
    --warmup_ratio=0.03 \
    --gradient_checkpointing=True \
    --torch_compile=True \
    --optim=adafactor \
    --bf16 \
    --deepspeed=./config/zero3.json \
    --use_liger_kernel=True
```
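To make "the loss does not drop" easier to quantify, the same command can be run once with and once without `--use_liger_kernel=True`, and the logged losses compared. The helper below is only a sketch: it assumes the Trainer prints dict-style log lines such as `{'loss': 1.23, 'grad_norm': 0.5, ...}` to stdout, and the function names `extract_losses` and `loss_drop` are hypothetical, not part of Transformers or Liger Kernel.

```python
import ast
import re

# Assumed log format: the Trainer prints one Python-dict-like line per
# logging step, e.g. {'loss': 1.23, 'grad_norm': 0.5, 'epoch': 0.01}.
# The quoted key 'loss' deliberately does not match 'train_loss' or 'eval_loss'.
LOG_LINE = re.compile(r"\{.*'loss'.*\}")


def extract_losses(log_text):
    """Return the training losses found in captured stdout, in order."""
    losses = []
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if not match:
            continue
        try:
            record = ast.literal_eval(match.group(0))
        except (ValueError, SyntaxError):
            # Lines containing bare nan/inf or other junk are skipped.
            continue
        if isinstance(record, dict) and isinstance(record.get("loss"), (int, float)):
            losses.append(float(record["loss"]))
    return losses


def loss_drop(losses, window=10):
    """Mean of the first `window` losses minus the mean of the last `window`.

    A clearly positive value means the loss went down over the run;
    a value near zero means it stayed flat.
    """
    if len(losses) < 2:
        raise ValueError("need at least two logged losses")
    window = max(1, min(window, len(losses) // 2))
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return head - tail
```

With the stdout of both runs saved to files, calling `loss_drop(extract_losses(text))` on each gives one number per run, which shows whether only the Liger Kernel run stays flat.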

### Versions

Environment Report:
-------------------
Operating System: Linux-5.15.0-1047-oracle-x86_64-with-glibc2.35
Python version: 3.10.14
PyTorch version: 2.4.0+cu121
CUDA version: 12.1
Triton version: 3.0.0
Transformers version: 4.45.0.dev0
