Hi,
Great work!
However, when I tried to reproduce the results reported in the paper, I got the following:
Evaluation loss and perplexity at step 11001 (60M model):
Loss: 6.293065547943115
Perplexity: 540.8086654237502
which are quite different from those reported in the paper.
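For what it's worth, the two numbers above are at least internally consistent: perplexity is the exponential of the mean cross-entropy loss, so the gap from the paper comes from training itself rather than a broken eval step. A quick sanity check (assuming the repo computes perplexity this way):

```python
import math

# Perplexity is exp(mean cross-entropy loss), so the reported pair
# should satisfy perplexity == exp(loss).
loss = 6.293065547943115
perplexity = math.exp(loss)
print(perplexity)  # ≈ 540.8086654237502, matching the logged value
```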

Here are the details of my script:
```shell
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
    --model_config configs/llama_60m.json \
    --lr 0.003 \
    --peft_model sltrain \
    --optimizer adamw \
    --rank 128 \
    --sp_ratio 0.03 \
    --batch_size 256 \
    --total_batch_size 512 \
    --num_training_steps 11000 \
    --warmup_steps 1100 \
    --weight_decay 0 \
    --dtype bfloat16 \
    --eval_every 1000 \
    --save_dir path/to/save \
    --lora_alpha 32
```
Are there any specific hyperparameters I should pay special attention to (e.g., the learning rate)?
Thank you for your help!
Guinan