System Info
- transformers version: 4.57.3 (cherry-pick of commit 7e0ea6997411f2633712cec5c475b791efe69785)
- Platform: Linux-5.4.203-1-tlinux4-0011.spr.0001-x86_64-with-glibc2.38
- Python version: 3.10.19
- Huggingface_hub version: 0.36.0
- Safetensors version: 0.5.3
- Accelerate version: 1.12.0
- Accelerate config: not found
- DeepSpeed version: 0.18.2
- PyTorch version (accelerator?): 2.6.0+cu124 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: yes
- Using GPU in script?: yes
- GPU type: NVIDIA H800
Accelerate config:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_hostfile: accelerate_configs/hostfile
  deepspeed_multinode_launcher: pdsh
  gradient_clipping: auto
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: x.x.x.x
main_process_port: 8005
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
parallelism_config:
  parallelism_config_sp_size: 4
  parallelism_config_dp_replicate_size: 4
  parallelism_config_sp_backend: deepspeed
  parallelism_config_sp_seq_length_is_variable: true
  parallelism_config_sp_attn_implementation: flash_attention_2
```
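For reference, the process topology this config implies, as a quick arithmetic sanity check (not accelerate API; the 8-GPUs-per-machine figure is inferred from 16 processes over 2 machines):

```python
# Topology implied by the config above (sketch; values copied from the YAML).
num_machines = 2
num_processes = 16
gpus_per_machine = num_processes // num_machines  # 8 x H800 per node (inferred)

sp_size = 4            # parallelism_config_sp_size: sequence-parallel group size
dp_replicate_size = 4  # parallelism_config_dp_replicate_size: DP replicas

# The two parallel dimensions must tile the world size exactly.
assert sp_size * dp_replicate_size == num_processes
```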
Who can help?
@SunMarc
Information
Tasks
Reproduction
When using the trl `SFTTrainer` with `--use_liger_kernel true`, the training loop is broken: `outputs.logits` is `None`, and the `ForCausalLMLoss` method then calls `logits = logits.float()`, which fails on the `None` value. The failing call site in the Trainer:
transformers/src/transformers/trainer.py, lines 3950 to 3955 in c12dfdd:

```python
loss = unwrapped_model.loss_function(
    logits=outputs.logits,
    labels=None,
    shift_labels=shift_labels,
    vocab_size=unwrapped_model.config.vocab_size,
)
```

The cast that breaks is `logits = logits.float()` at transformers/src/transformers/loss/loss_utils.py, line 54 in c12dfdd.
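The crash can be shown in isolation by calling the loss function the way this Trainer branch does when the liger kernel leaves `outputs.logits` unset (a minimal sketch; the shift labels and vocab size are arbitrary placeholders):

```python
import torch
from transformers.loss.loss_utils import ForCausalLMLoss

# With the liger kernel, the fused linear cross-entropy computes the loss
# without ever materializing logits, so the model output carries logits=None.
# The Trainer's sequence-parallel branch forwards that None straight here:
ForCausalLMLoss(
    logits=None,                                # what outputs.logits holds
    labels=None,
    shift_labels=torch.tensor([[1, 2, -100]]),  # placeholder shifted labels
    vocab_size=32000,                           # placeholder vocab size
)
# -> AttributeError: 'NoneType' object has no attribute 'float'
```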
Expected behavior
DeepSpeed sequence parallelism (SP) should work with the liger kernel enabled.
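One possible direction, as a hypothetical sketch only (not a proposed patch): guard the loss-recomputation path so it never dereferences logits the liger kernel did not produce. Whether `outputs.loss` is already correctly reduced across SP ranks in this setup would still need checking.

```python
# Hypothetical guard around the failing Trainer call (sketch only):
if outputs.logits is None:
    # e.g. liger's fused linear cross-entropy already computed the loss
    # inside the forward pass and never materialized logits
    loss = outputs.loss
else:
    loss = unwrapped_model.loss_function(
        logits=outputs.logits,
        labels=None,
        shift_labels=shift_labels,
        vocab_size=unwrapped_model.config.vocab_size,
    )
```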