use_liger_kernel is not compatible with sequence parallel #42414

@jue-jue-zi

Description

System Info

  • transformers version: 4.57.3 (cherry-pick commit 7e0ea6997411f2633712cec5c475b791efe69785)
  • Platform: Linux-5.4.203-1-tlinux4-0011.spr.0001-x86_64-with-glibc2.38
  • Python version: 3.10.19
  • Huggingface_hub version: 0.36.0
  • Safetensors version: 0.5.3
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: 0.18.2
  • PyTorch version (accelerator?): 2.6.0+cu124 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes
  • Using GPU in script?: yes
  • GPU type: NVIDIA H800

Accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_hostfile: accelerate_configs/hostfile
  deepspeed_multinode_launcher: pdsh
  gradient_clipping: auto
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: x.x.x.x
main_process_port: 8005
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
parallelism_config:
  parallelism_config_sp_size: 4
  parallelism_config_dp_replicate_size: 4
  parallelism_config_sp_backend: deepspeed
  parallelism_config_sp_seq_length_is_variable: true
  parallelism_config_sp_attn_implementation: flash_attention_2
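
For reference, the process topology these values imply (a sanity-check sketch; the variable names simply mirror the YAML keys above):

num_processes = 16       # num_processes: 2 machines x 8 H800 GPUs each
sp_size = 4              # parallelism_config_sp_size
dp_replicate_size = 4    # parallelism_config_dp_replicate_size

# The SP x DP grid must cover every rank: 4 * 4 = 16.
assert sp_size * dp_replicate_size == num_processes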

Who can help?

@SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When using TRL's SFTTrainer with --use_liger_kernel true under DeepSpeed sequence parallelism, the training loop breaks: outputs.logits comes back as None, and ForCausalLMLoss then calls logits = logits.float() on it, which raises an AttributeError.

The sequence-parallel loss path in the trainer calls:

loss = unwrapped_model.loss_function(
    logits=outputs.logits,  # None when the Liger kernel is active
    labels=None,
    shift_labels=shift_labels,
    vocab_size=unwrapped_model.config.vocab_size,
)

and ForCausalLMLoss then fails on its first use of the logits:

logits = logits.float()  # AttributeError: 'NoneType' object has no attribute 'float'
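
A defensive guard along these lines would surface the problem cleanly (a hypothetical sketch, not a merged fix; it assumes that when logits are skipped, a loss computed in forward is acceptable to reuse):

if outputs.logits is not None:
    loss = unwrapped_model.loss_function(
        logits=outputs.logits,
        labels=None,
        shift_labels=shift_labels,
        vocab_size=unwrapped_model.config.vocab_size,
    )
elif outputs.loss is not None:
    # e.g. a fused Liger cross-entropy already produced the loss in forward
    loss = outputs.loss
else:
    raise RuntimeError(
        "outputs.logits is None (use_liger_kernel?) and forward returned no "
        "loss; the sequence-parallel loss path cannot proceed."
    )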

Expected behavior

DeepSpeed sequence parallelism should work with use_liger_kernel enabled.
