use_liger_kernel is not compatible with sequence parallel #42414

@jue-jue-zi

Description

System Info

  • transformers version: 4.57.3 (cherry-pick commit 7e0ea6997411f2633712cec5c475b791efe69785)
  • Platform: Linux-5.4.203-1-tlinux4-0011.spr.0001-x86_64-with-glibc2.38
  • Python version: 3.10.19
  • Huggingface_hub version: 0.36.0
  • Safetensors version: 0.5.3
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: 0.18.2
  • PyTorch version (accelerator?): 2.6.0+cu124 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: yes
  • Using GPU in script?: yes
  • GPU type: NVIDIA H800

Accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_hostfile: accelerate_configs/hostfile
  deepspeed_multinode_launcher: pdsh
  gradient_clipping: auto
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_process_ip: x.x.x.x
main_process_port: 8005
main_training_function: main
mixed_precision: 'no'
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
parallelism_config:
  parallelism_config_sp_size: 4
  parallelism_config_dp_replicate_size: 4
  parallelism_config_sp_backend: deepspeed
  parallelism_config_sp_seq_length_is_variable: true
  parallelism_config_sp_attn_implementation: flash_attention_2
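
For reference, the process topology these values imply (a sanity-check sketch; the variable names simply mirror the YAML keys above):

num_processes = 16       # num_processes: 2 machines x 8 H800 GPUs each
sp_size = 4              # parallelism_config_sp_size
dp_replicate_size = 4    # parallelism_config_dp_replicate_size

# The SP x DP grid must cover every rank: 4 * 4 = 16.
assert sp_size * dp_replicate_size == num_processes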

Who can help?

@SunMarc

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When using TRL's SFTTrainer with --use_liger_kernel true under DeepSpeed sequence parallelism, the training loop breaks: outputs.logits comes back as None, and ForCausalLMLoss then calls logits = logits.float() on it, which raises an AttributeError.

The sequence-parallel loss path in the trainer calls:

loss = unwrapped_model.loss_function(
    logits=outputs.logits,  # None when the Liger kernel is active
    labels=None,
    shift_labels=shift_labels,
    vocab_size=unwrapped_model.config.vocab_size,
)

and ForCausalLMLoss then fails on its first use of the logits:

logits = logits.float()  # AttributeError: 'NoneType' object has no attribute 'float'
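
A defensive guard along these lines would surface the problem cleanly (a hypothetical sketch, not a merged fix; it assumes that when logits are skipped, a loss computed in forward is acceptable to reuse):

if outputs.logits is not None:
    loss = unwrapped_model.loss_function(
        logits=outputs.logits,
        labels=None,
        shift_labels=shift_labels,
        vocab_size=unwrapped_model.config.vocab_size,
    )
elif outputs.loss is not None:
    # e.g. a fused Liger cross-entropy already produced the loss in forward
    loss = outputs.loss
else:
    raise RuntimeError(
        "outputs.logits is None (use_liger_kernel?) and forward returned no "
        "loss; the sequence-parallel loss path cannot proceed."
    )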

Expected behavior

DeepSpeed sequence parallelism should work with use_liger_kernel enabled.
