support CP in native flash attention #12829
Conversation
Signed-off-by: Wang, Yi <yi.a.wang@intel.com>
Since native attention only supports Ulysses Attention, we need an attention backend that works for Ring Attention. XPU enables `_scaled_dot_product_flash_attention` in torch, so we can use it for ring attention.
sayakpaul
left a comment
Cool work! Could you also supplement a fully working code snippet?
Yes, the PR also works for CUDA.
You can run it with `torchrun --nproc-per-node 4 test.py`; without this PR, the output is corrupted.
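For reference, a minimal sketch of what such a `test.py` could look like, assuming the `ContextParallelConfig` / `enable_parallelism` API referenced later in this thread; the checkpoint, prompt, and degree value are illustrative, and the device setup assumes CUDA (adjust accordingly for XPU):

```python
# test.py -- run with: torchrun --nproc-per-node 4 test.py
# Minimal sketch; model ID, prompt, and ring_degree are illustrative.
import torch
import torch.distributed as dist
from diffusers import ContextParallelConfig, FluxPipeline

dist.init_process_group()
rank = dist.get_rank()
device = torch.device("cuda", rank % torch.cuda.device_count())
torch.cuda.set_device(device)

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to(device)

# Ring attention across the 4 ranks, using the native flash SDPA backend
# that this PR adds context-parallel support for.
pipeline.transformer.set_attention_backend("_native_flash")
pipeline.transformer.enable_parallelism(config=ContextParallelConfig(ring_degree=4))

generator = torch.Generator().manual_seed(0)
image = pipeline(
    "A photo of a cat", num_inference_steps=28, generator=generator
).images[0]

if rank == 0:
    image.save("output.png")
dist.destroy_process_group()
```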
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Hmm, it should raise an error, no? On
Weird. Will check and fix this. Cc: @DN6 |
Signed-off-by: Wang, Yi <yi.a.wang@intel.com> Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
Okay, I tracked it down. The order in which we're calling things matters. For example, if we do:

```python
pipeline.transformer.set_attention_backend("_native_flash")
pipeline.transformer.enable_parallelism(config=cp_config)
```

it rightfully errors out:

```
[rank0]: ValueError: Context parallelism is enabled but the attention processor 'FluxAttnProcessor' is using backend '_native_flash' which does not support context parallelism. Please set a compatible attention backend: ['_native_cudnn', 'flash', 'native', 'sage'] using `model.set_attention_backend()` before calling `enable_parallelism()`.
```

But for any other combination, it silently passes through. Will fix.
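For concreteness, a hypothetical sketch of one ordering that could slip through (my reading of the above, reusing the same `pipeline` and `cp_config`): the compatibility check runs inside `enable_parallelism()`, so a backend swapped in afterwards is never re-validated.

```python
# Reversed order: parallelism is enabled first, then the backend is changed.
# Nothing re-checks backend compatibility at this point, so an incompatible
# combination proceeds silently instead of raising (before the fix).
pipeline.transformer.enable_parallelism(config=cp_config)
pipeline.transformer.set_attention_backend("_native_flash")
```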


What does this PR do?
Native flash attention can support both Ulysses Attention and Ring Attention.
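As an illustrative sketch (the degree values are assumptions), the two modes would be selected through `ContextParallelConfig`:

```python
from diffusers import ContextParallelConfig

# Ulysses attention: all-to-all exchange over attention heads.
ulysses_config = ContextParallelConfig(ulysses_degree=4)

# Ring attention: the sequence is sharded across ranks and K/V blocks are
# rotated ring-style; this PR lets the _native_flash backend
# (torch's _scaled_dot_product_flash_attention) be used for this path.
ring_config = ContextParallelConfig(ring_degree=4)
```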