Fix LTX-2 image-to-video generation failure in two stages generation by Songrui625 · Pull Request #13187 · huggingface/diffusers

Songrui625 · 2026-02-26T12:37:36Z

What does this PR do?

Fixes a failure in LTX-2 two-stage image-to-video generation.

The following sampling script for two-stage image-to-video generation reproduces the failure:

import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from diffusers import LTX2ImageToVideoPipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.utils import load_image

device = "cuda:0"
random_seed = 42
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "/data00/models/LTX-2"

pipe = LTX2ImageToVideoPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe = pipe.to(device)

image_path = "/data00/ltx2_i2v_input.png"
image = load_image(image_path)

prompt = 'A close-up shot of a young waitress in a retro 1950s diner, her warm brown eyes meeting the camera with a gentle smile. She wears a black polka-dot dress with an elegant cream lace collar, her reddish-brown hair styled in an elaborate updo with delicate curls framing her freckled face. Soft, warm light from overhead fixtures illuminates her features as she stands behind a yellow counter. The camera begins slightly to her side, then slowly pushes in toward her face, revealing the subtle rosy blush on her cheeks. In the blurred background, the soft teal walls and a glowing red "Diner" sign create a nostalgic atmosphere. The ambient sounds of clinking dishes, distant conversations, and the gentle hum of a jukebox fill the air. She tilts her head slightly and says in a friendly, warm voice: "Welcome to Rosie\'s. What can I get for you today?" The mood is inviting, timeless, and full of classic American diner charm.'
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

frame_rate = 24.0
video_latent, audio_latent = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=8,
    sigmas=None,
    guidance_scale=4.0,
    output_type="latent",
    return_dict=False,
)

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    model_path,
    subfolder="latent_upsampler",
    torch_dtype=torch.bfloat16,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
# upsample_pipe.enable_model_cpu_offload(device=device)
upsample_pipe = upsample_pipe.to(device)
upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

pipe.load_lora_weights(
    "/data00/models/LTX-2", adapter_name="stage_2_distilled", weight_name="ltx-2-19b-distilled-lora-384.safetensors"
)
pipe.set_adapters("stage_2_distilled", 1.0)
# VAE tiling is usually necessary to avoid OOM error when VAE decoding
pipe.vae.enable_tiling()
# Change scheduler to use Stage 2 distilled sigmas as is
new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
)
pipe.scheduler = new_scheduler

# Stage 2 inference with distilled LoRA and sigmas
video, audio = pipe(
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=3,
    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0], # renoise with first sigma value https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/distilled.py#L178
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    generator=generator,
    guidance_scale=1.0,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate, # should be 24000
    output_path="video.mp4"
)

It fails with the following error:

Traceback (most recent call last):
  File "/app/ltx2_i2v.py", line 70, in <module>
    video, audio = pipe(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/ltx2/pipeline_ltx2_image2video.py", line 1045, in __call__
    latents, conditioning_mask = self.prepare_latents(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/ltx2/pipeline_ltx2_image2video.py", line 709, in prepare_latents
    latents = self._create_noised_state(latents, noise_scale * (1 - conditioning_mask), generator)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/ltx2/pipeline_ltx2_image2video.py", line 622, in _create_noised_state
    noised_latents = noise_scale * noise + (1 - noise_scale) * latents
RuntimeError: The size of tensor a (24) must match the size of tensor b (48) at non-singleton dimension 4

In LTX-2's two-stage image-to-video generation task, specifically after the upsampling step, a shape mismatch occurs between the latents and the conditioning_mask, which causes an error in function _create_noised_state.
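The sizes in the traceback can be reproduced with a little shape arithmetic. The sketch below is illustrative only. It assumes a spatial compression ratio of 32, a temporal compression ratio of 8, 128 latent channels, a 2x spatial-only upsampler, and a `[batch, channels, frames, height, width]` layout. None of these is stated in this PR; the ratio of 32 is chosen because 768 / 32 = 24 matches the traceback. `latent_shape` and `broadcast` are hypothetical helpers written for this example, not diffusers functions.

```python
def latent_shape(batch, channels, num_frames, height, width,
                 spatial_ratio=32, temporal_ratio=8):
    """Latent shape [B, C, F, H, W] for a pixel-space video (ratios are assumptions)."""
    return (
        batch,
        channels,
        (num_frames - 1) // temporal_ratio + 1,
        height // spatial_ratio,
        width // spatial_ratio,
    )


def broadcast(a, b):
    """Broadcast two equal-rank shapes, checking dimensions from last to first."""
    assert len(a) == len(b), "this sketch only handles shapes of equal rank"
    out = []
    for dim in reversed(range(len(a))):
        if a[dim] != b[dim] and 1 not in (a[dim], b[dim]):
            raise ValueError(
                f"size of a ({a[dim]}) must match size of b ({b[dim]}) at dimension {dim}"
            )
        out.append(max(a[dim], b[dim]))
    return tuple(reversed(out))


# Stage 1 latents for width=768, height=512, num_frames=121.
stage1 = latent_shape(1, 128, 121, 512, 768)          # (1, 128, 16, 16, 24)

# Assumed 2x spatial upsampling: height and width double, frames stay the same.
upsampled = stage1[:3] + (stage1[3] * 2, stage1[4] * 2)  # (1, 128, 16, 32, 48)

# A mask sized from the stage 1 resolution instead of the upsampled latents.
stale_mask = latent_shape(1, 1, 121, 512, 768)        # (1, 1, 16, 16, 24)

# A mask sized from the latents themselves, with a single broadcastable channel.
fixed_mask = (upsampled[0], 1) + upsampled[2:]        # (1, 1, 16, 32, 48)

try:
    broadcast(stale_mask, upsampled)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
print(broadcast(fixed_mask, upsampled))
```

Under these assumptions the stale mask fails at dimension 4 with 24 versus 48, the same pair of sizes reported in the `RuntimeError`, while a mask sized from the upsampled latents broadcasts cleanly.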

After applying this patch, the error no longer occurs.

ltx2_video2.mp4

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul
@DN6

sayakpaul

Thanks! Could you also add a simple test case for this?

In LTX-2's two-stage image-to-video generation task, specifically after the upsampling step, a shape mismatch occurs between the `latents` and the `conditioning_mask`, which causes an error in function `_create_noised_state`. Fix it by creating the `conditioning_mask` based on the shape of the `latents`.
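The idea behind the fix can be sketched without any framework. The toy example below is not the patched diffusers code: it uses nested Python lists in place of tensors, a hypothetical `make_conditioning_mask` helper that sizes the mask from the latents it is given, and a placeholder `sigma` instead of the real `STAGE_2_DISTILLED_SIGMA_VALUES[0]`. The blending formula is the one shown in the traceback, `noise_scale * noise + (1 - noise_scale) * latents`.

```python
import random


def make_conditioning_mask(latents):
    """Mask with the same [frames][values] layout as `latents`.

    1.0 marks the first (image-conditioned) frame, 0.0 marks every other frame.
    """
    return [[1.0 if f == 0 else 0.0 for _ in frame] for f, frame in enumerate(latents)]


def create_noised_state(latents, noise_scale, rng):
    """Blend each latent with Gaussian noise: scale * noise + (1 - scale) * latent."""
    return [
        [s * rng.gauss(0.0, 1.0) + (1.0 - s) * x for x, s in zip(frame, scales)]
        for frame, scales in zip(latents, noise_scale)
    ]


# Toy "upsampled" latents: 3 frames with 4 values each.
latents = [[0.5, -1.0, 2.0, 0.0] for _ in range(3)]

sigma = 0.9  # placeholder value, NOT the real first stage 2 sigma

# The mask is derived from the latents, so its size always matches them.
mask = make_conditioning_mask(latents)

# Conditioned positions get a noise scale of 0 and are therefore left untouched.
noise_scale = [[sigma * (1.0 - m) for m in row] for row in mask]

noised = create_noised_state(latents, noise_scale, random.Random(42))
```

Because the mask takes its dimensions from `latents`, it stays consistent whether the latents come straight from stage 1 or from the upsampler. The first frame keeps its original values, and the remaining frames are re-noised before stage 2 denoising.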

dg845

Thanks for the PR! I agree with #13187 (review) that adding a test case for this would be useful.

HuggingFaceDocBuilderDev · 2026-02-27T02:18:28Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

dg845 · 2026-02-27T02:18:52Z

@bot /style

github-actions · 2026-02-27T02:19:15Z

Style bot fixed some files and pushed the changes.

Songrui625 · 2026-02-27T04:45:18Z

@sayakpaul @dg845 Hi, I pushed the unit test. Please take another look. Thanks!

dg845 · 2026-02-27T05:58:01Z

tests/pipelines/ltx2/test_ltx2_image2video.py

+        upsampler = LTX2LatentUpsamplerModel(
+            in_channels=in_channels,
+        )


Suggested change

-        upsampler = LTX2LatentUpsamplerModel(
-            in_channels=in_channels,
-        )
+        upsampler = LTX2LatentUpsamplerModel(
+            in_channels=in_channels,
+            mid_channels=32,
+            num_blocks_per_stage=1,
+        )

Would it be possible to use a smaller latent upsampler so that the test_two_stages_inference_with_upsampler test is less heavy? Maybe something like the suggestion above?

Songrui625

Nice catch! Updated.

dg845 · 2026-02-27T08:26:58Z

@bot /style

github-actions · 2026-02-27T08:27:28Z

Style bot fixed some files and pushed the changes.

Songrui625 force-pushed the fix-ltx2-i2v-2stages branch from c07e9bf to 12e0305 on February 26, 2026 12:39

sayakpaul requested a review from dg845 February 26, 2026 12:47

sayakpaul reviewed Feb 26, 2026

dg845 approved these changes Feb 27, 2026

Add unit test for LTX-2 i2v two stages inference with upsampler

389bb94

Songrui625 force-pushed the fix-ltx2-i2v-2stages branch from 490fc13 to 389bb94 on February 27, 2026 04:43

dg845 reviewed Feb 27, 2026

Downscaling the upsampler in LTX-2 image-to-video unit test

83e7b90

Apply style fixes

391fb3d

dg845 merged commit 40e9645 into huggingface:main Feb 27, 2026
11 checks passed

dg845 merged 4 commits into huggingface:main from Songrui625:fix-ltx2-i2v-2stages
