Fix LTX-2 image-to-video generation failure in two stages generation#13187

Merged
dg845 merged 4 commits into huggingface:main from Songrui625:fix-ltx2-i2v-2stages
Feb 27, 2026

Conversation

Songrui625 (Contributor) commented Feb 26, 2026

What does this PR do?

Fix a failure in LTX-2 two-stage image-to-video generation.

The following two-stage image-to-video sampling script reproduces the failure:

import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from diffusers import LTX2ImageToVideoPipeline, LTX2LatentUpsamplePipeline
from diffusers.pipelines.ltx2.latent_upsampler import LTX2LatentUpsamplerModel
from diffusers.pipelines.ltx2.utils import DISTILLED_SIGMA_VALUES, STAGE_2_DISTILLED_SIGMA_VALUES
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.utils import load_image

device = "cuda:0"
random_seed = 42
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "/data00/models/LTX-2"

pipe = LTX2ImageToVideoPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe = pipe.to(device)

image_path = "/data00/ltx2_i2v_input.png"
image = load_image(image_path)

prompt = 'A close-up shot of a young waitress in a retro 1950s diner, her warm brown eyes meeting the camera with a gentle smile. She wears a black polka-dot dress with an elegant cream lace collar, her reddish-brown hair styled in an elaborate updo with delicate curls framing her freckled face. Soft, warm light from overhead fixtures illuminates her features as she stands behind a yellow counter. The camera begins slightly to her side, then slowly pushes in toward her face, revealing the subtle rosy blush on her cheeks. In the blurred background, the soft teal walls and a glowing red "Diner" sign create a nostalgic atmosphere. The ambient sounds of clinking dishes, distant conversations, and the gentle hum of a jukebox fill the air. She tilts her head slightly and says in a friendly, warm voice: "Welcome to Rosie\'s. What can I get for you today?" The mood is inviting, timeless, and full of classic American diner charm.'
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

frame_rate = 24.0
video_latent, audio_latent = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=121,
    frame_rate=frame_rate,
    num_inference_steps=8,
    sigmas=None,
    guidance_scale=4.0,
    output_type="latent",
    return_dict=False,
)

latent_upsampler = LTX2LatentUpsamplerModel.from_pretrained(
    model_path,
    subfolder="latent_upsampler",
    torch_dtype=torch.bfloat16,
)
upsample_pipe = LTX2LatentUpsamplePipeline(vae=pipe.vae, latent_upsampler=latent_upsampler)
# upsample_pipe.enable_model_cpu_offload(device=device)
upsample_pipe = upsample_pipe.to(device)
upscaled_video_latent = upsample_pipe(
    latents=video_latent,
    output_type="latent",
    return_dict=False,
)[0]

pipe.load_lora_weights(
    "/data00/models/LTX-2", adapter_name="stage_2_distilled", weight_name="ltx-2-19b-distilled-lora-384.safetensors"
)
pipe.set_adapters("stage_2_distilled", 1.0)
# VAE tiling is usually necessary to avoid OOM error when VAE decoding
pipe.vae.enable_tiling()
# Change scheduler to use Stage 2 distilled sigmas as is
new_scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, use_dynamic_shifting=False, shift_terminal=None
)
pipe.scheduler = new_scheduler

# Stage 2 inference with distilled LoRA and sigmas
video, audio = pipe(
    latents=upscaled_video_latent,
    audio_latents=audio_latent,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=3,
    noise_scale=STAGE_2_DISTILLED_SIGMA_VALUES[0], # renoise with first sigma value https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/distilled.py#L178
    sigmas=STAGE_2_DISTILLED_SIGMA_VALUES,
    generator=generator,
    guidance_scale=1.0,
    output_type="np",
    return_dict=False,
)

encode_video(
    video[0],
    fps=frame_rate,
    audio=audio[0].float().cpu(),
    audio_sample_rate=pipe.vocoder.config.output_sampling_rate, # should be 24000
    output_path="video.mp4"
)

It fails with the following error:

Traceback (most recent call last):
  File "/app/ltx2_i2v.py", line 70, in <module>
    video, audio = pipe(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/ltx2/pipeline_ltx2_image2video.py", line 1045, in __call__
    latents, conditioning_mask = self.prepare_latents(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/ltx2/pipeline_ltx2_image2video.py", line 709, in prepare_latents
    latents = self._create_noised_state(latents, noise_scale * (1 - conditioning_mask), generator)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/ltx2/pipeline_ltx2_image2video.py", line 622, in _create_noised_state
    noised_latents = noise_scale * noise + (1 - noise_scale) * latents
RuntimeError: The size of tensor a (24) must match the size of tensor b (48) at non-singleton dimension 4

In LTX-2's two-stage image-to-video generation, the `latents` and the `conditioning_mask` have mismatched shapes after the upsampling step, which raises the error above in `_create_noised_state`.
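The 24-versus-48 mismatch in the traceback can be traced with plain shape arithmetic. The sketch below is illustrative only: it assumes a VAE spatial compression of 32, a temporal compression of 8, a 2x spatial latent upsampler, 128 latent channels, and a mask with a single channel that was sized from the stage 1 resolution. None of these values are read from the pipeline, and the helper names `latent_shape` and `first_broadcast_conflict` are hypothetical.

```python
def latent_shape(batch, channels, num_frames, height, width,
                 spatial_ratio=32, temporal_ratio=8):
    """Shape of 5D video latents [B, C, F, H, W] under the assumed compression ratios."""
    latent_frames = (num_frames - 1) // temporal_ratio + 1
    return (batch, channels, latent_frames,
            height // spatial_ratio, width // spatial_ratio)


def first_broadcast_conflict(shape_a, shape_b):
    """Return (dim, size_a, size_b) for the trailing-most dimension that
    cannot broadcast, or None if the two shapes are compatible."""
    assert len(shape_a) == len(shape_b)
    for dim in reversed(range(len(shape_a))):
        a, b = shape_a[dim], shape_b[dim]
        if a != b and a != 1 and b != 1:
            return dim, a, b
    return None


# Stage 1 settings from the script above: 768x512 pixels, 121 frames.
stage1_latents = latent_shape(1, 128, 121, 512, 768)

# Assumed 2x spatial upsampling between the two stages.
b, c, f, h, w = stage1_latents
stage2_latents = (b, c, f, h * 2, w * 2)

# Assumption: before the patch the mask kept the stage 1 spatial size.
stale_mask = (b, 1, f, h, w)
# After the patch the mask follows the shape of the latents it is applied to.
fixed_mask = (b, 1) + stage2_latents[2:]

print(first_broadcast_conflict(stale_mask, stage2_latents))
print(first_broadcast_conflict(fixed_mask, stage2_latents))
```

Under these assumptions the first check reports a conflict at dimension 4 with sizes 24 and 48, matching the `RuntimeError`, and the second check reports no conflict.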

After applying this patch, the previously mentioned error is fixed.

ltx2_video2.mp4

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@sayakpaul
@DN6

sayakpaul requested a review from dg845 February 26, 2026 12:47

sayakpaul (Member) left a comment

Thanks! Could you also add a simple test case for this?

In LTX-2's two-stage image-to-video generation task, specifically after
the upsampling step, a shape mismatch occurs between the `latents` and
the `conditioning_mask`, which causes an error in function
`_create_noised_state`.

Fix it by creating the `conditioning_mask` based on the shape of the
`latents`.
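
As a rough illustration of that idea (not the actual diff), the mask can be derived directly from the tensor it will be applied to, so the two can never disagree in size. The helper names `make_conditioning_mask` and `create_noised_state`, the `[B, 1, F, H, W]` mask layout, the tensor sizes, and the `0.9` noise scale are all assumptions made for this sketch; only the blend formula is taken from the traceback.

```python
import torch


def make_conditioning_mask(latents: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: mask of shape [B, 1, F, H, W] derived from `latents`,
    set to 1.0 on the first latent frame (the image-conditioned frame)."""
    batch, _, frames, height, width = latents.shape
    mask = torch.zeros(batch, 1, frames, height, width,
                       dtype=latents.dtype, device=latents.device)
    mask[:, :, 0] = 1.0  # the first frame stays clean
    return mask


def create_noised_state(latents, noise_scale, generator=None):
    """Blend from the traceback: noise_scale * noise + (1 - noise_scale) * latents."""
    noise = torch.randn(latents.shape, generator=generator, dtype=latents.dtype)
    return noise_scale * noise + (1 - noise_scale) * latents


# Stand-in for upsampled stage 2 latents (small channel and frame counts for speed).
latents = torch.ones(1, 4, 3, 32, 48)
mask = make_conditioning_mask(latents)

generator = torch.Generator().manual_seed(0)
# Per-position noise scale: zero where the mask marks conditioned positions.
noised = create_noised_state(latents, 0.9 * (1 - mask), generator)

print(noised.shape == latents.shape)
print(torch.equal(noised[:, :, 0], latents[:, :, 0]))
```

Because the mask is built from `latents.shape`, the same code works whether the latents come straight from stage 1 or from the upsampler, and the conditioned first frame is left untouched by the re-noising.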

dg845 (Collaborator) left a comment

Thanks for the PR! I agree with #13187 (review) that adding a test case for this would be useful.


dg845 (Collaborator) commented Feb 27, 2026

@bot /style

github-actions bot (Contributor) commented Feb 27, 2026

Style bot fixed some files and pushed the changes.

Songrui625 (Contributor, Author) commented

@sayakpaul @dg845 Hi, I pushed the unit test. Please take another look. Thanks!

Comment on lines +179 to +181 (Collaborator)

    upsampler = LTX2LatentUpsamplerModel(
        in_channels=in_channels,
    )

Suggested change

    - upsampler = LTX2LatentUpsamplerModel(
    -     in_channels=in_channels,
    - )
    + upsampler = LTX2LatentUpsamplerModel(
    +     in_channels=in_channels,
    +     mid_channels=32,
    +     num_blocks_per_stage=1,
    + )

Would it be possible to use a smaller latent upsampler so that the test_two_stages_inference_with_upsampler test is less heavy? Maybe something like the suggestion above?

Reply (Contributor, Author)

Nice catch! Updated.

dg845 (Collaborator) commented Feb 27, 2026

@bot /style

github-actions bot (Contributor) commented Feb 27, 2026

Style bot fixed some files and pushed the changes.

dg845 merged commit 40e9645 into huggingface:main on Feb 27, 2026
11 checks passed

4 participants