[Pipeline] Port Tune-A-Video pipeline to diffusers #2455
New documentation file (`@@ -0,0 +1,11 @@`):

# Transformer3D

The Transformer2D model extended for video-like data.

## Transformer3DModel

[[autodoc]] Transformer3DModel

## Transformer3DModelOutput

[[autodoc]] models.transformer_3d.Transformer3DModelOutput

New documentation file (`@@ -0,0 +1,122 @@`):

<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Tune-A-Video

## Overview

[Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation](https://arxiv.org/abs/2212.11565) by Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou.

The abstract of the paper is the following:

*To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such paradigm is computationally expensive. In this work, we propose a new T2V generation setting—One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.*

Resources:

* [GitHub repository](https://github.com/showlab/Tune-A-Video)
* [🤗 Spaces](https://huggingface.co/spaces/Tune-A-Video-library/Tune-A-Video-Training-UI)

## Available Pipelines

| Pipeline | Tasks | Demo |
|---|---|:---:|
| [TuneAVideoPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/tune_a_video/pipeline_tune_a_video.py) | *Text-to-Video Generation* | [🤗 Spaces](https://huggingface.co/spaces/Tune-A-Video-library/Tune-A-Video-inference) |

## Usage example

### Loading with a pre-existing Text2Image checkpoint

```python
import torch
from diffusers import TuneAVideoPipeline, DDIMScheduler, UNet3DConditionModel
from diffusers.utils import export_to_video
from PIL import Image

# Use any pretrained Text2Image checkpoint based on Stable Diffusion
pretrained_model_path = "nitrosocke/mo-di-diffusion"
unet = UNet3DConditionModel.from_pretrained(
    "Tune-A-Video-library/df-cpt-mo-di-bear-guitar", subfolder="unet", torch_dtype=torch.float16
).to("cuda")

pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")

prompt = "A princess playing a guitar, modern disney style"
generator = torch.Generator(device="cuda").manual_seed(42)

video_frames = pipe(prompt, video_length=3, generator=generator, num_inference_steps=50, output_type="np").frames

# Save to GIF (assuming 8 frames per second)
pil_frames = [Image.fromarray(frame) for frame in video_frames]
duration = 1000 / 8  # per-frame duration in milliseconds
pil_frames[0].save(
    "animation.gif",
    save_all=True,
    append_images=pil_frames[1:],  # append the remaining frames
    duration=duration,
    loop=0,
)

# Save to video
video_path = export_to_video(video_frames)
```
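
The snippet above imports `DDIMScheduler` but does not use it explicitly, so the pipeline presumably falls back to the scheduler stored in the base checkpoint. If you prefer to set it explicitly, a minimal sketch continuing from the example above (the `subfolder="scheduler"` layout is an assumption based on the usual Stable Diffusion repository structure):

```python
from diffusers import DDIMScheduler

# Hypothetical explicit scheduler setup; the pipeline may already default to this.
pipe.scheduler = DDIMScheduler.from_pretrained(pretrained_model_path, subfolder="scheduler")
```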

### Loading a saved Tune-A-Video checkpoint

```python
import torch
from diffusers import DiffusionPipeline, DDIMScheduler
from diffusers.utils import export_to_video
from PIL import Image

pipe = DiffusionPipeline.from_pretrained(
    "Tune-A-Video-library/df-cpt-mo-di-bear-guitar", torch_dtype=torch.float16
).to("cuda")

prompt = "A princess playing a guitar, modern disney style"
generator = torch.Generator(device="cuda").manual_seed(42)

video_frames = pipe(prompt, video_length=3, generator=generator, num_inference_steps=50, output_type="np").frames

# Save to GIF (assuming 8 frames per second)
pil_frames = [Image.fromarray(frame) for frame in video_frames]
duration = 1000 / 8  # per-frame duration in milliseconds
pil_frames[0].save(
    "animation.gif",
    save_all=True,
    append_images=pil_frames[1:],  # append the remaining frames
    duration=duration,
    loop=0,
)

# Save to video
video_path = export_to_video(video_frames)
```
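
Both examples repeat the same GIF-saving code, so a small helper can keep things tidy. A sketch only: `save_frames_as_gif` is a hypothetical name, and the 8 fps default mirrors the frame rate assumed above.

```python
from typing import List

import numpy as np
from PIL import Image


def save_frames_as_gif(frames: List[np.ndarray], path: str = "animation.gif", fps: int = 8) -> str:
    """Save a list of HxWx3 uint8 frames as a looping GIF."""
    pil_frames = [Image.fromarray(frame) for frame in frames]
    pil_frames[0].save(
        path,
        save_all=True,
        append_images=pil_frames[1:],
        duration=int(1000 / fps),  # per-frame duration in milliseconds
        loop=0,
    )
    return path


gif_path = save_frames_as_gif(video_frames, fps=8)  # `video_frames` from the example above
```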

Here are some sample outputs:

<table>
  <tr>
    <td><center>
      A princess playing a guitar, modern disney style
      <br>
      <img src="https://huggingface.co/Tune-A-Video-library/df-cpt-mo-di-bear-guitar/resolve/main/samples/princess.gif"
           alt="A princess playing a guitar, modern disney style"
           style="width: 300px;" />
    </center></td>
  </tr>
</table>

## Available checkpoints

* [Tune-A-Video-library/df-cpt-mo-di-bear-guitar](https://huggingface.co/Tune-A-Video-library/df-cpt-mo-di-bear-guitar)

## TuneAVideoPipeline

[[autodoc]] TuneAVideoPipeline
	- all
	- __call__

Changes to an existing model module (diff hunk `@@ -757,7 +757,196 @@ def forward(self, input_tensor, temb, scale: float = 1.0):`):

```python
        return output_tensor  # end of the existing `forward`, shown for diff context


# unet_rl.py
class Upsample3D(nn.Module):
    """A 3D upsampling layer. Reshapes the video-like input tensor into a batch of frames, applies the
    upsampling conv, and converts the result back to the original shape.

    Parameters:
        channels (`int`):
            number of channels in the inputs and outputs.
        out_channels (`int`, *optional*):
            number of output channels. Defaults to `channels`.
    """

    def __init__(self, channels, out_channels=None):
        super().__init__()
        self.channels = channels
        self.out_channels = out_channels or channels

        self.conv = nn.Conv2d(self.channels, self.out_channels, 3, padding=1)

    def forward(self, hidden_states, output_size=None):
        if hidden_states.shape[1] != self.channels:
            raise ValueError(
                f"Expected hidden_states tensor at dimension 1 to match the number of channels. Expected: {self.channels} but passed: {hidden_states.shape[1]}"
            )

        # Cast to float32 as the 'upsample_nearest2d_out_frame' op does not support bfloat16
        dtype = hidden_states.dtype
        if dtype == torch.bfloat16:
            hidden_states = hidden_states.to(torch.float32)

        # upsample_nearest_nhwc fails with large batch sizes. see https://github.com/huggingface/diffusers/issues/984
        if hidden_states.shape[0] >= 64:
            hidden_states = hidden_states.contiguous()

        # if `output_size` is passed we force the interpolation output
        # size and do not make use of `scale_factor=2`
        if output_size is None:
            hidden_states = F.interpolate(hidden_states, scale_factor=[1.0, 2.0, 2.0], mode="nearest")
        else:
            hidden_states = F.interpolate(hidden_states, size=output_size, mode="nearest")

        # If the input was bfloat16, cast back to bfloat16
        if dtype == torch.bfloat16:
            hidden_states = hidden_states.to(dtype)

        # Inflate
        video_length = hidden_states.shape[2]
        # b c f h w -> (b f) c h w
        hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))
        hidden_states = hidden_states.flatten(0, 1)

        hidden_states = self.conv(hidden_states)
        # Deflate
        # (b f) c h w -> b c f h w (f=video_length)
        hidden_states = hidden_states.reshape([-1, video_length, *hidden_states.shape[1:]])
        hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))

        return hidden_states
```
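
These blocks repeatedly "inflate" a video tensor of shape `(batch, channels, frames, height, width)` into a frame batch of shape `(batch * frames, channels, height, width)` via `movedim` + `flatten`, apply a 2D conv, and "deflate" back. A standalone sanity-check sketch of that round trip (not part of the PR):

```python
import torch

b, c, f, h, w = 2, 4, 3, 8, 8
video = torch.randn(b, c, f, h, w)

# Inflate: b c f h w -> (b f) c h w
frames = video.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4)).flatten(0, 1)
assert frames.shape == (b * f, c, h, w)

# Deflate: (b f) c h w -> b c f h w
restored = frames.reshape(b, f, c, h, w).movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))
assert restored.shape == (b, c, f, h, w)
assert torch.equal(restored, video)  # the round trip is lossless
```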

```python
class Downsample3D(nn.Module):
    """A 3D downsampling layer. Reshapes the video-like input tensor into a batch of frames, applies the
    strided conv, and converts the result back to the original shape.

    Parameters:
        channels (`int`):
            number of channels in the inputs and outputs.
        out_channels (`int`, *optional*):
            number of output channels. Defaults to `channels`.
    """

    def __init__(self, channels, out_channels=None, padding=1):
        super().__init__()
        self.channels = channels
        self.out_channels = out_channels or channels
        self.conv = nn.Conv2d(self.channels, self.out_channels, 3, stride=2, padding=padding)

    def forward(self, hidden_states):
        video_length = hidden_states.shape[2]
        # b c f h w -> (b f) c h w
        hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))
        hidden_states = hidden_states.flatten(0, 1)
        # Conv
        hidden_states = self.conv(hidden_states)
        # (b f) c h w -> b c f h w (f=video_length)
        hidden_states = hidden_states.reshape([-1, video_length, *hidden_states.shape[1:]])
        hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))

        return hidden_states
```

Review discussion on the `ResnetBlock3D` class below:

**Member:** Missing docstrings.

**Contributor:** What is the main difference to the existing resnet class here?

**Contributor (author):** `ResnetBlock3D` doesn't support `AdaGroupNorm`, only `torch.nn.GroupNorm`, and the convolution used is `InflatedConv3d` instead of `torch.nn.Conv2d`. I didn't want to merge it with the 2D block, because then `ResnetBlock2D` would also need additional parameters. If you're OK with that, I can add some parameters to make it flexible.

**Member:** I like this better as it encourages disentanglement between the modules and separation of concerns.

**Contributor** (on lines +868 to +869), with a suggested change: no?

```python
class ResnetBlock3D(nn.Module):
    r"""
    A Resnet block used specifically for video-like data.

    Parameters:
        in_channels (`int`): The number of channels in the input.
        out_channels (`int`, *optional*, defaults to `None`):
            The number of output channels for the first conv2d layer. If None, same as `in_channels`.
        dropout (`float`, *optional*, defaults to `0.0`): The dropout probability to use.
        temb_channels (`int`, *optional*, defaults to `512`): the number of channels in the timestep embedding.
        groups (`int`, *optional*, defaults to `32`): The number of groups to use for the first normalization layer.
        eps (`float`, *optional*, defaults to `1e-6`): The epsilon to use for the normalization.
        non_linearity (`str`, *optional*, defaults to `"swish"`): the activation function to use.
        output_scale_factor (`float`, *optional*, defaults to `1.0`): the scale factor to use for the output.
    """

    def __init__(
        self,
        *,
        in_channels,
||||||||||
| out_channels=None, | ||||||||||
| dropout=0.0, | ||||||||||
| temb_channels=512, | ||||||||||
| groups=32, | ||||||||||
| eps=1e-6, | ||||||||||
| non_linearity="swish", | ||||||||||
| output_scale_factor=1.0, | ||||||||||
| ): | ||||||||||
| super().__init__() | ||||||||||
| self.in_channels = in_channels | ||||||||||
| out_channels = in_channels if out_channels is None else out_channels | ||||||||||
| self.out_channels = out_channels | ||||||||||
| self.output_scale_factor = output_scale_factor | ||||||||||
|
|
||||||||||
| self.norm1 = torch.nn.GroupNorm(num_groups=groups, num_channels=in_channels, eps=eps, affine=True) | ||||||||||
| self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1) | ||||||||||
|
|
||||||||||
| self.time_emb_proj = torch.nn.Linear(temb_channels, out_channels) | ||||||||||
|
|
||||||||||
| self.norm2 = torch.nn.GroupNorm(num_groups=groups, num_channels=out_channels, eps=eps, affine=True) | ||||||||||
| self.dropout = torch.nn.Dropout(dropout) | ||||||||||
| self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1) | ||||||||||
|
|
||||||||||
| self.nonlinearity = get_activation(non_linearity) | ||||||||||
|
|
||||||||||
| self.use_in_shortcut = self.in_channels != self.out_channels | ||||||||||
| self.conv_shortcut = None | ||||||||||
| if self.use_in_shortcut: | ||||||||||
| self.conv_shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0) | ||||||||||
|
|
||||||||||
| def forward(self, input_tensor, temb): | ||||||||||
| hidden_states = input_tensor | ||||||||||
|
|
||||||||||
| hidden_states = self.norm1(hidden_states) | ||||||||||
| hidden_states = self.nonlinearity(hidden_states) | ||||||||||
|
|
||||||||||
| video_length = hidden_states.shape[2] | ||||||||||
| # b c f h w -> (b f) c h w | ||||||||||
| hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4)) | ||||||||||
| hidden_states = hidden_states.flatten(0, 1) | ||||||||||
| hidden_states = self.conv1(hidden_states) | ||||||||||
| # (b f) c h w -> b c f h w (f=video_length | ||||||||||
| hidden_states = hidden_states.reshape([-1, video_length, *hidden_states.shape[1:]]) | ||||||||||
| hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4)) | ||||||||||
|
|
||||||||||
| if temb is not None: | ||||||||||
| temb = self.time_emb_proj(self.nonlinearity(temb))[:, :, None, None, None] | ||||||||||
|
|
||||||||||
| hidden_states = hidden_states + temb | ||||||||||
|
|
||||||||||
| hidden_states = self.norm2(hidden_states) | ||||||||||
|
|
||||||||||
| hidden_states = self.nonlinearity(hidden_states) | ||||||||||
|
|
||||||||||
| hidden_states = self.dropout(hidden_states) | ||||||||||
|
|
||||||||||
| video_length = hidden_states.shape[2] | ||||||||||
| # b c f h w -> (b f) c h w | ||||||||||
| hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4)) | ||||||||||
| hidden_states = hidden_states.flatten(0, 1) | ||||||||||
| hidden_states = self.conv2(hidden_states) | ||||||||||
| # (b f) c h w -> b c f h w (f=video_length) | ||||||||||
| hidden_states = hidden_states.reshape([-1, video_length, *hidden_states.shape[1:]]) | ||||||||||
| hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4)) | ||||||||||
|
|
||||||||||
| if self.conv_shortcut is not None: | ||||||||||
| video_length = input_tensor.shape[2] | ||||||||||
| # "b c f h w -> (b f) c h w" | ||||||||||
| input_tensor = input_tensor.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4)) | ||||||||||
| input_tensor = input_tensor.flatten(0, 1) | ||||||||||
| input_tensor = self.conv_shortcut(input_tensor) | ||||||||||
| # "(b f) c h w -> b c f h w"; f=video_length | ||||||||||
| input_tensor = input_tensor.reshape([-1, video_length, *input_tensor.shape[1:]]) | ||||||||||
| input_tensor = input_tensor.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4)) | ||||||||||
|
|
||||||||||
| output_tensor = (input_tensor + hidden_states) / self.output_scale_factor | ||||||||||
|
|
||||||||||
| return output_tensor | ||||||||||
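
For reference, a quick shape check for `ResnetBlock3D`. This is a sketch that assumes the class defined above (and its module-level imports) is in scope; the channel sizes are arbitrary.

```python
import torch

# Assumes ResnetBlock3D from the diff above is importable in the current scope.
block = ResnetBlock3D(in_channels=32, out_channels=64, temb_channels=512)

video = torch.randn(1, 32, 3, 16, 16)  # (batch, channels, frames, height, width)
temb = torch.randn(1, 512)             # timestep embedding

out = block(video, temb)
print(out.shape)  # torch.Size([1, 64, 3, 16, 16]) -- spatial and temporal dims unchanged
```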

```python
def rearrange_dims(tensor: torch.Tensor) -> torch.Tensor:
    if len(tensor.shape) == 2:
        return tensor[:, :, None]
```

Review comment: Missing docstrings.