Closed

Changes from all commits — 239 commits
ee82252
port files from TuneAVideo repo
Abhinay1997 Feb 22, 2023
4ff86f1
Replace einops with torch ops
Abhinay1997 Feb 26, 2023
a665a25
Fix imports
Abhinay1997 Feb 27, 2023
1ada882
Fix missing import
Abhinay1997 Feb 27, 2023
c8a9780
Fix missing import in diffusers.__init__
Abhinay1997 Feb 27, 2023
5222dc9
Merge branch 'main' into tune_a_video_port
Abhinay1997 Feb 27, 2023
dc6896a
debugging import issue on colab
Abhinay1997 Feb 27, 2023
851eafa
Add reshape_heads_to_batch_dim & reshape_batch_dim_to_heads
Abhinay1997 Feb 27, 2023
f8464fb
Testing SparseCasualAttention ported as a AttnProcessor
Abhinay1997 Mar 5, 2023
5b9420a
Debug logs
Abhinay1997 Mar 5, 2023
38352fc
Bug fix
Abhinay1997 Mar 5, 2023
8bcc552
bug fix
Abhinay1997 Mar 5, 2023
e6d8a36
Missing import
Abhinay1997 Mar 5, 2023
96d3f69
debug cross attention
Abhinay1997 Mar 15, 2023
2943cfb
[FIX] bug in reshaping
Abhinay1997 Mar 18, 2023
8050650
Refactoring + Formatting fixes
Abhinay1997 Mar 20, 2023
60903a5
Formatting + Refactoring
Abhinay1997 Mar 20, 2023
8bbd946
Merge branch 'main' into tune_a_video_port
sayakpaul Mar 20, 2023
967cc9a
some minor refactors.
sayakpaul Mar 20, 2023
72d7909
Merge TuneAVideo & Text2VideoSD
Abhinay1997 Mar 31, 2023
b2f31e2
Update to match main
Abhinay1997 Mar 31, 2023
b6e5798
Fix method calls
Abhinay1997 Mar 31, 2023
ceda00b
extra arg removed
Abhinay1997 Mar 31, 2023
a4f3fb3
Fix method call params
Abhinay1997 Mar 31, 2023
9248ca1
debug
Abhinay1997 Mar 31, 2023
4a1c144
debug conv input
Abhinay1997 Mar 31, 2023
846a328
debugging transformer 3d
Abhinay1997 Mar 31, 2023
2a9f5c6
debugging
Abhinay1997 Mar 31, 2023
a96fd08
bug fix :)
Abhinay1997 Mar 31, 2023
c22137e
Bug fixes
Abhinay1997 Mar 31, 2023
e1b25d5
use the right attribute
Abhinay1997 Mar 31, 2023
f460df9
Transformer3DModel can't handle cross_attention_kwargs
Abhinay1997 Mar 31, 2023
65e3b7e
debug
Abhinay1997 Mar 31, 2023
8008b61
bug fix
Abhinay1997 Mar 31, 2023
018b8b8
debugging SparseCausal Attention
Abhinay1997 Mar 31, 2023
a9d6d47
Debugging
Abhinay1997 Mar 31, 2023
81e3a5b
debug
Abhinay1997 Mar 31, 2023
b2d59af
Testing transformer input
Abhinay1997 Mar 31, 2023
422791e
debug convnet hidden state shapes
Abhinay1997 Mar 31, 2023
7996cee
debug logs
Abhinay1997 Apr 1, 2023
a21f771
undo transfomer change
Abhinay1997 Apr 1, 2023
3df07fb
calculate transformer ip inside loop
Abhinay1997 Apr 1, 2023
efd9dcc
Bug fix
Abhinay1997 Apr 1, 2023
dfaf7b6
conditional reshaping
Abhinay1997 Apr 1, 2023
665c676
Change default use_linear_proj
Abhinay1997 Apr 3, 2023
ec19172
Add missing temporal conv layer
Abhinay1997 Apr 3, 2023
2effa4d
Formatting changes
Abhinay1997 Apr 3, 2023
1960565
Remove debug logs
Abhinay1997 Apr 3, 2023
f62c2d6
Merge pull request #1 from Abhinay1997/resolve_conflicts
Abhinay1997 Apr 3, 2023
9d764d0
Merge conflict resolved.
Abhinay1997 Apr 4, 2023
e160d6a
Merge branch 'main' into tune_a_video_port
Abhinay1997 Apr 4, 2023
54e3867
Move Transformer3DModel to its own module
Abhinay1997 Apr 4, 2023
9ec0031
Merge remote-tracking branch 'origin/tune_a_video_port' into tune_a_v…
Abhinay1997 Apr 4, 2023
e03a59f
Merge branch 'main' into tune_a_video_port
Abhinay1997 Apr 6, 2023
bd03a9e
fix missing import
Abhinay1997 Apr 6, 2023
660579e
Make output of tune a video similar to Text2VideoSD
Abhinay1997 Apr 7, 2023
b2e2ebf
Add docstrings for Transformer3DModel.
Abhinay1997 Apr 8, 2023
28e0b53
Merge branch 'main' into tune_a_video_port
Abhinay1997 Apr 9, 2023
a81ad8c
Adds docs, formatting changes and enable_vae_tiling
Abhinay1997 Apr 9, 2023
e4e9f88
Add TextualInversionLoaderMixin support
Abhinay1997 Apr 9, 2023
ebffd82
Update docs + Add tests
Abhinay1997 Apr 9, 2023
2178bcf
Formatting + missing imports
Abhinay1997 Apr 9, 2023
fe976f9
Merge branch 'main' into tune_a_video_port
patrickvonplaten Apr 10, 2023
a43346b
Merge branch 'main' into tune_a_video_port
Abhinay1997 May 4, 2023
033d6ec
Remove TuneAVideoAttnProcessor from deprecated module
Abhinay1997 May 10, 2023
ddfe0c3
Remove unused kwargs from unet3d init
Abhinay1997 May 10, 2023
3021b05
Remove mid_block_type kwarg in unet3d
Abhinay1997 May 10, 2023
33774d2
Remove dual_cross_attention kwarg - unet3d,unet3d blocks
Abhinay1997 May 10, 2023
14e0ae9
Remove InflatedConv3d dependency - Upsample3D
Abhinay1997 May 11, 2023
9124f99
Remove InflatedConv3D - Downsample3D
Abhinay1997 May 11, 2023
1408140
Remove InflatedConv3d - Resnet3D
Abhinay1997 May 11, 2023
239ff3f
Remove InflatedConv3d dependency - UNet3D
Abhinay1997 May 11, 2023
cfe4ab0
Bug fix - incorrect condition to switch to TuneAVideo
Abhinay1997 May 11, 2023
b60d651
Remove InflatedConv3d class
Abhinay1997 May 11, 2023
278dc36
Remove unused kwarg - unet_3d_condition
Abhinay1997 May 11, 2023
95885a8
Remove class embeddings in unet3d.
Abhinay1997 May 15, 2023
b3bb4a6
Remove unused kwarg - use_conv_transpose in Upsample3D
Abhinay1997 May 15, 2023
0cda1d0
make style
Abhinay1997 May 15, 2023
1a03d0b
Merge pull request #3 from Abhinay1997/final_debug
Abhinay1997 May 15, 2023
1c56363
Merge branch 'main' into final_debug
Abhinay1997 May 15, 2023
2ce47f1
Merge pull request #4 from Abhinay1997/final_debug
Abhinay1997 May 15, 2023
e463a57
Remove README.md
Abhinay1997 May 18, 2023
ae528d1
Removed deprecation checks in init
Abhinay1997 May 18, 2023
32e817d
make style
Abhinay1997 May 18, 2023
bf0c330
Merge branch 'main' into tune_a_video_port
Abhinay1997 May 18, 2023
d691344
Merge branch 'main' into tune_a_video_port
Abhinay1997 May 21, 2023
9962a3e
Merge branch 'main' into tune_a_video_port
Abhinay1997 May 22, 2023
366ac03
from_pretrained_2d fixes for TuneAvideo
Abhinay1997 May 22, 2023
6a67890
Add test npy tensor hosted on hugignface internal
Abhinay1997 May 22, 2023
65e8714
make style
Abhinay1997 May 22, 2023
d2d9c27
Merge branch 'main' into tune_a_video_port
Abhinay1997 May 24, 2023
c728e46
Merge branch 'main' into tune_a_video_port
Abhinay1997 May 27, 2023
ff9277f
Merge branch 'main' into tune_a_video_port
Abhinay1997 May 29, 2023
3ad8d2b
[wip] custom unet blocks
Abhinay1997 Jun 5, 2023
35bc2dc
[wip] custom unet 3d blocks
Abhinay1997 Jun 5, 2023
8884912
Remove unnecessary vars
Abhinay1997 Jun 5, 2023
c20e573
update get_up_block, get_down_block to use custom blocks
Abhinay1997 Jun 5, 2023
fd06295
Enable loading unet with custom blocks.
Abhinay1997 Jun 5, 2023
e00fb86
Bug fix ? Circular import
Abhinay1997 Jun 5, 2023
50363f8
Remove dual_cross_attention kwarg from custom unet blocks
Abhinay1997 Jun 5, 2023
e6759c0
Custom unet block changes
Abhinay1997 Jun 6, 2023
5be8330
Remove unnecessary code
Abhinay1997 Jun 6, 2023
61032cd
Move attn processor to pipeline.
Abhinay1997 Jun 6, 2023
e549407
Set the attention processor on init
Abhinay1997 Jun 6, 2023
9db033b
make style
Abhinay1997 Jun 6, 2023
8cdbca9
remove unused xformers method
Abhinay1997 Jun 7, 2023
2a2af8f
Add docs to BasicSparseTransformerBlock
Abhinay1997 Jun 7, 2023
738c50c
Merge pull request #6 from Abhinay1997/custom_unet_blocks
Abhinay1997 Jun 7, 2023
af53edf
Resolve conflicts with main
Abhinay1997 Jun 7, 2023
cc83a61
Fix failing test -> incorrect copied from text.
Abhinay1997 Jun 7, 2023
1c01a66
copied from fix
Abhinay1997 Jun 7, 2023
ee67a25
make fix-copies
Abhinay1997 Jun 8, 2023
86206b0
Add tune a video docs to toc
Abhinay1997 Jun 8, 2023
44c17bb
Add sample code - tuneavideo docs
Abhinay1997 Jun 9, 2023
9e99fa7
toc bug-fix + make style
Abhinay1997 Jun 9, 2023
a79011e
fix tests for tuneavideo
Abhinay1997 Jun 9, 2023
0495f77
make style
Abhinay1997 Jun 9, 2023
86e5f84
fix test imports
Abhinay1997 Jun 9, 2023
950ee03
Merge branch 'main' into tune_a_video_port
Abhinay1997 Jun 9, 2023
7112d02
Make video_length an optional param in __call__
Abhinay1997 Jun 9, 2023
3963d9c
undo optional type on video len - didn't fix test
Abhinay1997 Jun 9, 2023
8fbd857
update to use Tune-A-Video ckpt in tests
Abhinay1997 Jun 9, 2023
7d6a153
Add prompt_embeds, negative_prompt_embeds and cross_attn_kwargs to __…
Abhinay1997 Jun 9, 2023
ab527f9
correct check_inputs
Abhinay1997 Jun 9, 2023
29c4cb0
make style
Abhinay1997 Jun 9, 2023
21f389a
update example checkpoint in docs
Abhinay1997 Jun 9, 2023
b2e46f3
remove redundant code
Abhinay1997 Jun 9, 2023
f064ead
Testing tune_a_video test fail
Abhinay1997 Jun 9, 2023
673e628
make style
Abhinay1997 Jun 9, 2023
466b519
bug fix
Abhinay1997 Jun 9, 2023
8b039a4
Merge branch 'debug_tests' into tune_a_video_port
Abhinay1997 Jun 9, 2023
fc4e610
Replace CrossAttention with Attention + undo test changes
Abhinay1997 Jun 9, 2023
20672c0
skip unsupported feature test
Abhinay1997 Jun 10, 2023
4958ba1
Merge branch 'debug_tests' into tune_a_video_port
Abhinay1997 Jun 10, 2023
cd23221
make style
Abhinay1997 Jun 10, 2023
72205d0
Merge branch 'main' into tune_a_video_port
Abhinay1997 Jun 10, 2023
4aff39d
update doc links
Abhinay1997 Jun 12, 2023
32016a8
Removes the weight renaming for conv
Abhinay1997 Jun 12, 2023
4e5f85d
Removes the weight renaming for conv
Abhinay1997 Jun 12, 2023
66434dd
Removes the weight renaming for conv
Abhinay1997 Jun 12, 2023
bb5094b
Merge branch 'tune_a_video_port' of https://github.com/Abhinay1997/di…
Abhinay1997 Jun 12, 2023
769e15d
remove name=conv in resnet
Abhinay1997 Jun 12, 2023
9f50db6
Mish refactoring
Abhinay1997 Jun 12, 2023
888ef44
fix sample code in docs
Abhinay1997 Jun 12, 2023
046eb83
Merge pull request #10 from Abhinay1997/resnet_fixes
Abhinay1997 Jun 12, 2023
f3aee02
Fix link to space
Abhinay1997 Jun 12, 2023
4c2ff4e
Merge branch 'tune_a_video_port' of https://github.com/Abhinay1997/di…
Abhinay1997 Jun 12, 2023
34e90f6
make style
Abhinay1997 Jun 12, 2023
322da01
Merge branch 'main' into tune_a_video_port
Abhinay1997 Jun 12, 2023
991f43b
Merge branch 'main' into tune_a_video_port
Abhinay1997 Aug 4, 2023
00154ae
Replace attn_num_head_channels with num_attention_heads
Abhinay1997 Aug 4, 2023
39c4888
Remove dependency on deleted cross_attention.py
Abhinay1997 Aug 4, 2023
1c895ec
Merge branch 'main' into tune_a_video_port
Abhinay1997 Aug 9, 2023
c7a62d1
Add missing docs
Abhinay1997 Aug 9, 2023
01de3a0
Validation is done on attention_head_dim. Not num_attention_heads
Abhinay1997 Aug 9, 2023
ad54258
Remove center_input_sample (not being used yet.)
Abhinay1997 Aug 9, 2023
cc49b1a
Remove 'resnet_time_scale_shift'. Not needed yet.
Abhinay1997 Aug 9, 2023
3786a7f
Remove 'only_cross_attention'. Not needed yet.
Abhinay1997 Aug 9, 2023
d3047d6
Remove 'upcast_attention'. Not needed yet
Abhinay1997 Aug 9, 2023
40bdf87
make style
Abhinay1997 Aug 9, 2023
dff9536
[Fix] transformer3d.md
Abhinay1997 Aug 9, 2023
a202f81
Remove _execution_device and enable_sequential_cpu_offload from pipel…
Abhinay1997 Aug 9, 2023
ec1d3ae
Add missing entry in toctree Transformer3D
Abhinay1997 Aug 9, 2023
32d68d3
make fix-copies
Abhinay1997 Aug 9, 2023
51c1e2d
[FIX] transformer3d.md
Abhinay1997 Aug 9, 2023
a58ee65
Merge branch 'main' into tune_a_video_port
Abhinay1997 Aug 10, 2023
ad9f32e
Merge branch 'main' into tune_a_video_port
Abhinay1997 Aug 23, 2023
aa5d4ac
Merge branch 'main' into tune_a_video_port
Abhinay1997 Aug 28, 2023
7c598f6
Merge branch 'main' into tune_a_video_port
Abhinay1997 Sep 5, 2023
66cadd7
Remove 'use_linear_projection' from transformer3d
Abhinay1997 Sep 6, 2023
a2ae452
Remove 'add_upsample' and 'add_downsample' from Inflated unet blocks.
Abhinay1997 Sep 6, 2023
49a3002
Revert "Remove 'add_upsample' and 'add_downsample' from Inflated unet…
Abhinay1997 Sep 6, 2023
64225be
Remove 'upcast_attention' as its always False.
Abhinay1997 Sep 6, 2023
fa7236f
cross_attention_dim is always present for tuneavideo blocks
Abhinay1997 Sep 6, 2023
84de858
cross_attention_dim is always present for tuneavideo blocks
Abhinay1997 Sep 6, 2023
7fdff31
Remove 'only_cross_attention' for unet blocks and 3d transformer.
Abhinay1997 Sep 6, 2023
52b74a4
Remove only_cross_attention
Abhinay1997 Sep 6, 2023
c602867
Fix docstrings
Abhinay1997 Sep 7, 2023
1b91d31
Remove 'use_conv' from Downsample3D
Abhinay1997 Sep 7, 2023
ba8336b
Upsample3D use_conv is always True
Abhinay1997 Sep 7, 2023
47769e0
Remove `num_embeds_ada_norm` and `attention_bias` from init for trans…
Abhinay1997 Sep 7, 2023
560b7a6
Remove cross_attention_kwargs as it doesn't flow to BasicSparseTransf…
Abhinay1997 Sep 7, 2023
957393a
Bug fix. `use_ada_layer_norm`
Abhinay1997 Sep 7, 2023
6fb6f19
UNetMidBlockInflated3DCrossAttn remove cross_attn_kwargs
Abhinay1997 Sep 7, 2023
cf89b7e
Fix docs
Abhinay1997 Sep 7, 2023
3ce3b05
Attribute Showlab/TuneAVideo
Abhinay1997 Sep 7, 2023
c5cf07c
Don't alter SpectogramDiffusionPipeline init
Abhinay1997 Sep 7, 2023
2517e57
ResnetBlock3D: removes `pre_norm` and `use_in_shortcut` args
Abhinay1997 Sep 7, 2023
277624c
ResnetBlock3D remove pre_norm
Abhinay1997 Sep 7, 2023
8b90210
Remove `time_embedding_norm` and `resnet_time_scale_shift` as they ar…
Abhinay1997 Sep 7, 2023
5589a69
Bug Fix: UNetMidBlock3DCrossAttn don't pass `resnet_time_scale_shift`
Abhinay1997 Sep 7, 2023
ed6480e
Remove `conv_shortcut` arg from ResnetBlock3D
Abhinay1997 Sep 7, 2023
fc8b890
`groups_out` is always same as `groups` ResnetBlock3D
Abhinay1997 Sep 8, 2023
756c182
Add docstrings to resnet blocks
Abhinay1997 Sep 8, 2023
b1da435
Remove unused args in docstring BasicSparseTransformerBlock
Abhinay1997 Sep 8, 2023
49441f3
Remove from_pretrained_2d for UNet3DConditionModel. Not used in infer…
Abhinay1997 Sep 8, 2023
1f074e3
Remove comment.
Abhinay1997 Sep 9, 2023
94a42bf
Replace `_encode_prompt` with `encode_prompt` in line with #4140
Abhinay1997 Sep 9, 2023
8bf0172
Merge pull request #12 from Abhinay1997/tuneavid_refactor
Abhinay1997 Sep 9, 2023
0d92621
Merge branch 'main' into tune_a_video_port
Abhinay1997 Sep 9, 2023
131b498
Remove `dummy_torch_and_note_seq_objects.py` -> unrelated to PR.
Abhinay1997 Sep 9, 2023
5c2b4a7
make-style
Abhinay1997 Sep 9, 2023
cbfe5a6
Remove name='op' from Downsample3D
Abhinay1997 Sep 9, 2023
2298572
Remove `use_linear_projection` from unet
Abhinay1997 Sep 10, 2023
8ad40e1
Compatability with changes from #4829 lazy-loading
Abhinay1997 Sep 12, 2023
1ea5fc4
Doc fixes
Abhinay1997 Sep 13, 2023
f712505
Merge branch 'main' into tune_a_video_port
Abhinay1997 Sep 13, 2023
c85486c
Fix imports in test_tune_a_video.py
Abhinay1997 Sep 13, 2023
d9f96e3
Merge branch 'tune_a_video_port' of https://github.com/Abhinay1997/di…
Abhinay1997 Sep 13, 2023
97ecea6
make style
Abhinay1997 Sep 13, 2023
918da5a
Merge branch 'main' into tune_a_video_port
Abhinay1997 Sep 13, 2023
3249c4d
Merge branch 'main' into tune_a_video_port
Abhinay1997 Sep 14, 2023
1788b7a
Fix merge conflict in __init__:
Abhinay1997 Sep 22, 2023
5b36028
Merge branch 'tune_a_video_port' of https://github.com/Abhinay1997/di…
Abhinay1997 Sep 22, 2023
584540f
Fix conflict in __init__
Abhinay1997 Sep 23, 2023
bc3d7da
Changes to BasicSparseTransformerBlock
Abhinay1997 Sep 23, 2023
4d28536
Remove assert based checks + make style
Abhinay1997 Sep 23, 2023
c859e21
Reduce `num_inference_steps` for full test.
Abhinay1997 Sep 23, 2023
56efe3e
[Fix] Import typo
Abhinay1997 Sep 23, 2023
de3189f
make style
Abhinay1997 Sep 23, 2023
2efc982
Update test script to use right verification tensor
Abhinay1997 Sep 23, 2023
1c9862e
Skip tests that don't support custom attn processor
Abhinay1997 Sep 24, 2023
b01388b
Merge branch 'main' into tune_a_video_port
Abhinay1997 Sep 24, 2023
3708620
make style
Abhinay1997 Sep 26, 2023
45f69db
Merge branch 'tune_a_video_port' of https://github.com/Abhinay1997/di…
Abhinay1997 Sep 26, 2023
92de87d
Fix missing transformer3d import
Abhinay1997 Sep 28, 2023
cfd1d38
Merge branch 'main' into tune_a_video_port
Abhinay1997 Oct 1, 2023
12c3b57
Offload all models after video generation
Abhinay1997 Oct 3, 2023
c75daf2
Remove redundant offload method
Abhinay1997 Oct 3, 2023
9d11a5e
Remove einops comments from transformer3d model
Abhinay1997 Oct 3, 2023
c84f55e
Revert changes and retain `num_attention_heads` instead of `attention…
Abhinay1997 Oct 3, 2023
1a0da2d
Fix missing import
Abhinay1997 Oct 3, 2023
ad23710
make style
Abhinay1997 Oct 3, 2023
76387d9
Merge branch 'main' into tune_a_video_port
Abhinay1997 Oct 3, 2023
0019408
make rearrange comments framework agnostic
Abhinay1997 Oct 4, 2023
efcce2c
Resolve merge conflicts:
Abhinay1997 Oct 19, 2023
1c4b843
make fix-copies
Abhinay1997 Oct 19, 2023
2d17ffc
replace np.abs with `nump_cosine_similarity_distance`
Abhinay1997 Oct 24, 2023
515572e
Add missing import
Abhinay1997 Oct 24, 2023
4 changes: 4 additions & 0 deletions docs/source/en/_toctree.yml
@@ -200,6 +200,8 @@
title: Tiny AutoEncoder
- local: api/models/transformer2d
title: Transformer2D
- local: api/models/transformer3d
title: Transformer3D
- local: api/models/transformer_temporal
title: Transformer Temporal
- local: api/models/prior_transformer
@@ -316,6 +318,8 @@
title: Text-to-video
- local: api/pipelines/text_to_video_zero
title: Text2Video-Zero
- local: api/pipelines/tune_a_video
title: Tune-A-Video
- local: api/pipelines/unclip
title: UnCLIP
- local: api/pipelines/latent_diffusion_uncond
11 changes: 11 additions & 0 deletions docs/source/en/api/models/transformer3d.md
@@ -0,0 +1,11 @@
# Transformer3D

The Transformer2D model extended for video-like data.

## Transformer3DModel

[[autodoc]] Transformer3DModel

## Transformer3DModelOutput

[[autodoc]] models.transformer_3d.Transformer3DModelOutput
122 changes: 122 additions & 0 deletions docs/source/en/api/pipelines/tune_a_video.mdx
@@ -0,0 +1,122 @@
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Tune-A-Video

## Overview

[Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation](https://arxiv.org/abs/2212.11565) by Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou.

The abstract of the paper is the following:

*To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such paradigm is computationally expensive. In this work, we propose a new T2V generation setting—One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications.*

Resources:

* [GitHub repository](https://github.com/showlab/Tune-A-Video)
* [🤗 Spaces](https://huggingface.co/spaces/Tune-A-Video-library/Tune-A-Video-Training-UI)

## Available Pipelines:

| Pipeline | Tasks | Demo |
|---|---|:---:|
| [TuneAVideoPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/tune_a_video/pipeline_tune_a_video.py) | *Text-to-Video Generation* | [🤗 Spaces](https://huggingface.co/spaces/Tune-A-Video-library/Tune-A-Video-inference)

## Usage example

### Loading with a pre-existing Text2Image checkpoint
```python
import torch
from diffusers import TuneAVideoPipeline, UNet3DConditionModel
from diffusers.utils import export_to_video
from PIL import Image

# Use any pretrained Text2Image checkpoint based on Stable Diffusion
pretrained_model_path = "nitrosocke/mo-di-diffusion"
unet = UNet3DConditionModel.from_pretrained(
    "Tune-A-Video-library/df-cpt-mo-di-bear-guitar", subfolder="unet", torch_dtype=torch.float16
).to("cuda")

pipe = TuneAVideoPipeline.from_pretrained(pretrained_model_path, unet=unet, torch_dtype=torch.float16).to("cuda")

prompt = "A princess playing a guitar, modern disney style"
generator = torch.Generator(device="cuda").manual_seed(42)

video_frames = pipe(prompt, video_length=3, generator=generator, num_inference_steps=50, output_type="np").frames

# Saving to gif
pil_frames = [Image.fromarray(frame) for frame in video_frames]
duration_ms = 1000 / 8  # display time per frame in milliseconds (8 fps)
pil_frames[0].save(
    "animation.gif",
    save_all=True,
    append_images=pil_frames[1:],  # append the remaining frames
    duration=duration_ms,
    loop=0,
)

# Saving to video
video_path = export_to_video(video_frames)
```

### Loading a saved Tune-A-Video checkpoint
```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
from PIL import Image

pipe = DiffusionPipeline.from_pretrained(
    "Tune-A-Video-library/df-cpt-mo-di-bear-guitar", torch_dtype=torch.float16
).to("cuda")

prompt = "A princess playing a guitar, modern disney style"
generator = torch.Generator(device="cuda").manual_seed(42)

video_frames = pipe(prompt, video_length=3, generator=generator, num_inference_steps=50, output_type="np").frames

# Saving to gif
pil_frames = [Image.fromarray(frame) for frame in video_frames]
duration_ms = 1000 / 8  # display time per frame in milliseconds (8 fps)
pil_frames[0].save(
    "animation.gif",
    save_all=True,
    append_images=pil_frames[1:],  # append the remaining frames
    duration=duration_ms,
    loop=0,
)

# Saving to video
video_path = export_to_video(video_frames)
```

Here are some sample outputs:

<table>
<tr>
<td><center>
A princess playing a guitar, modern disney style
<br>
<img src="https://huggingface.co/Tune-A-Video-library/df-cpt-mo-di-bear-guitar/resolve/main/samples/princess.gif"
alt="A princess playing a guitar, modern disney style"
style="width: 300px;" />
</center></td>
</tr>
</table>

## Available checkpoints

* [Tune-A-Video-library/df-cpt-mo-di-bear-guitar](https://huggingface.co/Tune-A-Video-library/df-cpt-mo-di-bear-guitar)

## TuneAVideoPipeline
[[autodoc]] TuneAVideoPipeline
- all
- __call__
4 changes: 4 additions & 0 deletions src/diffusers/__init__.py
@@ -84,6 +84,7 @@
"T2IAdapter",
"T5FilmDecoder",
"Transformer2DModel",
"Transformer3DModel",
"UNet1DModel",
"UNet2DConditionModel",
"UNet2DModel",
@@ -268,6 +269,7 @@
"StableUnCLIPPipeline",
"TextToVideoSDPipeline",
"TextToVideoZeroPipeline",
"TuneAVideoPipeline",
"UnCLIPImageVariationPipeline",
"UnCLIPPipeline",
"UniDiffuserModel",
@@ -443,6 +445,7 @@
T2IAdapter,
T5FilmDecoder,
Transformer2DModel,
Transformer3DModel,
UNet1DModel,
UNet2DConditionModel,
UNet2DModel,
@@ -606,6 +609,7 @@
StableUnCLIPPipeline,
TextToVideoSDPipeline,
TextToVideoZeroPipeline,
TuneAVideoPipeline,
UnCLIPImageVariationPipeline,
UnCLIPPipeline,
UniDiffuserModel,
2 changes: 2 additions & 0 deletions src/diffusers/models/__init__.py
@@ -30,6 +30,7 @@
_import_structure["prior_transformer"] = ["PriorTransformer"]
_import_structure["t5_film_transformer"] = ["T5FilmDecoder"]
_import_structure["transformer_2d"] = ["Transformer2DModel"]
_import_structure["transformer_3d"] = ["Transformer3DModel"]
_import_structure["transformer_temporal"] = ["TransformerTemporalModel"]
_import_structure["unet_1d"] = ["UNet1DModel"]
_import_structure["unet_2d"] = ["UNet2DModel"]
@@ -55,6 +56,7 @@
from .prior_transformer import PriorTransformer
from .t5_film_transformer import T5FilmDecoder
from .transformer_2d import Transformer2DModel
from .transformer_3d import Transformer3DModel
from .transformer_temporal import TransformerTemporalModel
from .unet_1d import UNet1DModel
from .unet_2d import UNet2DModel
191 changes: 190 additions & 1 deletion src/diffusers/models/resnet.py
@@ -757,7 +757,196 @@ def forward(self, input_tensor, temb, scale: float = 1.0):
return output_tensor


# unet_rl.py
class Upsample3D(nn.Module):
"""A 3D upsampling layer. Upsamples the video tensor spatially, then reshapes it to a batch of frames to apply the
upsampling convolution, and converts it back to the original shape.

Parameters:
channels (`int`):
number of channels in the inputs and outputs.
out_channels (`int`, optional):
number of output channels. Defaults to `channels`.
"""

def __init__(self, channels, out_channels=None):
super().__init__()
self.channels = channels
self.out_channels = out_channels or channels

self.conv = nn.Conv2d(self.channels, self.out_channels, 3, padding=1)

def forward(self, hidden_states, output_size=None):
if hidden_states.shape[1] != self.channels:
raise ValueError(
f"Expected hidden_states tensor at dimension 1 to match the number of channels. Expected: {self.channels} but passed: {hidden_states.shape[1]}"
)

# Cast to float32 as the 'upsample_nearest2d_out_frame' op does not support bfloat16
dtype = hidden_states.dtype
if dtype == torch.bfloat16:
hidden_states = hidden_states.to(torch.float32)

# upsample_nearest_nhwc fails with large batch sizes. see https://github.com/huggingface/diffusers/issues/984
if hidden_states.shape[0] >= 64:
hidden_states = hidden_states.contiguous()

# if `output_size` is passed we force the interpolation output
# size and do not make use of `scale_factor=2`
if output_size is None:
hidden_states = F.interpolate(hidden_states, scale_factor=[1.0, 2.0, 2.0], mode="nearest")
else:
hidden_states = F.interpolate(hidden_states, size=output_size, mode="nearest")

# If the input is bfloat16, we cast back to bfloat16
if dtype == torch.bfloat16:
hidden_states = hidden_states.to(dtype)

# Inflate
video_length = hidden_states.shape[2]
# b c f h w -> (b f) c h w
hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))
hidden_states = hidden_states.flatten(0, 1)

hidden_states = self.conv(hidden_states)
# Deflate
# (b f) c h w -> b c f h w (f=video_length)
hidden_states = hidden_states.reshape([-1, video_length, *hidden_states.shape[1:]])
hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))

return hidden_states
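
The movedim/flatten pairs above are the torch replacements for the original einops `rearrange` calls (cf. the commit "Replace einops with torch ops"). As a sanity check, here is a pure-Python sketch of the shape bookkeeping only — no torch, and the helper names are illustrative, not part of the PR:

```python
# Shape-level sketch of the reshaping done in Upsample3D/Downsample3D.
# movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4)) swaps the channel and frame axes,
# and flatten(0, 1) merges batch and frame into a single leading axis.
def fold_frames(shape):
    b, c, f, h, w = shape
    # b c f h w -> b f c h w -> (b f) c h w
    return (b * f, c, h, w)

def unfold_frames(shape, video_length):
    bf, c, h, w = shape
    # (b f) c h w -> b f c h w -> b c f h w
    return (bf // video_length, c, video_length, h, w)

print(fold_frames((2, 4, 8, 16, 16)))     # (16, 4, 16, 16)
print(unfold_frames((16, 4, 16, 16), 8))  # (2, 4, 8, 16, 16)
```

Running the folded `(b f) c h w` batch through `nn.Conv2d` and unfolding afterwards is what lets these blocks reuse 2D convolutions for video.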


class Downsample3D(nn.Module):
Review comment (Member): Missing docstrings.

"""A 3D downsampling layer. Reshapes the input video tensor to a batch of frames, applies a strided convolution,
then converts it back to the original shape.

Parameters:
channels (`int`):
number of channels in the inputs and outputs.
out_channels (`int`, optional):
number of output channels. Defaults to `channels`.
"""

def __init__(self, channels, out_channels=None, padding=1):
super().__init__()
self.channels = channels
self.out_channels = out_channels or channels
self.conv = nn.Conv2d(self.channels, self.out_channels, 3, stride=2, padding=padding)

def forward(self, hidden_states):
video_length = hidden_states.shape[2]
# b c f h w -> (b f) c h w
hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))
hidden_states = hidden_states.flatten(0, 1)
# Conv
hidden_states = self.conv(hidden_states)
# (b f) c h w -> b c f h w (f=video_length)
hidden_states = hidden_states.reshape([-1, video_length, *hidden_states.shape[1:]])
hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))

return hidden_states


class ResnetBlock3D(nn.Module):
Review comment (Member): Missing docstrings.

Review comment (Contributor): What is the main difference to the existing resnet class here?

Reply (Contributor Author): ResnetBlock3D doesn't support AdaGroupNorm, only torch.nn.GroupNorm; the convolution used is InflatedConv3d instead of torch.nn.Conv2d. Didn't want to merge it with the 2D block, because then ResnetBlock2D would also have to take additional parameters. If you're ok with that, I can add some parameters to make it flexible.

Review comment (Member): I like this better as it encourages disentanglement between the modules and separation of concerns.
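
The reply above describes the "inflation" trick this PR relies on: a 2D operation is applied to video by running it independently on every frame, i.e. on the folded (b·f)-sized batch. A minimal plain-Python sketch of that equivalence — scalar "frames" stand in for H×W feature maps, and every name here is illustrative, not from the PR:

```python
# "Inflating" a 2D op: fold all frames of all clips into one flat batch,
# apply the per-frame operation, then regroup by clip.
def inflated_apply(video, op):
    # video: list of b clips, each a list of f frames
    flat = [frame for clip in video for frame in clip]  # b c f ... -> (b f) ...
    out = [op(frame) for frame in flat]                 # per-frame 2D op
    f = len(video[0])
    return [out[i * f:(i + 1) * f] for i in range(len(video))]  # back to b, f

video = [[1, 2, 3], [4, 5, 6]]  # 2 clips of 3 scalar "frames"
print(inflated_apply(video, lambda x: 2 * x))  # [[2, 4, 6], [8, 10, 12]]
```

Because the operation never mixes frames, applying it to the folded batch gives exactly the same result as looping over frames one by one.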

r"""
A ResNet block used specifically for video-like data.

Parameters:
in_channels (`int`): The number of channels in the input.
out_channels (`int`, *optional*, defaults to `None`):
The number of output channels for the first conv2d layer. If None, same as `in_channels`.
dropout (`float`, *optional*, defaults to `0.0`): The dropout probability to use.
temb_channels (`int`, *optional*, defaults to `512`): The number of channels in the timestep embedding.
groups (`int`, *optional*, defaults to `32`): The number of groups to use for the first normalization layer.
eps (`float`, *optional*, defaults to `1e-6`): The epsilon to use for the normalization.
non_linearity (`str`, *optional*, defaults to `"swish"`): The activation function to use.
output_scale_factor (`float`, *optional*, defaults to `1.0`): The scale factor to use for the output.
"""

def __init__(
self,
*,
in_channels,
Review comment (Contributor) on lines +868 to +869 — suggested change: reorder to `in_channels,` then `*,` (i.e., make `in_channels` positional). no?

out_channels=None,
dropout=0.0,
temb_channels=512,
groups=32,
eps=1e-6,
non_linearity="swish",
output_scale_factor=1.0,
):
super().__init__()
self.in_channels = in_channels
out_channels = in_channels if out_channels is None else out_channels
self.out_channels = out_channels
self.output_scale_factor = output_scale_factor

self.norm1 = torch.nn.GroupNorm(num_groups=groups, num_channels=in_channels, eps=eps, affine=True)
self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1)

self.time_emb_proj = torch.nn.Linear(temb_channels, out_channels)

self.norm2 = torch.nn.GroupNorm(num_groups=groups, num_channels=out_channels, eps=eps, affine=True)
self.dropout = torch.nn.Dropout(dropout)
self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)

self.nonlinearity = get_activation(non_linearity)

self.use_in_shortcut = self.in_channels != self.out_channels
self.conv_shortcut = None
if self.use_in_shortcut:
self.conv_shortcut = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)

def forward(self, input_tensor, temb):
hidden_states = input_tensor

hidden_states = self.norm1(hidden_states)
hidden_states = self.nonlinearity(hidden_states)

video_length = hidden_states.shape[2]
# b c f h w -> (b f) c h w
hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))
hidden_states = hidden_states.flatten(0, 1)
hidden_states = self.conv1(hidden_states)
# (b f) c h w -> b c f h w (f=video_length)
hidden_states = hidden_states.reshape([-1, video_length, *hidden_states.shape[1:]])
hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))

if temb is not None:
temb = self.time_emb_proj(self.nonlinearity(temb))[:, :, None, None, None]

hidden_states = hidden_states + temb

hidden_states = self.norm2(hidden_states)

hidden_states = self.nonlinearity(hidden_states)

hidden_states = self.dropout(hidden_states)

video_length = hidden_states.shape[2]
# b c f h w -> (b f) c h w
hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))
hidden_states = hidden_states.flatten(0, 1)
hidden_states = self.conv2(hidden_states)
# (b f) c h w -> b c f h w (f=video_length)
hidden_states = hidden_states.reshape([-1, video_length, *hidden_states.shape[1:]])
hidden_states = hidden_states.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))

if self.conv_shortcut is not None:
video_length = input_tensor.shape[2]
# b c f h w -> (b f) c h w
input_tensor = input_tensor.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))
input_tensor = input_tensor.flatten(0, 1)
input_tensor = self.conv_shortcut(input_tensor)
# (b f) c h w -> b c f h w (f=video_length)
input_tensor = input_tensor.reshape([-1, video_length, *input_tensor.shape[1:]])
input_tensor = input_tensor.movedim((0, 1, 2, 3, 4), (0, 2, 1, 3, 4))

output_tensor = (input_tensor + hidden_states) / self.output_scale_factor

return output_tensor


def rearrange_dims(tensor: torch.Tensor) -> torch.Tensor:
if len(tensor.shape) == 2:
return tensor[:, :, None]